Re: [PATCH 0/4] Add support to clear v1 free space cache for btrfs check

2016-05-16 Thread Qu Wenruo

Also to Ivan P, the initial reporter of the problem.

Now "btrfsck --clear-space-cache" should help you avoid the kernel warnings
about free space cache problems.


Thanks,
Qu

Qu Wenruo wrote on 2016/05/17 11:47 +0800:

The patchset can also be fetched from GitHub, in case the mailing list
blocks the last patch, which contains a binary file:
https://github.com/adam900710/btrfs-progs.git fsck_clear_cache

As described in the first patch, the kernel "clear_cache" mount option
can't rebuild all free space caches, due to the design of the free
space cache.
(Although I quite like the idea of delaying the load of the free space
cache, as it hugely reduces the mount time of a large fs.)

This leads users to report errors on the mailing list, complaining that
"clear_cache" doesn't get rid of the annoying kernel warnings about a
corrupted free space cache.

Since the kernel can't handle it, and users consider it an error, we'd
better treat it as one and fix it in btrfs check.

So this patchset adds the ability to clear the free space cache, and
adds a test case for it.

The clearing procedure is different from the kernel's: it removes all
free space cache inodes and their contents (with backrefs), and sets
the superblock's cache_generation to -1.
So there will be no free space cache at all, and the kernel will be
happy with that.

This patchset also enhances btrfs_previous_item() to use min_objectid
to return as early as possible.

Lu Fengqi (1):
  btrfs-progs: tests: add 020-bad-free-space-cache

Qu Wenruo (3):
  btrfs-progs: corrupt-block: Add ability to corrupt free space cache
file
  btrfs-progs: ctree: return earlier for btrfs_previous_item
  btrfs-progs: fsck: Add support to clear free space cache

 Documentation/btrfs-check.asciidoc |   8 ++
 btrfs-corrupt-block.c  | 124 -
 cmds-check.c   |  58 +-
 ctree.c|   2 +
 free-space-cache.c | 124 +
 free-space-cache.h |   4 +
 .../020-bad-free-space-cache/default_case.raw.xz   | Bin 0 -> 164068 bytes
 tests/fsck-tests/020-bad-free-space-cache/test.sh  |  16 +++
 8 files changed, 334 insertions(+), 2 deletions(-)
 create mode 100644 
tests/fsck-tests/020-bad-free-space-cache/default_case.raw.xz
 create mode 100755 tests/fsck-tests/020-bad-free-space-cache/test.sh




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v1.1 3/4] btrfs-progs: fsck: Add support to clear free space cache

2016-05-16 Thread Qu Wenruo
Add a new option "--clear-space-cache" for btrfs check.

Unlike what many may assume, the kernel "clear_cache" mount option will
only rebuild *SOME* of the free space caches, not *ALL* of them.

More specifically, it will only rebuild the free space cache for block
groups whose cache was read during the mount with "clear_cache".

And since the kernel does not read the free space cache at mount time,
but only when it needs to allocate space from a block group, a bad
free space cache can stay untouched for a long time.

This kernel design is quite good, as it dramatically reduces the mount
time for a large fs, so I prefer not to change it.

Instead, since many users consider the kernel's free space warnings an
error, we should handle them like one, which means using btrfsck to fix
them, even though they are harmless most of the time.

This patch adds "--clear-space-cache" to clear all free space caches
and set cache_generation to -1.

Reported-by: Ivan P 
Signed-off-by: Qu Wenruo 
---
v1.1:
   Fix a bug where, if the free space cache inode was not found, we
   still removed one item from the tree root.
---
 Documentation/btrfs-check.asciidoc |   8 +++
 cmds-check.c   |  58 -
 free-space-cache.c | 126 +
 free-space-cache.h |   4 ++
 4 files changed, 195 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-check.asciidoc 
b/Documentation/btrfs-check.asciidoc
index 74a2ad2..ea25582 100644
--- a/Documentation/btrfs-check.asciidoc
+++ b/Documentation/btrfs-check.asciidoc
@@ -78,6 +78,14 @@ respective superblock offset is within the device size
 This can be used to use a different starting point if some of the primary
 superblock is damaged.
 
+--clear-space-cache::
+clear all free space caches
++
+NOTE:
+Kernel mount option 'clear_cache' is only designed to rebuild the free space
+caches that are modified during the lifetime of that mount.
+It doesn't rebuild all free space caches, nor clear them out.
+
 DANGEROUS OPTIONS
 -
 
diff --git a/cmds-check.c b/cmds-check.c
index ec0bbfd..1f6aefb 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -9524,10 +9524,41 @@ const char * const cmd_check_usage[] = {
"print subvolume extents and sharing state",
"-r|--tree-root  use the given bytenr for the tree root",
"--chunk-root    use the given bytenr for the chunk tree root",
+   "--clear-space-cache clear all free space cache(v1)",
"-p|--progress   indicate progress",
NULL
 };
 
+static int clear_free_space_cache(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_trans_handle *trans;
+   struct btrfs_block_group_cache *bg_cache;
+   u64 current = 0;
+   int ret = 0;
+
+   /* Clear all free space cache inodes and their extent data */
+   while (1) {
+   bg_cache = btrfs_lookup_first_block_group(fs_info, current);
+   if (!bg_cache)
+   break;
+   ret = btrfs_clear_free_space_cache(fs_info, bg_cache);
+   if (ret < 0)
+   return ret;
+   current = bg_cache->key.objectid + bg_cache->key.offset;
+   }
+
+   /* Don't forget to set cache_generation to -1 */
+   trans = btrfs_start_transaction(fs_info->tree_root, 0);
+   if (IS_ERR(trans)) {
+   error("failed to update super block cache generation");
+   return PTR_ERR(trans);
+   }
+   btrfs_set_super_cache_generation(fs_info->super_copy, (u64)-1);
+   btrfs_commit_transaction(trans, fs_info->tree_root);
+   
+   return ret;
+}
+
 int cmd_check(int argc, char **argv)
 {
struct cache_tree root_cache;
@@ -9543,13 +9574,15 @@ int cmd_check(int argc, char **argv)
int init_csum_tree = 0;
int readonly = 0;
int qgroup_report = 0;
+   int clear_space_cache = 0;
enum btrfs_open_ctree_flags ctree_flags = OPEN_CTREE_EXCLUSIVE;
 
while(1) {
int c;
enum { GETOPT_VAL_REPAIR = 257, GETOPT_VAL_INIT_CSUM,
GETOPT_VAL_INIT_EXTENT, GETOPT_VAL_CHECK_CSUM,
-   GETOPT_VAL_READONLY, GETOPT_VAL_CHUNK_TREE };
+   GETOPT_VAL_READONLY, GETOPT_VAL_CHUNK_TREE, 
+   GETOPT_VAL_CLEAR_SPACE_CACHE};
static const struct option long_options[] = {
{ "super", required_argument, NULL, 's' },
{ "repair", no_argument, NULL, GETOPT_VAL_REPAIR },
@@ -9566,6 +9599,8 @@ int cmd_check(int argc, char **argv)
{ "tree-root", required_argument, NULL, 'r' },
{ "chunk-root", required_argument, NULL,
GETOPT_VAL_CHUNK_TREE },
+   { "clear-space-cache", no_argument, NULL,
+  

[PATCH 2/4] btrfs-progs: ctree: return earlier for btrfs_previous_item

2016-05-16 Thread Qu Wenruo
btrfs_previous_item() has a parameter to specify the minimum objectid
at which the search should stop.

But surprisingly it doesn't use it at all.
Although the result is still correct, the search could take ages on a
large tree, so return earlier.

Signed-off-by: Qu Wenruo 
---
 ctree.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/ctree.c b/ctree.c
index 079696e..3a9f417 100644
--- a/ctree.c
+++ b/ctree.c
@@ -2894,6 +2894,8 @@ int btrfs_previous_item(struct btrfs_root *root,
btrfs_item_key_to_cpu(leaf, _key, path->slots[0]);
if (found_key.type == type)
return 0;
+   if (found_key.objectid < min_objectid)
+   break;
}
return 1;
 }
-- 
2.8.2





[PATCH 4/4] btrfs-progs: tests: add 020-bad-free-space-cache

2016-05-16 Thread Qu Wenruo
From: Lu Fengqi 

To test whether fsck can detect and clear a free space cache error.

Signed-off-by: Lu Fengqi 
---
 .../020-bad-free-space-cache/default_case.raw.xz   | Bin 0 -> 164068 bytes
 tests/fsck-tests/020-bad-free-space-cache/test.sh  |  16 
 2 files changed, 16 insertions(+)
 create mode 100644 
tests/fsck-tests/020-bad-free-space-cache/default_case.raw.xz
 create mode 100755 tests/fsck-tests/020-bad-free-space-cache/test.sh

diff --git a/tests/fsck-tests/020-bad-free-space-cache/default_case.raw.xz 
b/tests/fsck-tests/020-bad-free-space-cache/default_case.raw.xz
new file mode 100644
index 
..5e08de426c5cc5faa9ccd430432b9c37f4e072ff
GIT binary patch
literal 164068

[PATCH 0/4] Add support to clear v1 free space cache for btrfs check

2016-05-16 Thread Qu Wenruo
The patchset can also be fetched from GitHub, in case the mailing list
blocks the last patch, which contains a binary file:
https://github.com/adam900710/btrfs-progs.git fsck_clear_cache

As described in the first patch, the kernel "clear_cache" mount option
can't rebuild all free space caches, due to the design of the free
space cache.
(Although I quite like the idea of delaying the load of the free space
cache, as it hugely reduces the mount time of a large fs.)

This leads users to report errors on the mailing list, complaining that
"clear_cache" doesn't get rid of the annoying kernel warnings about a
corrupted free space cache.

Since the kernel can't handle it, and users consider it an error, we'd
better treat it as one and fix it in btrfs check.

So this patchset adds the ability to clear the free space cache, and
adds a test case for it.

The clearing procedure is different from the kernel's: it removes all
free space cache inodes and their contents (with backrefs), and sets
the superblock's cache_generation to -1.
So there will be no free space cache at all, and the kernel will be
happy with that.

This patchset also enhances btrfs_previous_item() to use min_objectid
to return as early as possible.

Lu Fengqi (1):
  btrfs-progs: tests: add 020-bad-free-space-cache

Qu Wenruo (3):
  btrfs-progs: corrupt-block: Add ability to corrupt free space cache
file
  btrfs-progs: ctree: return earlier for btrfs_previous_item
  btrfs-progs: fsck: Add support to clear free space cache

 Documentation/btrfs-check.asciidoc |   8 ++
 btrfs-corrupt-block.c  | 124 -
 cmds-check.c   |  58 +-
 ctree.c|   2 +
 free-space-cache.c | 124 +
 free-space-cache.h |   4 +
 .../020-bad-free-space-cache/default_case.raw.xz   | Bin 0 -> 164068 bytes
 tests/fsck-tests/020-bad-free-space-cache/test.sh  |  16 +++
 8 files changed, 334 insertions(+), 2 deletions(-)
 create mode 100644 
tests/fsck-tests/020-bad-free-space-cache/default_case.raw.xz
 create mode 100755 tests/fsck-tests/020-bad-free-space-cache/test.sh

-- 
2.8.2





[PATCH 1/4] btrfs-progs: corrupt-block: Add ability to corrupt free space cache file

2016-05-16 Thread Qu Wenruo
A user on the mailing list reported a free space cache error that the
clear_cache mount option fails to fix.

To reproduce the problem, add support for corrupting the free space
cache file (the old free space cache implementation, space_cache=v1).

With this patch, we can reproduce the problem quite easily.
It is not suitable for a btrfs-progs or xfstests test case though, as
btrfsck won't treat it as an error.

==
$ fallocate  test.img -l 2G
$ mkfs.btrfs test.img
$ sudo mount test.img  /mnt/btrfs/
$ sudo xfs_io  -f -c "pwrite 0 16M" /mnt/btrfs/tmp
$ sudo umount /mnt/btrfs
$ ./btrfs-corrupt-block -s content -l 136708096 ../test.img
  Here 136708096 is the second data chunk bytenr.
  (The first data chunk is the 8M one from mkfs, which doesn't have free
   space cache)

$ ./btrfsck test.img
Checking filesystem on ../test.img
UUID: 9542051b-8150-42e5-a9e6-95829656485c
checking extents
checking free space cache
btrfs: csum mismatch on free space cache  <<< Found space cache problem
failed to load free space cache for block group 136708096
checking fs roots
checking csums
checking root refs
found 16941058 bytes used err is 0
total csum bytes: 16384
total tree bytes: 163840
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 139567
file data blocks allocated: 16908288
 referenced 16908288

$ sudo mount -o clear_cache test.img /mnt/btrfs
$ sudo umount /mnt/btrfs
$ ./btrfsck test.img
Checking filesystem on ../test.img
UUID: 9542051b-8150-42e5-a9e6-95829656485c
checking extents
checking free space cache
btrfs: csum mismatch on free space cache <<< Problem not fixed by kernel
failed to load free space cache for block group 136708096
checking fs roots
checking csums
checking root refs
found 16941058 bytes used err is 0
total csum bytes: 16384
total tree bytes: 163840
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 139567
file data blocks allocated: 16908288
 referenced 16908288
==

Btrfs-progs fix will follow soon.

Reported-by: Ivan P 
Signed-off-by: Qu Wenruo 
---
 btrfs-corrupt-block.c | 124 +-
 1 file changed, 123 insertions(+), 1 deletion(-)

diff --git a/btrfs-corrupt-block.c b/btrfs-corrupt-block.c
index d331f96..eb27265 100644
--- a/btrfs-corrupt-block.c
+++ b/btrfs-corrupt-block.c
@@ -115,6 +115,9 @@ static void print_usage(int ret)
fprintf(stderr, "\t-C Delete a csum for the specified bytenr.  When "
"used with -b it'll delete that many bytes, otherwise it's "
"just sectorsize\n");
+   fprintf(stderr, "\t-s \n");
+   fprintf(stderr, "\t   Corrupt free space cache file(space_cache v1), must also specify -l for blockgroup bytenr\n");
+   fprintf(stderr, "\tcan be 'zero_gen','rand_gen' or 'content'\n");
exit(ret);
 }
 
@@ -1012,6 +1015,100 @@ out:
return ret;
 
 }
+
+u64 rand_u64(void)
+{
+   u64 ret = 0;
+   int n;
+
+   for (n = 0; n < sizeof(ret) / sizeof(RAND_MAX); n++)
+   ret += (u64)rand() << (n * sizeof(RAND_MAX) * 8);
+   return ret;
+}
+
+#define CORRUPT_SC_FILE_ZERO_GEN   1
+#define CORRUPT_SC_FILE_RAND_GEN   2
+#define CORRUPT_SC_FILE_CONTENT3
+int corrupt_space_cache_file(struct btrfs_fs_info *fs_info, u64 block_group,
+int method)
+{
+   struct btrfs_root *tree_root = fs_info->tree_root;
+   struct btrfs_path path;
+   struct btrfs_key key;
+   struct btrfs_disk_key location;
+   struct btrfs_free_space_header *sc_header;
+   struct btrfs_file_extent_item *fi;
+   struct extent_buffer *node;
+   struct extent_buffer *corrupted_node;
+   u64 disk_bytenr;
+   int slot;
+   int ret;
+
+   key.objectid = BTRFS_FREE_SPACE_OBJECTID;
+   key.type = 0;
+   key.offset = block_group;
+
+   btrfs_init_path();
+
+   /* Don't start a transaction, as that would make the generation differ */
+   ret = btrfs_search_slot(NULL, tree_root, , , 0, 0);
+   if (ret) {
+   error("failed to find free space cache file for block group %llu, ret: %d\n",
+ block_group, ret);
+   goto out;
+   }
+   slot = path.slots[0];
+   node = path.nodes[0];
+   sc_header = btrfs_item_ptr(node, slot, struct btrfs_free_space_header);
+   if (method == CORRUPT_SC_FILE_ZERO_GEN ||
+   method == CORRUPT_SC_FILE_RAND_GEN) {
+   u64 dst_gen;
+
+   if (method == CORRUPT_SC_FILE_ZERO_GEN)
+   dst_gen = 0;
+   else
+   dst_gen = rand_u64();
+
+   btrfs_set_free_space_generation(node, sc_header, dst_gen);
+   /* Manually re-calc csum and write to disk */
+   ret = write_tree_block(NULL, tree_root, node);
+   goto out;
+   }
+
+   btrfs_free_space_key(node, sc_header, );
+   btrfs_disk_key_to_cpu(, );

[PATCH 3/4] btrfs-progs: fsck: Add support to clear free space cache

2016-05-16 Thread Qu Wenruo
Add a new option "--clear-space-cache" for btrfs check.

Unlike what many may assume, the kernel "clear_cache" mount option will
only rebuild *SOME* of the free space caches, not *ALL* of them.

More specifically, it will only rebuild the free space cache for block
groups whose cache was read during the mount with "clear_cache".

And since the kernel does not read the free space cache at mount time,
but only when it needs to allocate space from a block group, a bad
free space cache can stay untouched for a long time.

This kernel design is quite good, as it dramatically reduces the mount
time for a large fs, so I prefer not to change it.

Instead, since many users consider the kernel's free space warnings an
error, we should handle them like one, which means using btrfsck to fix
them, even though they are harmless most of the time.

This patch adds "--clear-space-cache" to clear all free space caches
and set cache_generation to -1.

Reported-by: Ivan P 
Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-check.asciidoc |   8 +++
 cmds-check.c   |  58 -
 free-space-cache.c | 124 +
 free-space-cache.h |   4 ++
 4 files changed, 193 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-check.asciidoc 
b/Documentation/btrfs-check.asciidoc
index 74a2ad2..ea25582 100644
--- a/Documentation/btrfs-check.asciidoc
+++ b/Documentation/btrfs-check.asciidoc
@@ -78,6 +78,14 @@ respective superblock offset is within the device size
 This can be used to use a different starting point if some of the primary
 superblock is damaged.
 
+--clear-space-cache::
+clear all free space caches
++
+NOTE:
+Kernel mount option 'clear_cache' is only designed to rebuild the free space
+caches that are modified during the lifetime of that mount.
+It doesn't rebuild all free space caches, nor clear them out.
+
 DANGEROUS OPTIONS
 -
 
diff --git a/cmds-check.c b/cmds-check.c
index ec0bbfd..1f6aefb 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -9524,10 +9524,41 @@ const char * const cmd_check_usage[] = {
"print subvolume extents and sharing state",
"-r|--tree-root  use the given bytenr for the tree root",
"--chunk-root    use the given bytenr for the chunk tree root",
+   "--clear-space-cache clear all free space cache(v1)",
"-p|--progress   indicate progress",
NULL
 };
 
+static int clear_free_space_cache(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_trans_handle *trans;
+   struct btrfs_block_group_cache *bg_cache;
+   u64 current = 0;
+   int ret = 0;
+
+   /* Clear all free space cache inodes and their extent data */
+   while (1) {
+   bg_cache = btrfs_lookup_first_block_group(fs_info, current);
+   if (!bg_cache)
+   break;
+   ret = btrfs_clear_free_space_cache(fs_info, bg_cache);
+   if (ret < 0)
+   return ret;
+   current = bg_cache->key.objectid + bg_cache->key.offset;
+   }
+
+   /* Don't forget to set cache_generation to -1 */
+   trans = btrfs_start_transaction(fs_info->tree_root, 0);
+   if (IS_ERR(trans)) {
+   error("failed to update super block cache generation");
+   return PTR_ERR(trans);
+   }
+   btrfs_set_super_cache_generation(fs_info->super_copy, (u64)-1);
+   btrfs_commit_transaction(trans, fs_info->tree_root);
+   
+   return ret;
+}
+
 int cmd_check(int argc, char **argv)
 {
struct cache_tree root_cache;
@@ -9543,13 +9574,15 @@ int cmd_check(int argc, char **argv)
int init_csum_tree = 0;
int readonly = 0;
int qgroup_report = 0;
+   int clear_space_cache = 0;
enum btrfs_open_ctree_flags ctree_flags = OPEN_CTREE_EXCLUSIVE;
 
while(1) {
int c;
enum { GETOPT_VAL_REPAIR = 257, GETOPT_VAL_INIT_CSUM,
GETOPT_VAL_INIT_EXTENT, GETOPT_VAL_CHECK_CSUM,
-   GETOPT_VAL_READONLY, GETOPT_VAL_CHUNK_TREE };
+   GETOPT_VAL_READONLY, GETOPT_VAL_CHUNK_TREE, 
+   GETOPT_VAL_CLEAR_SPACE_CACHE};
static const struct option long_options[] = {
{ "super", required_argument, NULL, 's' },
{ "repair", no_argument, NULL, GETOPT_VAL_REPAIR },
@@ -9566,6 +9599,8 @@ int cmd_check(int argc, char **argv)
{ "tree-root", required_argument, NULL, 'r' },
{ "chunk-root", required_argument, NULL,
GETOPT_VAL_CHUNK_TREE },
+   { "clear-space-cache", no_argument, NULL,
+   GETOPT_VAL_CLEAR_SPACE_CACHE},
{ "progress", no_argument, NULL, 'p' },
  

Re: BTRFS Data at Rest File Corruption

2016-05-16 Thread Chris Murphy
On Mon, May 16, 2016 at 5:44 PM, Richard A. Lochner  wrote:
> Chris,
>
> It has actually happened to me three times that I know of in ~7mos.,
> but your point about the "larger footprint" for data corruption is a
> good one.  No doubt I have silently experienced that too.

I dunno, three is a lot to have the exact same corruption only in
memory, then written out into two copies with valid node checksums, and
yet not have other problems, like a node item, or uuid, or xattr, or
any number of other item or object types, all of which get checksummed.
I suppose if the file system contains large files, the % of metadata
that's csums could be the 2nd largest footprint. But still.

Three times in 7 months, if it's really the same vector, is just short
of almost reproducible. Ha. It seems like if you merely balanced this
file system a few times, you'd eventually stumble on this. And if
that's true, then it's time for debug options and see if it can be
caught in action, and whether there's a hardware or software
explanation for it.


> And, as you
> suggest, there is no way to prevent those errors.  If the memory to be
> written to disk gets corrupted before its checksum is calculated, the
> data will be silently corrupted, period.

Well, no way in the present design, maybe.



>
> Clearly, I won't rely on this machine to produce any data directly that
> I would consider important at this point.
>
> One odd thing to me is that if this is really due to undetected memory
> errors, I'd think this system would crash fairly often due to detected
> "parity errors."  This system rarely crashes.  It often runs for
> several months without an indication of problems.

I think you'd have other problems. Only data csums are being corrupted
after they're read in, but before the node csum is computed? Three
times? Pretty wonky.




-- 
Chris Murphy


Scrub status regression (repost kinda)

2016-05-16 Thread al
Hi All,

I'm still getting this regression on scrub. Scrub status no longer reports
the progress of the scrub. It always used to.

Anyone else experiencing the same regression?

The only quirky part of my setup that may throw btrfs off is that I create
and mount the first subvol as the root of the install. To repeat, it is only
in the last ~3 months that it has stopped reporting.

The scrub itself proceeds normally and using '-B' returns the usual and
complete output.

# btrfs scrub start /   
scrub started on /, fsid fdd6a335-6edf-4a74-a909-03c8102cc8f5 (pid=4111)



# btrfs scrub status /  
scrub status for fdd6a335-6edf-4a74-a909-03c8102cc8f5
no stats available
total bytes scrubbed: 0.00B with 0 errors




Re: [PULL] Btrfs for 4.7

2016-05-16 Thread Chris Mason
On Mon, May 16, 2016 at 04:14:23PM +0200, David Sterba wrote:
> Hi,
> 
> please queue the following branch to 4.7.
> 
> New features or user-visible changes:
> * device delete by id, a v2 ioctl for device deletion (this was held back from
>   4.6 pull due to possibly related crashes that haven't appeared for a long
>   time)
> * DUP allowed on multiple-device filesystem (help in case we want to go to
>   SINGLE from RAID1/.. but do not want to lose all redundancy while 
> converting)
> * GETFLAGS/SETFLAGS/GETVERSION ioctls work on 32bit
> 
> Developer-visible changes:
> * definitions that are shared with userspace are moved to the uapi directory
> 
> Other:
> * assorted bugfixes and cleanups
> 
> The branch is merged from several topic branches, grouped by functionality
> or purpose. The patches have been in for-next, most of them for a few weeks at
> least, tested together with Filipe's branch.

Great, thanks Dave!  I've got things merged up and I'm running more
tests.

-chris


Re: BTRFS Data at Rest File Corruption

2016-05-16 Thread Richard A. Lochner
Chris,

It has actually happened to me three times that I know of in ~7mos.,
but your point about the "larger footprint" for data corruption is a
good one.  No doubt I have silently experienced that too.  And, as you
suggest, there is no way to prevent those errors.  If the memory to be
written to disk gets corrupted before its checksum is calculated, the
data will be silently corrupted, period.

Clearly, I won't rely on this machine to produce any data directly that
I would consider important at this point.

One odd thing to me is that if this is really due to undetected memory
errors, I'd think this system would crash fairly often due to detected
"parity errors."  This system rarely crashes.  It often runs for
several months without an indication of problems.  

Rick Lochner


On Mon, 2016-05-16 at 16:43 -0600, Chris Murphy wrote:
> On Mon, May 16, 2016 at 5:33 AM, Austin S. Hemmelgarn
>  wrote:
> 
> > 
> > 
> > I would think this would be perfectly possible if some other file
> > that had a
> > checksum in that node changed, thus forcing the node's checksum to
> > be
> > updated.  Theoretical sequence of events:
> > 1. Some file which has a checksum in node A gets written to.
> > 2. Node A is loaded into memory to update the checksum.
> > 3. The new checksum for the changed extent in the file gets updated
> > in the
> > in-memory copy of node A.
> > 4. Node A has it's own checksum recomputed based on the new data,
> > and then
> > gets saved to disk.
> > If something happened after 2 but before 4 that caused one of the
> > other
> > checksums to go bad, then the checksum computed in 4 will have been
> > with the
> > corrupted data.
> > 
> I'm pretty sure Qu had a suggestion that would mitigate this sort of
> problem, where there'd be a CRC32C checksum for each data extent (?)
> something like that anyway. There's enough room to stuff in more than
> just a checksum per 4096 byte block. That way there's three checks,
> and thus there's a way to break a tie.
> 
> But this has now happened to Richard twice. What are the chances of
> this manifesting exactly the same way a second time? If the chance of
> corruption is equal, I'd think the much much larger footprint for
> in-memory corruption is data itself. Problem is, if the corruption
> happens before the checksum is computed, the checksum would say the
> data is valid. So the only way to test this would be passing all file
> from this volume and a reference volume through a hash function and
> comparing hashes, e.g. rsync -c option.
> 


Re: corrupt node with extent csum item results in bogus data being read

2016-05-16 Thread Chris Murphy
I realized this could be a much bigger problem for the single-copy
metadata on SSD case, rather than the example using raid1, where it's
unlikely for both copies of the node to be corrupt. So I filed a bug.

https://bugzilla.kernel.org/show_bug.cgi?id=118321


Re: BTRFS Data at Rest File Corruption

2016-05-16 Thread Chris Murphy
On Mon, May 16, 2016 at 5:33 AM, Austin S. Hemmelgarn
 wrote:

>
> I would think this would be perfectly possible if some other file that had a
> checksum in that node changed, thus forcing the node's checksum to be
> updated.  Theoretical sequence of events:
> 1. Some file which has a checksum in node A gets written to.
> 2. Node A is loaded into memory to update the checksum.
> 3. The new checksum for the changed extent in the file gets updated in the
> in-memory copy of node A.
> 4. Node A has it's own checksum recomputed based on the new data, and then
> gets saved to disk.
> If something happened after 2 but before 4 that caused one of the other
> checksums to go bad, then the checksum computed in 4 will have been with the
> corrupted data.
>

I'm pretty sure Qu had a suggestion that would mitigate this sort of
problem, where there'd be a CRC32C checksum for each data extent (?)
something like that anyway. There's enough room to stuff in more than
just a checksum per 4096 byte block. That way there's three checks,
and thus there's a way to break a tie.

But this has now happened to Richard twice. What are the chances of
this manifesting exactly the same way a second time? If the chance of
corruption is equal, I'd think the much much larger footprint for
in-memory corruption is data itself. Problem is, if the corruption
happens before the checksum is computed, the checksum would say the
data is valid. So the only way to test this would be passing all file
from this volume and a reference volume through a hash function and
comparing hashes, e.g. rsync -c option.

-- 
Chris Murphy


Re: BTRFS Data at Rest File Corruption

2016-05-16 Thread Richard A. Lochner
Chris/Austin,

Thank you both for your help.

The sequence of events described by Austin is the only one that seems
plausible, given what I have seen in the data (other than an outright
bug, which I think extremely unlikely).

I will be moving these drives soon to a new system with ECC memory.  I
will definitely let you both know if I encounter this problem again
after that.  I do not expect to.

If I were really adventurous, I would modify the code to attempt to
detect this, and run the patched version on my system to see if it is
possible to detect (and maybe even correct) it as it happens.
Unfortunately, that does not appear to be a trivial exercise.

Rick Lochner

On Mon, 2016-05-16 at 07:33 -0400, Austin S. Hemmelgarn wrote:
> On 2016-05-16 02:07, Chris Murphy wrote:
> > 
> > Current hypothesis
> >  "I suspected, and I still suspect that the error occurred upon a
> > metadata update that corrupted the checksum for the file, probably
> > due
> > to silent memory corruption.  If the checksum was silently
> > corrupted,
> > it would be simply written to both drives causing this type of
> > error."
> > 
> > A metadata update alone will not change the data checksums.
> > 
> > But let's ignore that. If there's corrupt extent csum in a node
> > that
> > itself has a valid csum, this is functionally identical to e.g.
> > nerfing 100 bytes of a file's extent data (both copies,
> > identically).
> > The fs doesn't know the difference. All it knows is the node csum
> > is
> > valid, therefore the data extent csum is valid, and that's why it
> > assumes the data is wrong and hence you get an I/O error. And I can
> > reproduce most of your results by nerfing file data.
> > 
> > The entire dmesg for scrub looks like this:
> > 
> > 
> > May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-
> > 6):
> > checksum error at logical 5566889984 on dev /dev/dm-6, sector
> > 8540160,
> > root 5, inode 258, offset 0, length 4096, links 1 (path:
> > openSUSE-Tumbleweed-NET-x86_64-Current.iso)
> > May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
> > bdev /dev/dm-6 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> > May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
> > unable to fixup (regular) error at logical 5566889984 on dev
> > /dev/dm-6
> > May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-
> > 6):
> > checksum error at logical 5566889984 on dev /dev/mapper/VG-b1,
> > sector
> > 8579072, root 5, inode 258, offset 0, length 4096, links 1 (path:
> > openSUSE-Tumbleweed-NET-x86_64-Current.iso)
> > May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
> > bdev /dev/mapper/VG-b1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> > May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
> > unable to fixup (regular) error at logical 5566889984 on dev
> > /dev/mapper/VG-b1
> > 
> > And the entire dmesg for running sha256sum on the file is
> > 
> > May 15 23:33:41 f23s.localdomain kernel: __readpage_endio_check: 22
> > callbacks suppressed
> > May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-
> > 6):
> > csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
> > May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-
> > 6):
> > csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
> > May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-
> > 6):
> > csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
> > May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-
> > 6):
> > csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
> > May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-
> > 6):
> > csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
> > 
> > 
> > And I do get an i/o error for sha256sum and no hash is computed.
> > 
> > But there's two important differences:
> > 1. I have two unable to fixup messages, one for each device, at the
> > exact same time.
> > 2. I altered both copies of extent data.
> > 
> > It's a mystery to me how your file data has not changed, but
> > somehow
> > the extent csum was changed but also the node csum was recomputed
> > correctly. That's a bit odd.
> I would think this would be perfectly possible if some other file
> that 
> had a checksum in that node changed, thus forcing the node's checksum
> to 
> be updated.  Theoretical sequence of events:
> 1. Some file which has a checksum in node A gets written to.
> 2. Node A is loaded into memory to update the checksum.
> 3. The new checksum for the changed extent in the file gets updated
> in 
> the in-memory copy of node A.
> 4. Node A has its own checksum recomputed based on the new data, and
> then gets saved to disk.
> If something happened after 2 but before 4 that caused one of the
> other 
> checksums to go bad, then the checksum computed in 4 will have been
> with 
> the corrupted data.
> 

[PATCH] Btrfs: fix handling of faults from btrfs_copy_from_user

2016-05-16 Thread Chris Mason
Hi everyone,

I've tried to cc most of the recent reports of this warning.  It was
pretty hard to track down, but I've got a test program for it now.  My
goal is to change xfs_io to add the madvise loop under
--do-horrible-things, instead of adding yet another special doio.c
function to xfstests.

I've run this through both my test case, and Dave's trinity code.

And now for the patch:

When btrfs_copy_from_user isn't able to copy all of the pages, we need
to adjust our accounting to reflect the work that was actually done.

Commit 2e78c927d79 changed around the decisions a little and we ended up
skipping the accounting adjustments some of the time.  This commit makes
sure that when we don't copy anything at all, we still hop into
the adjustments, and switches to release_bytes instead of write_bytes,
since write_bytes isn't aligned.

The accounting errors led to warnings during btrfs_destroy_inode:

[   70.847532] WARNING: CPU: 10 PID: 514 at fs/btrfs/inode.c:9350 
btrfs_destroy_inode+0x2b3/0x2c0
[   70.847536] Modules linked in: i2c_piix4 virtio_net i2c_core input_leds 
button led_class serio_raw acpi_cpufreq sch_fq_codel autofs4 virtio_blk
[   70.847538] CPU: 10 PID: 514 Comm: umount Tainted: GW 
4.6.0-rc6_00062_g2997da1-dirty #23
[   70.847539] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.9.0-1.fc24 04/01/2014
[   70.847542]   880ff5cafab8 8149d5e9 
0202
[   70.847543]     
880ff5cafb08
[   70.847547]  8107bdfd 880ff5cafaf8 24868120013d 
880ff5cafb28
[   70.847547] Call Trace:
[   70.847550]  [] dump_stack+0x51/0x78
[   70.847551]  [] __warn+0xfd/0x120
[   70.847553]  [] warn_slowpath_null+0x1d/0x20
[   70.847555]  [] btrfs_destroy_inode+0x2b3/0x2c0
[   70.847556]  [] ? __destroy_inode+0x71/0x140
[   70.847558]  [] destroy_inode+0x43/0x70
[   70.847559]  [] ? wake_up_bit+0x2f/0x40
[   70.847560]  [] evict+0x148/0x1d0
[   70.847562]  [] ? start_transaction+0x3de/0x460
[   70.847564]  [] dispose_list+0x59/0x80
[   70.847565]  [] evict_inodes+0x180/0x190
[   70.847566]  [] ? __sync_filesystem+0x3f/0x50
[   70.847568]  [] generic_shutdown_super+0x48/0x100
[   70.847569]  [] ? woken_wake_function+0x20/0x20
[   70.847571]  [] kill_anon_super+0x16/0x30
[   70.847573]  [] btrfs_kill_super+0x1e/0x130
[   70.847574]  [] deactivate_locked_super+0x4e/0x90
[   70.847576]  [] deactivate_super+0x51/0x70
[   70.847577]  [] cleanup_mnt+0x3f/0x80
[   70.847579]  [] __cleanup_mnt+0x12/0x20
[   70.847581]  [] task_work_run+0x68/0xa0
[   70.847582]  [] exit_to_usermode_loop+0xd6/0xe0
[   70.847583]  [] do_syscall_64+0xbd/0x170
[   70.847586]  [] entry_SYSCALL64_slow_path+0x25/0x25

This is the test program I used to force short returns from
btrfs_copy_from_user

void *dontneed(void *arg)
{
char *p = arg;
int ret;

while(1) {
ret = madvise(p, BUFSIZE/4, MADV_DONTNEED);
if (ret) {
perror("madvise");
exit(1);
}
}
}

int main(int ac, char **av) {
int ret;
int fd;
char *filename;
unsigned long offset;
char *buf;
int i;
pthread_t tid;

if (ac != 2) {
fprintf(stderr, "usage: dammitdave filename\n");
exit(1);
}

buf = mmap(NULL, BUFSIZE, PROT_READ|PROT_WRITE,
   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
if (buf == MAP_FAILED) {
perror("mmap");
exit(1);
}
memset(buf, 'a', BUFSIZE);
filename = av[1];

ret = pthread_create(&tid, NULL, dontneed, buf);
if (ret) {
fprintf(stderr, "error %d from pthread_create\n", ret);
exit(1);
}

ret = pthread_detach(tid);
if (ret) {
fprintf(stderr, "pthread detach failed %d\n", ret);
exit(1);
}

while (1) {
fd = open(filename, O_RDWR | O_CREAT, 0600);
if (fd < 0) {
perror("open");
exit(1);
}

for (i = 0; i < ROUNDS; i++) {
int this_write = BUFSIZE;

offset = rand() % MAXSIZE;
ret = pwrite(fd, buf, this_write, offset);
if (ret < 0) {
perror("pwrite");
exit(1);
} else if (ret != this_write) {
fprintf(stderr, "short write to %s offset %lu ret %d\n",
filename, offset, ret);
exit(1);
}
if (i == ROUNDS - 1) {
ret = sync_file_range(fd, offset, 4096,

Re: [PATCH 3/7] Btrfs: check if extent buffer is aligned to sectorsize

2016-05-16 Thread Liu Bo
On Sat, May 14, 2016 at 06:30:52PM +0800, Qu Wenruo wrote:
> Hi Liu,
> 
> Thanks for your patch first.
> 
> On 05/14/2016 08:06 AM, Liu Bo wrote:
> > Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer
> > via alloc_extent_buffer().  An unaligned eb can have more pages than it
> > should have, which ends up extent buffer's leak or some corrupted content
> > in extent buffer.
> > 
> > This adds a warning to let us quickly know what was happening.
> > 
> > Signed-off-by: Liu Bo 
> > ---
> >  fs/btrfs/extent_io.c | 4 
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index d247fc0..e601e0f 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -4868,6 +4868,10 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> > int uptodate = 1;
> > int ret;
> > 
> > +   WARN_ONCE(!IS_ALIGNED(start, fs_info->tree_root->sectorsize),
> > + KERN_WARNING "eb->start(%llu) is not aligned to root->sectorsize(%u)\n",
> > + start, fs_info->tree_root->sectorsize);
> > +
> 
> IMHO this is a quite big problem, as almost all other things rely on the
> assumption that extent buffers are at least sectorsize aligned.
> 

It won't cause too much trouble, as reading the eb's pages can prevent
btrfs from using this eb.

> What about warning and returning NULL? WARN_ONCE() alone won't inform the
> user quickly enough.

I'm OK with warning, but I just realized that a warning doesn't show which
filesystem has problems, so btrfs_crit and -EINVAL are preferable.

> 
> BTW, after a quick glance into __alloc_extent_buffer(), it seems that we
> didn't check the return pointer of kmem_cache_zalloc(). Since you're fixing
> things around that code, would you mind fixing it too?

It's not necessary to do that since it's using __GFP_NOFAIL.

Thanks,

-liubo

> 
> Thanks,
> Qu
> 
> > eb = find_extent_buffer(fs_info, start);
> > if (eb)
> > return eb;
> > 


Re: [PATCH 5/7] Btrfs: replace BUG_ON with WARN in merge_bio

2016-05-16 Thread Liu Bo
On Mon, May 16, 2016 at 10:44:38AM +0200, David Sterba wrote:
> On Fri, May 13, 2016 at 05:07:00PM -0700, Liu Bo wrote:
> > We have two BUG_ON in merge_bio, but since it can gracefully return errors
> > to callers, use WARN_ONCE to give error information and don't leave a
> > possible panic.
> > 
> > Signed-off-by: Liu Bo 
> > ---
> >  fs/btrfs/extent_io.c | 1 -
> >  fs/btrfs/inode.c | 6 --
> >  2 files changed, 4 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index e601e0f..99286d1 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -2746,7 +2746,6 @@ static int merge_bio(int rw, struct extent_io_tree 
> > *tree, struct page *page,
> > if (tree->ops && tree->ops->merge_bio_hook)
> > ret = tree->ops->merge_bio_hook(rw, page, offset, size, bio,
> > bio_flags);
> > -   BUG_ON(ret < 0);
> > return ret;
> >  
> >  }
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 5874562..3a989e3 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -1827,8 +1827,10 @@ int btrfs_merge_bio_hook(int rw, struct page *page, 
> > unsigned long offset,
> > map_length = length;
> > ret = btrfs_map_block(root->fs_info, rw, logical,
> >   &map_length, NULL, 0);
> > -   /* Will always return 0 with map_multi == NULL */
> > -   BUG_ON(ret < 0);
> > +   if (ret < 0) {
> > +   WARN_ONCE(ret < 0, KERN_ERR "ret = %d\n", ret);
> 
> btrfs_map_block is quite verbose about the errors so it's not needed to
> print it here.

OK.

> 
> Otherwise I'm not sure if all paths that go through the merge hook
> handle errors, eg. in btrfs_submit_compressed_read or _write. Some code
> is skipped if merge hook returns nonzero. But, the code expects either 0
> or 1, and when the BUG_ON(ret < 0) is removed suddenly the 'ret < 0' can
> be returned. Unexpected.

Right now btrfs_merge_bio_hook() only returns 1 or 0 and treats (ret < 0)
as a BUG_ON.

But compress code is ready to handle the error,

btrfs_submit_compressed_read/write() {
...
ret = merge_bio_hook()
if (ret || bio_add_page()) {
ret = btrfs_map_bio();
BUG_ON(ret);
...
}
...
}

So ret < 0 is handled: if there is any error from merge_bio_hook(), it
submits the current bio, so this path looks sane to me.

The other consumer of merge_bio is submit_extent_page(), where it's OK
to return errors and callers are ready to handle them.

> 
> It's IMO better to push up the BUG_ON error handling only one caller at
> a time. That way it's easier to review the callgraph and call paths.

merge_bio() is a wrapper for tree->ops->merge_bio_hook(), which is the
same thing with btrfs_merge_bio_hook().  It makes no sense if we just
look at BUG_ON in btrfs_merge_bio_hook but keep BUG_ON() in merge_bio().

Thanks,

-liubo


Re: About in-band dedupe for v4.7

2016-05-16 Thread David Sterba
On Fri, May 13, 2016 at 11:13:18AM +0800, Wang Shilong wrote:
> I am not commenting because of Qu's patch; of course, Qu and Mark Fasheh
> do really good work with their Btrfs contributions.
> 
> Just my two cents!
> 
> 1) I think currently we should really focus on Btrfs stability; there
> are still many bugs hidden inside Btrfs, as Filipe's fixes flying in here
> and there show. Unfortunately, I have not seen the number of Btrfs bugs
> decreasing over these two years, even though we have frozen new features.

So this means that the number of bugs is increasing despite all the bugfixes
that get merged? I wonder if you have numbers to back that up, or if it's
just handwaving. Complex code has bugs and it's not easy to discover
them, that's not surprising.

> We started adopting Btrfs internally some time ago, but it is really
> unstable, unfortunately.
> 
> So why not sit down and make it stable first?

And do you think this does not happen? Most of the patches that are merged
are fixes or updates related to fixes. The number of big new features is
at most one per release.

> 2) I am not against new features, but for a new feature I think we
> should be really careful now. Especially if a new feature affects the
> normal write/read path, I think the following things can be done to make
> things better:
> 
> -> Give us your design and documents (maybe put them on the wiki or
> somewhere else) so that other people can really review your design instead
> of looking at a bunch of code first. And we really need to understand the
> pros and cons of the design; also, if there are TODOs, please clarify how
> we can solve those problems (or whether it is possible).

The design document should come as the 0th patch in a series. This
patchset had the overall idea described in
1438072250-2871-1-git-send-email-quwen...@cn.fujitsu.com; further
comments about the details are scattered across the patchset revisions, so
this could be improved so that everybody knows what the consensus is.

> -> We really need a lot of testing and benchmarking if it affects the
> normal write/read path.

Sure, new tests and test results are welcome but they do not come in
hundreds.

> -> I think Chris is nice to merge patches, but I would really urge that
> next time, if we want to merge new features, please make sure at least
> two other people review the patches.

Thanks for the advice, but that's nothing new. People willing to do
reviews are scarce and overloaded. More reviews are not a problem,
everybody does it in a different style, the more the better coverage
we'll have in the end.

Doing reviews cannot be forced otherwise the quality would drop,
announcing 'reviewer of the month' would not help either. Everybody who
considers self a frequent contributor and knows some area should be also
self-motivated to review patches that touch it. Reviewing has a positive
educational effect as one has to make sure he understands what the old
code does and how it's going to be changed by the patch. There's usually
overlap with other areas so this slowly extends the expertise which has
a good impact on the developer community etc.

Maintainers do reviews, but not necessarily as deep as another developer
could do, because sometimes it's difficult, or too lengthy to get
familiar (again) with some particular code or code path.  This leads to
slow merging pace or worse merging buggy patches if the review misses
some dark corners.

I think I'm stating the obvious but it should not hurt to repeat this
from time to time.


Re: About in-band dedupe for v4.7

2016-05-16 Thread David Sterba
On Thu, May 12, 2016 at 01:54:26PM -0700, Mark Fasheh wrote:
> On Wed, May 11, 2016 at 07:36:59PM +0200, David Sterba wrote:
> > On Tue, May 10, 2016 at 07:52:11PM -0700, Mark Fasheh wrote:
> > > Taking your history with qgroups out of this btw, my opinion does not
> > > change.
> > > 
> > > With respect to in-memory only dedupe, it is my honest opinion that such a
> > > limited feature is not worth the extra maintenance work. In particular
> > > there's about 800 lines of code in the userspace patches which I'm sure
> > > you'd want merged, because how could we test this then?
> > 
> > I like the in-memory dedup backend. It's lightweight, only a heuristic,
> > does not need any IO or persistent storage. OTOH I consider it a subpart
> > of the in-band deduplication that does all the persistency etc. So I
> > treat the ioctl interface from a broader aspect.
> 
> Those are all nice qualities, but what do they all get us?
> 
> For example, my 'large' duperemove test involves about 750 gigabytes of
> general purpose data - quite literally /home off my workstation.

Home directory is IMO a typical anti-usecase for in-band deduplication.

> After the run I'm usually seeing between 65-75 gigabytes saved for a total
> of only 10% duplicated data. I would expect this to be fairly 'average' -
> /home on my machine has the usual stuff - documents, source code, media,
> etc.
> 
> So if you were writing your whole fs out you could expect about the same
> from inline dedupe - 10%-ish. Let's be generous and go with that number
> though as a general 'this is how much dedupe we get'.
> 
> What the memory backend is doing then is providing a cache of sha256/block
> calculations. This cache is very expensive to fill, and every written block
> must go through it. On top of that, the cache does not persist between
> mounts, and has items regularly removed from it when we run low on memory.
> All of this will drive down the amount of duplicated data we can find.
> 
> So our best case savings is probably way below 10% - let's be _really_ nice
> and say 5%.
> 
> Now ask yourself the question - would you accept a write cache which is
> expensive to fill and would only have a hit rate of less than 5%?

I'd ask myself a different question: would I turn on a write cache if the
utilization is that low? No.

The in-memory cache is not going to be on by default; it's going to help
in specific scenarios, as described in other replies.

> Oh and there's 800 lines of userspace we'd merge to manage this cache too,
> kernel ioctls which would have to be finalized, etc.

Most of the userspace code is going to be shared with the generic dedup.
I haven't reviewed the userspace because the ioctl interfaces are not
settled so the code can change, but now it should be enough to test the
feature.

> > A usecase I find interesting is to keep the in-memory dedup cache and
> > then flush it to disk on demand, compared to automatically synced dedup
> > (eg. at commit time).
> 
> What's the benefit here? We're still going to be hashing blocks on the way
in, and if we're not deduping them at write time then we'll just have to
remove the extents and dedupe them later.
> 
> 
> > > A couple examples sore points in my review so far:
> > > 
> > > - Internally you're using a mutex (instead of a spinlock) to lock out 
> > > queries
> > >  to the in-memory hash, which I can see becoming a performance problem in 
> > > the
> > >  write path.
> > > 
> > > - Also, we're doing SHA256 in the write path which I expect will
> > >  slow it down even more dramatically. Given that all the work done gets
> > >  thrown out every time we fill the hash (or remount), I just don't see 
> > > much
> > >  benefit to the user with this.
> > 
> > I had some ideas to use faster hashes and do sha256 when it's going to
> > be stored on disk, but there were some concerns. The objection against
> > speed and performance hit at write time is valid. But we'll need to
> > verify that in real performance tests, which haven't happend yet up to
> > my knowledge.
> 
> This is the type of thing that IMHO absolutely must be provided with each
> code drop of the feature. Dedupe is nice but _nobody_ will use it if it's
> slow. I know this from experience. I personally feel that btrfs has had
> enough of 'cute' and 'almost working' features. If we want inline dedupe we
> should do it correctly and with the right metrics from the beginning.

That's why this feature is on the slow-merge track, and I'm glad that so
many people are interested. And I hope it's not just because Qu asked to
merge it. I'm pushing back on the interface side; so far I think the
feature will address real use cases.

To evaluate the feature we have to put it into some more visible git tree,
not only to see how it plays together with other changes, but because
the number of people who would apply the patches from the mailing list is
low. Leaving the patchset out of a for-next is easy if we encounter
problems.

> This is slightly 

[PULL] Btrfs for 4.7

2016-05-16 Thread David Sterba
Hi,

please queue the following branch to 4.7.

New features or user-visible changes:
* device delete by id, a v2 ioctl for device deletion (this was held back from
  4.6 pull due to possibly related crashes that haven't appeared for a long
  time)
* DUP allowed on multiple-device filesystem (helps in case we want to go from
  RAID1/... to SINGLE but do not want to lose all redundancy while converting)
* GETFLAGS/SETFLAGS/GETVERSION ioctls work on 32bit

Developer-visible changes:
* definitions that are shared with userspace are moved to the uapi directory

Other:
* assorted bugfixes and cleanups

The branch is merged from several topic branches, grouped by functionality
or purpose. The patches have been in for-next, most of them for a few weeks at
least, tested together with Filipe's branch.

Thanks.


The following changes since commit 02da2d72174c61988eb4456b53f405e3ebdebce4:

  Linux 4.6-rc5 (2016-04-24 16:17:05 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-chris-4.7

for you to fetch changes up to 680834ca0ad4e9827048d4bda1e38db69c3dd1e4:

  Merge branch 'foreign/jeffm/uapi' into for-chris-4.7-20160516 (2016-05-16 
15:46:29 +0200)


Adam Borowski (1):
  btrfs: fix int32 overflow in shrink_delalloc().

Anand Jain (20):
  btrfs: rename btrfs_std_error to btrfs_handle_fs_error
  btrfs: remove unused function btrfs_assert()
  btrfs: move error handling code together in ctree.h
  btrfs: remove save_error_info()
  btrfs: create a helper function to read the disk super
  btrfs: create helper function __check_raid_min_devices()
  btrfs: clean up and optimize __check_raid_min_device()
  btrfs: create helper btrfs_find_device_by_user_input()
  btrfs: make use of btrfs_find_device_by_user_input()
  btrfs: enhance btrfs_find_device_by_user_input() to check device path
  btrfs: make use of btrfs_scratch_superblocks() in btrfs_rm_device()
  btrfs: introduce device delete by devid
  btrfs: optimize check for stale device
  btrfs: use fs_info directly
  btrfs: refactor btrfs_dev_replace_start for reuse
  btrfs: pass the right error code to the btrfs_std_error
  btrfs: s_bdev is not null after missing replace
  btrfs: cleanup assigning next active device with a check
  btrfs: fix lock dep warning, move scratch dev out of device_list_mutex 
and uuid_mutex
  btrfs: fix lock dep warning move scratch super outside of chunk_mutex

Ashish Samant (1):
  btrfs: Fix BUG_ON condition in scrub_setup_recheck_block()

Austin S. Hemmelgarn (1):
  btrfs: allow balancing to dup with multi-device

Chandan Rajendra (1):
  Btrfs: __btrfs_buffered_write: Pass valid file offset when releasing 
delalloc space

Dan Carpenter (1):
  btrfs: send: silence an integer overflow warning

David Sterba (32):
  btrfs: rename __check_raid_min_devices
  btrfs: pass number of devices to btrfs_check_raid_min_devices
  btrfs: introduce raid-type to error-code table, for minimum device 
constraint
  btrfs: use existing device constraints table btrfs_raid_array
  btrfs: rename btrfs_find_device_by_user_input
  btrfs: rename flags for vol args v2
  btrfs: kill unused writepage_io_hook callback
  btrfs: ioctl: reorder exclusive op check in RM_DEV
  btrfs: send: use vmalloc only as fallback for send_buf
  btrfs: send: use vmalloc only as fallback for read_buf
  btrfs: send: use temporary variable to store allocation size
  btrfs: send: use vmalloc only as fallback for clone_roots
  btrfs: send: use vmalloc only as fallback for clone_sources_tmp
  btrfs: clone: use vmalloc only as fallback for nodesize bufer
  btrfs: use dynamic allocation for root item in create_subvol
  btrfs: reuse existing variable in scrub_stripe, reduce stack usage
  btrfs: add read-only check to sysfs handler of features
  btrfs: add check to sysfs handler of label
  btrfs: sysfs: protect reading label by lock
  btrfs: add write protection to SET_FEATURES ioctl
  btrfs: ioctl: reorder exclusive op check in RM_DEV
  btrfs: switch to common message helpers in open_ctree, adjust messages
  btrfs: GFP_NOFS does not GFP_HIGHMEM
  btrfs: rename and document compression workspace members
  btrfs: preallocate compression workspaces
  btrfs: make find_workspace always succeed
  btrfs: make find_workspace warn if there are no workspaces
  btrfs: build fixup for qgroup_account_snapshot
  Merge branch 'misc-4.7' into for-chris-4.7-20160516
  Merge branch 'cleanups-4.7' into for-chris-4.7-20160516
  Merge branch 'foreign/anand/dev-del-by-id-ext' into for-chris-4.7-20160516
  Merge branch 'foreign/jeffm/uapi' into for-chris-4.7-20160516

Geert Uytterhoeven (1):
  Btrfs: Refactor

Re: [PATCH v2 0/3] btrfs-progs: autogen: Some compatibility fixs

2016-05-16 Thread David Sterba
On Fri, May 13, 2016 at 11:44:50AM +0800, Zhao Lei wrote:
> Changelog v1->v2:
> 1: btrfs-progs: autogen: Make build success in CentOS 6 and 7
>Add AC_SUBST(UDEVDIR), suggested by: Jeff Mahoney 
> 
> Zhao Lei (3):
>   btrfs-progs: autogen: Avoid chdir fail on dirname with blank
>   btrfs-progs: autogen: Make build success in CentOS 6 and 7
>   btrfs-progs: autogen: Don't show success message on fail

Applied, thanks.


Re: RFE: 'btrfs' tools machine readable output

2016-05-16 Thread Pino Toscano
On Monday 16 May 2016 14:21:07 Martin Steigerwald wrote:
> Hello Richard,
> 
> On Montag, 16. Mai 2016 13:14:56 CEST Richard W.M. Jones wrote:
> > I don't have time to implement this right now, so I'm just posting
> > this as a suggestion/request ...
> > 
> > It would be really helpful if the btrfs tools had a machine-readable
> > output.
> > 
> > Libguestfs parses btrfs tools output in a number of places, eg:
> > https://github.com/libguestfs/libguestfs/blob/master/daemon/btrfs.c
> > This is a massive PITA because each time a new release of btrfs-progs
> > comes along it changes the output slightly, and we end up having
> > to add all sorts of hacks.
> > 
> > With machine-readable output, there'd be a flag which would
> > change the output.  eg:
> 
> I wonder whether parsing text-based output is really the most elegant
> method here.
> 
> How about a libbtrfs so that other tools can benefit from btrfs tools 
> functionality?

btrfs-progs is GPL v2 only, and that may be an issue for consumers
(the libguestfs daemon, the component interacting with devices and
filesystems inside the libguestfs appliance, is GPL v2+ for example).

> Of course it would likely be more effort than implementing structured output.

There's this of course, which is not entirely negligible.

-- 
Pino Toscano



Re: RFE: 'btrfs' tools machine readable output

2016-05-16 Thread Richard W.M. Jones

On Mon, May 16, 2016 at 02:21:07PM +0200, Martin Steigerwald wrote:
> Hello Richard,
> 
> On Montag, 16. Mai 2016 13:14:56 CEST Richard W.M. Jones wrote:
> > I don't have time to implement this right now, so I'm just posting
> > this as a suggestion/request ...
> > 
> > It would be really helpful if the btrfs tools had a machine-readable
> > output.
> > 
> > Libguestfs parses btrfs tools output in a number of places, eg:
> > https://github.com/libguestfs/libguestfs/blob/master/daemon/btrfs.c
> > This is a massive PITA because each time a new release of btrfs-progs
> > comes along it changes the output slightly, and we end up having
> > to add all sorts of hacks.
> > 
> > With machine-readable output, there'd be a flag which would
> > change the output.  eg:
> 
> I wonder whether parsing text-based output is really the most elegant
> method here.
> 
> How about a libbtrfs, so that other tools can benefit from btrfs tools'
> functionality? This way, desktop environments wishing to make use of
> snapshot functionality or advanced disk usage reporting, for example,
> could easily make use of it without calling external commands.
> 
> Of course it would likely be more effort than implementing structured output.

Yes, a similar situation happened with qemu actually.  We've been
talking for over half a decade about putting the qemu block layer into
a library, and it's still not happened.  But we got `qemu-img info' to
have JSON output in a few weeks, and that is now how we query the
properties of VM disk images:

$ qemu-img info --output=json ./builder/fedora.qcow2
{
"virtual-size": 1073741824,
"filename": "./builder/fedora.qcow2",
"cluster-size": 65536,
"format": "qcow2",
"actual-size": 7933952,
"format-specific": {
"type": "qcow2",
"data": {
"compat": "1.1",
"lazy-refcounts": false,
"refcount-bits": 16,
"corrupt": false
}
},
"dirty-flag": false
}
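To sketch why structured output removes the parsing hacks: a qemu-img-style JSON document can be consumed by any JSON parser instead of version-sensitive text scraping. This example assumes python3 is available and uses a stand-in /tmp file in place of real qemu-img output.

```shell
# a qemu-img-info-style JSON document, parsed without regex hacks
cat > /tmp/info.json <<'EOF'
{ "virtual-size": 1073741824, "format": "qcow2", "dirty-flag": false }
EOF
python3 -c 'import json; d = json.load(open("/tmp/info.json")); print(d["format"], d["virtual-size"])'
```

Fields can be added or reordered by future releases without breaking the consumer, which is exactly the property libguestfs wants from btrfs-progs.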

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html


Re: RFE: 'btrfs' tools machine readable output

2016-05-16 Thread Austin S. Hemmelgarn

On 2016-05-16 08:14, Richard W.M. Jones wrote:

I don't have time to implement this right now, so I'm just posting
this as a suggestion/request ...

It would be really helpful if the btrfs tools had a machine-readable
output.

Libguestfs parses btrfs tools output in a number of places, eg:
https://github.com/libguestfs/libguestfs/blob/master/daemon/btrfs.c
This is a massive PITA because each time a new release of btrfs-progs
comes along it changes the output slightly, and we end up having
to add all sorts of hacks.

With machine-readable output, there'd be a flag which would
change the output.  eg:

$ btrfs filesystem show
Label: 'ROOT'  uuid: af471cfc-421a-4d51-8811-ce969f76828a
Total devices 1 FS bytes used 5.29MiB
devid    1 size 767.97MiB used 92.00MiB path /dev/sda2

would become:

$ btrfs --json filesystem show
{
  "devices": {
 "Label": "ROOT",
 "uuid": "af471cfc-421a-4d51-8811-ce969f76828a",
 /// etc
  }
}

By this example I don't mean that JSON has to be the format -- in fact
it's a terrible format with all sorts of problems -- any format which
is parseable with C libraries would do for us.
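The kind of scraping the human-oriented output forces on consumers can be sketched as follows (a hypothetical minimal parser; the field layout is taken from the sample output quoted above, and real btrfs-progs output varies between releases, which is exactly the problem being described):

```python
import re

# Human-oriented `btrfs filesystem show` output, as quoted above
# (spacing restored; real output changes between btrfs-progs releases).
sample = (
    "Label: 'ROOT'  uuid: af471cfc-421a-4d51-8811-ce969f76828a\n"
    "\tTotal devices 1 FS bytes used 5.29MiB\n"
    "\tdevid    1 size 767.97MiB used 92.00MiB path /dev/sda2\n"
)

# Hard-coded patterns like these are what a consumer has to maintain today;
# any wording change in a new release silently breaks them.
header = re.search(r"Label:\s+'([^']*)'\s+uuid:\s+(\S+)", sample)
devices = re.findall(
    r"devid\s+(\d+)\s+size\s+(\S+)\s+used\s+(\S+)\s+path\s+(\S+)", sample)

print(header.group(1), header.group(2))
print(devices)
```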
I would love to see something like this myself, as it would make 
integration with monitoring tools so much easier.  I'd vote for YAML as 
the output format though, as it's easily human readable as well as 
machine parseable while still being space efficient.  Output from your 
example above might look something like this:

---
- filesystems
  - label: 'ROOT'
    uuid: af471cfc-421a-4d51-8811-ce969f76828a
    devices: 1
      - devid: 1
        size: 767.97 MB
        used: 92.00 MB
        path: /dev/sda2
    used: 5.29 MB
...
Although of course the numbers would probably just be in bytes instead 
of 'human' values.
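A quick sketch of why raw byte counts beat 'human' strings in machine output: parsing a rounded value like '767.97MiB' cannot recover the exact size (hypothetical helper, IEC units assumed; not part of btrfs-progs):

```python
import re

# Hypothetical helper; IEC units only.
UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

def human_to_bytes(text):
    """Parse '92.00MiB' / '5.29 MiB' back into (approximate) bytes."""
    m = re.fullmatch(r"\s*([\d.]+)\s*(B|KiB|MiB|GiB|TiB)\s*", text)
    if not m:
        raise ValueError("unparseable size: %r" % text)
    return int(float(m.group(1)) * UNITS[m.group(2)])

# Exact values survive, rounded ones do not: a 768.00 MiB (805306368 byte)
# device printed as '767.97MiB' does not round-trip to the true size.
print(human_to_bytes("92.00MiB"))
print(human_to_bytes("767.97MiB"))
```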



RE: [PATCH] btrfs: fix memory leak during RAID 5/6 device replacement

2016-05-16 Thread Zhao Lei
Hi, Scott Talbert

> From: Zhao Lei [mailto:zhao...@cn.fujitsu.com]
> Sent: Friday, May 13, 2016 6:31 PM
> To: 'dste...@suse.cz' ; 'Scott Talbert'
> 
> Cc: 'linux-btrfs@vger.kernel.org' 
> Subject: RE: [PATCH] btrfs: fix memory leak during RAID 5/6 device replacement
> 
> Hi, Scott Talbert
> 
> * From: David Sterba [mailto:dste...@suse.cz]
> > Sent: Tuesday, May 10, 2016 1:32 AM
> > To: Scott Talbert 
> > Cc: linux-btrfs@vger.kernel.org; Zhao Lei 
> > Subject: Re: [PATCH] btrfs: fix memory leak during RAID 5/6 device
> replacement
> >
> > CC Zhao Lei 
> >
> > On Mon, May 09, 2016 at 09:14:28AM -0400, Scott Talbert wrote:
> > > A 'struct bio' is allocated in scrub_missing_raid56_pages(), but it was 
> > > never
> > > freed anywhere.
> > >
> > > Signed-off-by: Scott Talbert 
> > > ---
> > >  fs/btrfs/scrub.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> > > index 82bedf9..607cc6e 100644
> > > --- a/fs/btrfs/scrub.c
> > > +++ b/fs/btrfs/scrub.c
> > > @@ -2130,6 +2130,8 @@ static void scrub_missing_raid56_end_io(struct
> > bio *bio)
> > >   if (bio->bi_error)
> > >   sblock->no_io_error_seen = 0;
> > >
> > > + bio_put(bio);
> > > +
> Seems good from review.
> I'll add some debug code to test it in more detail.
> 
Signed-off-by: Zhao Lei 
Tested-by: Zhao Lei 

Thanks
Zhaolei

> Thanks
> Zhaolei
> 
> > >   btrfs_queue_work(fs_info->scrub_workers, &sblock->work);
> > >  }
> > >
> > > --
> > > 1.9.1
> > >






Re: RFE: 'btrfs' tools machine readable output

2016-05-16 Thread Martin Steigerwald
Hello Richard,

On Montag, 16. Mai 2016 13:14:56 CEST Richard W.M. Jones wrote:
> I don't have time to implement this right now, so I'm just posting
> this as a suggestion/request ...
> 
> It would be really helpful if the btrfs tools had a machine-readable
> output.
> 
> Libguestfs parses btrfs tools output in a number of places, eg:
> https://github.com/libguestfs/libguestfs/blob/master/daemon/btrfs.c
> This is a massive PITA because each time a new release of btrfs-progs
> comes along it changes the output slightly, and we end up having
> to add all sorts of hacks.
> 
> With machine-readable output, there'd be a flag which would
> change the output.  eg:

I wonder whether parsing text-based output is really the most elegant method 
here.

How about a libbtrfs, so that other tools can benefit from the btrfs tools' 
functionality? That way desktop environments wishing to make use of snapshot 
functionality or advanced disk usage reporting, for example, could easily 
use it without calling external commands.

Of course it would likely be more effort than implementing structured output.

Thanks,
-- 
Martin


RFE: 'btrfs' tools machine readable output

2016-05-16 Thread Richard W.M. Jones
I don't have time to implement this right now, so I'm just posting
this as a suggestion/request ...

It would be really helpful if the btrfs tools had a machine-readable
output.

Libguestfs parses btrfs tools output in a number of places, eg:
https://github.com/libguestfs/libguestfs/blob/master/daemon/btrfs.c
This is a massive PITA because each time a new release of btrfs-progs
comes along it changes the output slightly, and we end up having
to add all sorts of hacks.

With machine-readable output, there'd be a flag which would
change the output.  eg:

$ btrfs filesystem show
Label: 'ROOT'  uuid: af471cfc-421a-4d51-8811-ce969f76828a
Total devices 1 FS bytes used 5.29MiB
devid    1 size 767.97MiB used 92.00MiB path /dev/sda2

would become:

$ btrfs --json filesystem show
{
  "devices": {
 "Label": "ROOT",
 "uuid": "af471cfc-421a-4d51-8811-ce969f76828a",
 /// etc
  }
}

By this example I don't mean that JSON has to be the format -- in fact
it's a terrible format with all sorts of problems -- any format which
is parseable with C libraries would do for us.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines.  Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top


Re: fsck: to repair or not to repair

2016-05-16 Thread Austin S. Hemmelgarn

On 2016-05-16 07:34, Andrei Borzenkov wrote:

16.05.2016 14:17, Austin S. Hemmelgarn writes:

On 2016-05-13 17:35, Chris Murphy wrote:

On Fri, May 13, 2016 at 9:28 AM, Nikolaus Rath  wrote:

On May 13 2016, Duncan <1i5t5.dun...@cox.net> wrote:

Because btrfs can be multi-device, it needs some way to track which
devices belong to each filesystem, and it uses filesystem UUID for this
purpose.

If you clone a filesystem (for instance using dd or lvm snapshotting,
doesn't matter how) and then trigger a btrfs device scan, say by
plugging
in some other device with btrfs on it so udev triggers a scan, and the
kernel sees multiple devices with the same filesystem UUID as a result,
and one of those happens to be mounted, you can corrupt both copies as
the kernel btrfs won't be able to tell them apart and may write updates
to the wrong one.


That seems like a rather odd design. Why isn't btrfs refusing to mount
in this situation? In the face of ambiguity, guessing is generally bad
idea (at least for a computer program).


The logic you describe requires code. It's the absence of code rather
than an intentional design that's the cause of the current behavior.
And yes, it'd be nice if Btrfs weren't stepping on its own tail in
this situation. It could be as simple as refusing to mount anytime
there's an ambiguity, but that's sorta user hostile if there isn't a
message that goes along with it to help the user figure out a way to
resolve the problem. And that too could be fraught with peril if the
user makes a mistake. So, really what's the right way to do this is
part of the problem but I agree it's better to be hostile and refuse
to mount a given volume UUID at all when too many devices are found,
than corrupt the file system.


FWIW, the behavior I'd expect from a sysadmin perspective would be:
1. If and only if a correct number of device= options have been passed
to mount, use those devices (and only those devices), and log a warning
if extra devices are detected.


First, how do you know that devices, passed as device= options, are
correct? Is it possible to detect stale copy?
You don't.  As much as it pains me to say it, there's no way to protect 
against this reliably.  The intent is that if you have specified the 
correct number of devices according to the number the filesystem says 
should be there (and that number is the same on all devices specified), 
it's assumed you know what you're doing.


Second, today udev rules will run equivalent of "btrfs device ready" for
each device that is part of btrfs.
That's part of the rules shipped by systemd, and is not by any means on 
every system in existence.  That is an inherent design flaw in systemd 
resulting from them thinking they're smarter than the kernel, and it has 
on multiple occasions bitten people.

So you still need to handle the
situation when device(s) appear and disappear after initial mount and
have some way to distinguish between two copies.
Yes, you need to account for devices appearing and disappearing, but at 
least until we add proper support for off-line devices, that's easy.


Third, what exactly "extra devices detected" means? Who is responsible
for detection? Where this information is kept? How can mount query this
information?
If there are more devices with the filesystem's UUID than are passed in 
via device= options, and the above stated condition regarding device= 
options is met, then those are extra devices.



2. Otherwise, refuse to mount and log a warning.


So no way to mount degraded redundant filesystem?
I know a large number of people who routinely use degraded as part of 
their fstab options.  Degraded is supposed to mean reduced data safety, 
not 'may cause random corruption just by being used'.  I have no issue 
with a mount option to force mounting it anyway, but I absolutely do not 
want that to be part of the degraded mount option.




Re: problems with free space cache

2016-05-16 Thread Austin S. Hemmelgarn

On 2016-05-16 02:20, Qu Wenruo wrote:



Duncan wrote on 2016/05/16 05:59 +:

Qu Wenruo posted on Mon, 16 May 2016 10:24:23 +0800 as excerpted:


IIRC clear_cache option is fs level option.
So the first mount with clear_cache, then all subvolume will have
clear_cache.


Question:  Does clear_cache work with a read-only mount?

Good question.

But easy to check.
I just checked it and found that even though it's possible, it doesn't work.

Free space cache inode bytenr doesn't change and no generation change.
While without ro, it did rebuild free space cache for *SOME* chunks, not
*ALL* chunks.

And that's the problem I'm just chasing today.


Short conclusion: the clear_cache mount option will only rebuild the free
space cache for chunks we allocate space from during the time the fs is
mounted with clear_cache.
(Maybe I'm just out-of-date and some other devs may already know that)

And a kernel fix for that is a little tricky, as we don't read out all the
free space caches up front, but only when we are going to allocate space
from them.
Any free space cache we didn't read out won't be rebuilt.

You can find my reproducer in my just submitted
patch(https://patchwork.kernel.org/patch/9098431/).

That behavior makes things a little confusing: users may keep hitting the
annoying free space cache warnings from the kernel even if they use the
clear_cache mount option.


Anyway, I'll add the ability to manually wipe out all (or a given) free
space cache to btrfsck, at least providing a way to really rebuild all v1
free space caches.

FWIW, I think it's possible to do this by mounting with clear_cache and 
then running a full balance on the filesystem.  Having an option to do 
this on an unmounted FS would be preferred of course, as that would 
almost certainly be more efficient on any reasonably sized filesystem.




Re: fsck: to repair or not to repair

2016-05-16 Thread Andrei Borzenkov
16.05.2016 14:17, Austin S. Hemmelgarn writes:
> On 2016-05-13 17:35, Chris Murphy wrote:
>> On Fri, May 13, 2016 at 9:28 AM, Nikolaus Rath  wrote:
>>> On May 13 2016, Duncan <1i5t5.dun...@cox.net> wrote:
 Because btrfs can be multi-device, it needs some way to track which
 devices belong to each filesystem, and it uses filesystem UUID for this
 purpose.

 If you clone a filesystem (for instance using dd or lvm snapshotting,
 doesn't matter how) and then trigger a btrfs device scan, say by
 plugging
 in some other device with btrfs on it so udev triggers a scan, and the
 kernel sees multiple devices with the same filesystem UUID as a result,
 and one of those happens to be mounted, you can corrupt both copies as
 the kernel btrfs won't be able to tell them apart and may write updates
 to the wrong one.
>>>
>>> That seems like a rather odd design. Why isn't btrfs refusing to mount
>>> in this situation? In the face of ambiguity, guessing is generally bad
>>> idea (at least for a computer program).
>>
>> The logic  you describe requires code. It's the absence of code rather
>> than an intentional design that's the cause of the current behavior.
>> And yes, it'd be nice if Btrfs weren't stepping on its own tail in
>> this situation. It could be as simple as refusing to mount anytime
>> there's an ambiguity, but that's sorta user hostile if there isn't a
>> message that goes along with it to help the user figure out a way to
>> resolve the problem. And that too could be fraught with peril if the
>> user makes a mistake. So, really what's the right way to do this is
>> part of the problem but I agree it's better to be hostile and refuse
>> to mount a given volume UUID at all when too many devices are found,
>> than corrupt the file system.
>>
> FWIW, the behavior I'd expect from a sysadmin perspective would be:
> 1. If and only if a correct number of device= options have been passed
> to mount, use those devices (and only those devices), and log a warning
> if extra devices are detected.

First, how do you know that devices, passed as device= options, are
correct? Is it possible to detect stale copy?

Second, today udev rules will run equivalent of "btrfs device ready" for
each device that is part of btrfs. So you still need to handle the
situation when device(s) appear and disappear after initial mount and
have some way to distinguish between two copies.

Third, what exactly "extra devices detected" means? Who is responsible
for detection? Where this information is kept? How can mount query this
information?

> 2. Otherwise, refuse to mount and log a warning.

So no way to mount degraded redundant filesystem?


Re: BTRFS Data at Rest File Corruption

2016-05-16 Thread Austin S. Hemmelgarn

On 2016-05-16 02:07, Chris Murphy wrote:

Current hypothesis
 "I suspected, and I still suspect that the error occurred upon a
metadata update that corrupted the checksum for the file, probably due
to silent memory corruption.  If the checksum was silently corrupted,
it would be simply written to both drives causing this type of error."

A metadata update alone will not change the data checksums.

But let's ignore that. If there's corrupt extent csum in a node that
itself has a valid csum, this is functionally identical to e.g.
nerfing 100 bytes of a file's extent data (both copies, identically).
The fs doesn't know the difference. All it knows is the node csum is
valid, therefore the data extent csum is valid, and that's why it
assumes the data is wrong and hence you get an I/O error. And I can
reproduce most of your results by nerfing file data.

The entire dmesg for scrub looks like this:


May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-6):
checksum error at logical 5566889984 on dev /dev/dm-6, sector 8540160,
root 5, inode 258, offset 0, length 4096, links 1 (path:
openSUSE-Tumbleweed-NET-x86_64-Current.iso)
May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
bdev /dev/dm-6 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
unable to fixup (regular) error at logical 5566889984 on dev /dev/dm-6
May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-6):
checksum error at logical 5566889984 on dev /dev/mapper/VG-b1, sector
8579072, root 5, inode 258, offset 0, length 4096, links 1 (path:
openSUSE-Tumbleweed-NET-x86_64-Current.iso)
May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
bdev /dev/mapper/VG-b1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
unable to fixup (regular) error at logical 5566889984 on dev
/dev/mapper/VG-b1

And the entire dmesg for running sha256sum on the file is

May 15 23:33:41 f23s.localdomain kernel: __readpage_endio_check: 22
callbacks suppressed
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141


And I do get an i/o error for sha256sum and no hash is computed.

But there's two important differences:
1. I have two unable to fixup messages, one for each device, at the
exact same time.
2. I altered both copies of extent data.

It's a mystery to me how your file data has not changed, but somehow
the extent csum was changed but also the node csum was recomputed
correctly. That's a bit odd.
I would think this would be perfectly possible if some other file that 
had a checksum in that node changed, thus forcing the node's checksum to 
be updated.  Theoretical sequence of events:

1. Some file which has a checksum in node A gets written to.
2. Node A is loaded into memory to update the checksum.
3. The new checksum for the changed extent in the file gets updated in 
the in-memory copy of node A.
4. Node A has its own checksum recomputed based on the new data, and 
then gets saved to disk.
If something happened after step 2 but before step 4 that caused one of the 
other checksums to go bad, then the checksum computed in step 4 will have 
covered the corrupted data.
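The sequence above can be mimicked with a toy model (zlib.crc32 as a stand-in for btrfs's crc32c; everything here is illustrative, not btrfs's on-disk layout):

```python
import zlib

def csum(data):
    # Stand-in for btrfs's crc32c; plain crc32 is fine for illustration.
    return zlib.crc32(data)

# A toy "node": a list of data-extent checksums.
extent_data = [b"file A contents", b"file B contents"]
node = [csum(d) for d in extent_data]

# Steps 2-3: the node is in memory; one entry is silently corrupted.
node[0] ^= 0x1

# Step 4: the node's own checksum is recomputed over the *corrupted*
# entries and written out, so the node itself verifies fine.
serialized = b"".join(c.to_bytes(4, "little") for c in node)
node_csum = csum(serialized)

node_valid = csum(serialized) == node_csum       # node looks clean
extent0_valid = node[0] == csum(extent_data[0])  # but data "mismatches"
print(node_valid, extent0_valid)
```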




Re: Hot data tracking / hybrid storage

2016-05-16 Thread Austin S. Hemmelgarn

On 2016-05-15 08:12, Ferry Toth wrote:

Is there anything going on in this area?

We have btrfs in RAID10 using 4 HDD's for many years now with a rotating
scheme of snapshots for easy backup. <10% files (bytes) change between
oldest snapshot and the current state.

However, the filesystem seems to become very slow, probably due to the
RAID10 and the snapshots.
While it's not exactly what you're thinking of, have you tried running 
BTRFS in raid1 mode on top of two DM/MD RAID0 volumes?  This provides 
the same degree of data integrity that BTRFS raid10 does, but gets 
measurably better performance.


It would be fantastic if we could just add 4 SSD's to the pool and btrfs
would just magically prefer to put often accessed files there and move
older or less popular files to the HDD's.

In my simple mind this can not be done easily using bcache as that would
require completely rebuilding the file system on top of bcache (can not
just add a few SSD's to the pool), while implementing a cache inside btrfs
is probably a complex thing with lots of overhead.
You may want to look into dm-cache, as that doesn't require reformatting 
the source device.  It doesn't quite get the same performance as bcache, 
but for me at least, the lower performance is a reasonable trade-off for 
being able to easily convert a device to use it, and being able to 
easily convert away from it if need be.


Simply telling the allocator to prefer new files to go to the ssd and
move away unpopular stuff to hdd during balance should do the trick, or am
I wrong?
In theory this would work as a first implementation, but it would need 
to have automatic data migration as an option to be considered 
practical, and that's not as easy to do correctly.




Re: fsck: to repair or not to repair

2016-05-16 Thread Austin S. Hemmelgarn

On 2016-05-13 17:35, Chris Murphy wrote:

On Fri, May 13, 2016 at 9:28 AM, Nikolaus Rath  wrote:

On May 13 2016, Duncan <1i5t5.dun...@cox.net> wrote:

Because btrfs can be multi-device, it needs some way to track which
devices belong to each filesystem, and it uses filesystem UUID for this
purpose.

If you clone a filesystem (for instance using dd or lvm snapshotting,
doesn't matter how) and then trigger a btrfs device scan, say by plugging
in some other device with btrfs on it so udev triggers a scan, and the
kernel sees multiple devices with the same filesystem UUID as a result,
and one of those happens to be mounted, you can corrupt both copies as
the kernel btrfs won't be able to tell them apart and may write updates
to the wrong one.


That seems like a rather odd design. Why isn't btrfs refusing to mount
in this situation? In the face of ambiguity, guessing is generally bad
idea (at least for a computer program).


The logic you describe requires code. It's the absence of code rather
than an intentional design that's the cause of the current behavior.
And yes, it'd be nice if Btrfs weren't stepping on its own tail in
this situation. It could be as simple as refusing to mount anytime
there's an ambiguity, but that's sorta user hostile if there isn't a
message that goes along with it to help the user figure out a way to
resolve the problem. And that too could be fraught with peril if the
user makes a mistake. So, really what's the right way to do this is
part of the problem but I agree it's better to be hostile and refuse
to mount a given volume UUID at all when too many devices are found,
than corrupt the file system.


FWIW, the behavior I'd expect from a sysadmin perspective would be:
1. If and only if a correct number of device= options have been passed 
to mount, use those devices (and only those devices), and log a warning 
if extra devices are detected.

2. Otherwise, refuse to mount and log a warning.


Re: [PATCH 4/7] Btrfs: free sys_array eb as soon as possible

2016-05-16 Thread David Sterba
On Fri, May 13, 2016 at 05:06:59PM -0700, Liu Bo wrote:
> While reading sys_chunk_array in superblock, btrfs creates a temporary
> extent buffer.  Since we don't use it after finishing reading
>  sys_chunk_array, we don't need to keep it in memory.
> 
> Signed-off-by: Liu Bo 

This patch is independent of the rest, I'll add it to next.


Re: [PATCH 5/7] Btrfs: replace BUG_ON with WARN in merge_bio

2016-05-16 Thread David Sterba
On Fri, May 13, 2016 at 05:07:00PM -0700, Liu Bo wrote:
> We have two BUG_ON in merge_bio, but since it can gracefully return errors
> to callers, use WARN_ONCE to give error information and don't leave a
> possible panic.
> 
> Signed-off-by: Liu Bo 
> ---
>  fs/btrfs/extent_io.c | 1 -
>  fs/btrfs/inode.c | 6 --
>  2 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index e601e0f..99286d1 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2746,7 +2746,6 @@ static int merge_bio(int rw, struct extent_io_tree 
> *tree, struct page *page,
>   if (tree->ops && tree->ops->merge_bio_hook)
>   ret = tree->ops->merge_bio_hook(rw, page, offset, size, bio,
>   bio_flags);
> - BUG_ON(ret < 0);
>   return ret;
>  
>  }
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 5874562..3a989e3 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1827,8 +1827,10 @@ int btrfs_merge_bio_hook(int rw, struct page *page, 
> unsigned long offset,
>   map_length = length;
>   ret = btrfs_map_block(root->fs_info, rw, logical,
> > >   &map_length, NULL, 0);
> - /* Will always return 0 with map_multi == NULL */
> - BUG_ON(ret < 0);
> + if (ret < 0) {
> + WARN_ONCE(ret < 0, KERN_ERR "ret = %d\n", ret);

btrfs_map_block is quite verbose about the errors so it's not needed to
print it here.

Otherwise I'm not sure if all paths that go through the merge hook
handle errors, eg. in btrfs_submit_compressed_read or _write. Some code
is skipped if merge hook returns nonzero. But, the code expects either 0
or 1, and when the BUG_ON(ret < 0) is removed suddenly the 'ret < 0' can
be returned. Unexpected.

It's IMO better to push up the BUG_ON error handling only one caller at
a time. That way it's easier to review the callgraph and call paths.


Re: [PATCH 1/7] Btrfs: replace BUG() with WARN_ONCE in raid56

2016-05-16 Thread David Sterba
On Sun, May 15, 2016 at 04:19:28PM +0200, Holger Hoffstätte wrote:
> On 05/14/16 02:06, Liu Bo wrote:
> > This BUG() has been triggered by a fuzz testing image, but in fact
> > btrfs can handle this gracefully by returning -EIO.
> > 
> > Thus, use WARN_ONCE for warning purpose and don't leave a possible
> > kernel panic.
> > 
> > Signed-off-by: Liu Bo 
> > ---
> >  fs/btrfs/raid56.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> > index 0b7792e..863f7fe 100644
> > --- a/fs/btrfs/raid56.c
> > +++ b/fs/btrfs/raid56.c
> > @@ -2139,7 +2139,7 @@ int raid56_parity_recover(struct btrfs_root *root, 
> > struct bio *bio,
> >  
> > rbio->faila = find_logical_bio_stripe(rbio, bio);
> > if (rbio->faila == -1) {
> > -   BUG();
> > +   WARN_ONCE(1, KERN_WARNING "rbio->faila is -1\n");
> 
> I'm generally in favor of not BUGing out for no good reason, but what
> is e.g. an admin (or user) supposed to do when he sees this message?
> Same for the other rather cryptic WARNs - they contain no actionable
> information, and are most likely going to be ignored as "debug spam".
> IMHO things that can be ignored can be deleted.

Agreed, the way this patchset replaces BUG_ON is very confusing.
WARN_ONCE is a global state, the message does not even print on which
filesystem the error happened. The only way to reset the state is to
unload the module.

This should be handled as a corruption, no matter if it's fuzzed or not,
report more details about what is corrupted or what was expected.


Re: problems with free space cache

2016-05-16 Thread Qu Wenruo



Duncan wrote on 2016/05/16 05:59 +:

Qu Wenruo posted on Mon, 16 May 2016 10:24:23 +0800 as excerpted:


IIRC clear_cache option is fs level option.
So the first mount with clear_cache, then all subvolume will have
clear_cache.


Question:  Does clear_cache work with a read-only mount?

Good question.

But easy to check.
I just checked it and found that even though it's possible, it doesn't work.

Free space cache inode bytenr doesn't change and no generation change.
While without ro, it did rebuild free space cache for *SOME* chunks, not 
*ALL* chunks.


And that's the problem I'm just chasing today.


Short conclusion: the clear_cache mount option will only rebuild the free 
space cache for chunks we allocate space from during the time the fs is 
mounted with clear_cache.

(Maybe I'm just out-of-date and some other devs may already know that)

And a kernel fix for that is a little tricky, as we don't read out all the 
free space caches up front, but only when we are going to allocate space 
from them.

Any free space cache we didn't read out won't be rebuilt.

You can find my reproducer in my just submitted 
patch(https://patchwork.kernel.org/patch/9098431/).


That behavior makes things a little confusing: users may keep hitting the 
annoying free space cache warnings from the kernel even if they use the 
clear_cache mount option.



Anyway, I'll add the ability to manually wipe out all (or a given) free 
space cache to btrfsck, at least providing a way to really rebuild all v1 
free space caches.


Thanks,
Qu


I could see it being like the log replay and being done even on ro
mounts, particularly since unlike the log replay, clear_cache is a
specific mount option so could be seen as specifically requested even if
otherwise ro.  Or not.  So I don't know.

He mentioned root, which is normally mounted read-only first.  That's
what got me thinking about it.  Does that complicate things?

If root is mounted ro without clear_cache by the initr*, will a
subsequent remount,rw,clear_cache do it?

And for those like me who keep root mounted ro by default, would
clear_cache work at all, or would it not work until such time as I
mounted root, or some other subvolume (which I don't have here to worry
about, but for those who do...) writable?






Re: BTRFS Data at Rest File Corruption

2016-05-16 Thread Chris Murphy
Current hypothesis
 "I suspected, and I still suspect that the error occurred upon a
metadata update that corrupted the checksum for the file, probably due
to silent memory corruption.  If the checksum was silently corrupted,
it would be simply written to both drives causing this type of error."

A metadata update alone will not change the data checksums.

But let's ignore that. If there's corrupt extent csum in a node that
itself has a valid csum, this is functionally identical to e.g.
nerfing 100 bytes of a file's extent data (both copies, identically).
The fs doesn't know the difference. All it knows is the node csum is
valid, therefore the data extent csum is valid, and that's why it
assumes the data is wrong and hence you get an I/O error. And I can
reproduce most of your results by nerfing file data.

The entire dmesg for scrub looks like this:


May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-6):
checksum error at logical 5566889984 on dev /dev/dm-6, sector 8540160,
root 5, inode 258, offset 0, length 4096, links 1 (path:
openSUSE-Tumbleweed-NET-x86_64-Current.iso)
May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
bdev /dev/dm-6 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
unable to fixup (regular) error at logical 5566889984 on dev /dev/dm-6
May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-6):
checksum error at logical 5566889984 on dev /dev/mapper/VG-b1, sector
8579072, root 5, inode 258, offset 0, length 4096, links 1 (path:
openSUSE-Tumbleweed-NET-x86_64-Current.iso)
May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
bdev /dev/mapper/VG-b1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6):
unable to fixup (regular) error at logical 5566889984 on dev
/dev/mapper/VG-b1

And the entire dmesg for running sha256sum on the file is

May 15 23:33:41 f23s.localdomain kernel: __readpage_endio_check: 22
callbacks suppressed
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141
May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6):
csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141


And I do get an i/o error for sha256sum and no hash is computed.

But there's two important differences:
1. I have two unable to fixup messages, one for each device, at the
exact same time.
2. I altered both copies of extent data.

It's a mystery to me how your file data has not changed, but somehow
the extent csum was changed but also the node csum was recomputed
correctly. That's a bit odd.




Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs-progs: corrupt-block: Add ability to corrupt free space cache file

2016-05-16 Thread Qu Wenruo
One user in mail list reported free space cache error where clear_cache
mount option failes to fix.

To reproduce such problem, add support to corrupt free space cache
file(old free space cache implement, or space_cache=v1).

With this patch, we can reproduce the problem quite easily:
Not suitable for a btrfs-progs or xfstest test case, as btrfsck won't
treat it as an error.

==
$ fallocate  test.img -l 2G
$ mkfs.btrfs test.img
$ sudo mount test.img  /mnt/btrfs/
$ sudo xfs_io  -f -c "pwrite 0 16M" /mnt/btrfs/tmp
$ sudo umount /mnt/btrfs
$ ./btrfs-corrupt-block -s content -l 136708096 ../test.img
  Here 136708096 is the second data chunk bytenr.
  (The first data chunk is the 8M one from mkfs, which doesn't have free
   space cache)

$ ./btrfsck test.img
Checking filesystem on ../test.img
UUID: 9542051b-8150-42e5-a9e6-95829656485c
checking extents
checking free space cache
btrfs: csum mismatch on free space cache  <<< Found space cache problem
failed to load free space cache for block group 136708096
checking fs roots
checking csums
checking root refs
found 16941058 bytes used err is 0
total csum bytes: 16384
total tree bytes: 163840
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 139567
file data blocks allocated: 16908288
 referenced 16908288

$ sudo mount -o clear_cache test.img /mnt/btrfs
$ sudo umount /mnt/btrfs
$ ./btrfsck test.img
Checking filesystem on ../test.img
UUID: 9542051b-8150-42e5-a9e6-95829656485c
checking extents
checking free space cache
btrfs: csum mismatch on free space cache <<< Problem not fixed by kernel
failed to load free space cache for block group 136708096
checking fs roots
checking csums
checking root refs
found 16941058 bytes used err is 0
total csum bytes: 16384
total tree bytes: 163840
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 139567
file data blocks allocated: 16908288
 referenced 16908288
==

Btrfs-progs fix will follow soon.

Reported-by: Ivan P 
Signed-off-by: Qu Wenruo 
---
 btrfs-corrupt-block.c | 124 +-
 1 file changed, 123 insertions(+), 1 deletion(-)

diff --git a/btrfs-corrupt-block.c b/btrfs-corrupt-block.c
index d331f96..eb27265 100644
--- a/btrfs-corrupt-block.c
+++ b/btrfs-corrupt-block.c
@@ -115,6 +115,9 @@ static void print_usage(int ret)
fprintf(stderr, "\t-C Delete a csum for the specified bytenr.  When "
"used with -b it'll delete that many bytes, otherwise it's "
"just sectorsize\n");
+   fprintf(stderr, "\t-s \n");
+   fprintf(stderr, "\t   Corrupt free space cache file(space_cache v1), 
must also specify -l for blockgroup bytenr\n");
+   fprintf(stderr, "\tcan be 'zero_gen','rand_gen' or 
'content'\n");
exit(ret);
 }
 
@@ -1012,6 +1015,100 @@ out:
return ret;
 
 }
+
+u64 rand_u64(void)
+{
+   u64 ret = 0;
+   int n;
+
+   for (n = 0; n < sizeof(ret) / sizeof(RAND_MAX); n++)
+   ret += rand() << (n * sizeof(RAND_MAX) * 8);
+   return ret;
+}
+
+#define CORRUPT_SC_FILE_ZERO_GEN   1
+#define CORRUPT_SC_FILE_RAND_GEN   2
+#define CORRUPT_SC_FILE_CONTENT3
+int corrupt_space_cache_file(struct btrfs_fs_info *fs_info, u64 block_group,
+int method)
+{
+   struct btrfs_root *tree_root = fs_info->tree_root;
+   struct btrfs_path path;
+   struct btrfs_key key;
+   struct btrfs_disk_key location;
+   struct btrfs_free_space_header *sc_header;
+   struct btrfs_file_extent_item *fi;
+   struct extent_buffer *node;
+   struct extent_buffer *corrupted_node;
+   u64 disk_bytenr;
+   int slot;
+   int ret;
+
+   key.objectid = BTRFS_FREE_SPACE_OBJECTID;
+   key.type = 0;
+   key.offset = block_group;
+
+   btrfs_init_path();
+
+   /* Don't start trans, as this will cause generation different */
+   ret = btrfs_search_slot(NULL, tree_root, , , 0, 0);
+   if (ret) {
+   error("failed to find free space cahce file for block group 
%llu, ret: %d\n",
+ block_group, ret);
+   goto out;
+   }
+   slot = path.slots[0];
+   node = path.nodes[0];
+   sc_header = btrfs_item_ptr(node, slot, struct btrfs_free_space_header);
+   if (method == CORRUPT_SC_FILE_ZERO_GEN ||
+   method == CORRUPT_SC_FILE_RAND_GEN) {
+   u64 dst_gen;
+
+   if (method == CORRUPT_SC_FILE_ZERO_GEN)
+   dst_gen = 0;
+   else
+   dst_gen = rand_u64();
+
+   btrfs_set_free_space_generation(node, sc_header, dst_gen);
+   /* Manually re-calc csum and write to disk */
+   ret = write_tree_block(NULL, tree_root, node);
+   goto out;
+   }
+
+   btrfs_free_space_key(node, sc_header, );
+   btrfs_disk_key_to_cpu(, );

Re: problems with free space cache

2016-05-16 Thread Duncan
Qu Wenruo posted on Mon, 16 May 2016 10:24:23 +0800 as excerpted:

> IIRC clear_cache option is fs level option.
> So the first mount with clear_cache, then all subvolume will have
> clear_cache.

Question:  Does clear_cache work with a read-only mount?

I could see it being like the log replay and being done even on ro 
mounts, particularly since unlike the log replay, clear_cache is a 
specific mount option so could be seen as specifically requested even if 
otherwise ro.  Or not.  So I don't know.

He mentioned root, which is normally mounted read-only first.  That's 
what got me thinking about it.  Does that complicate things?

If root is mounted ro without clear_cache by the initr*, will a 
subsequent remount,rw,clear_cache do it?

And for those like me who keep root mounted ro by default, would 
clear_cache work at all, or would it not work until such time as I 
mounted root, or some other subvolume (which I don't have here to worry 
about, but for those who do...) writable?

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html