Re: [PATCH 0/4] Add support to clear v1 free space cache for btrfs check
Also to Ivan P, the initial reporter of the problem. Now "btrfsck --clear-space-cache" should get rid of the kernel warnings on free space cache problems for you.

Thanks,
Qu

Qu Wenruo wrote on 2016/05/17 11:47 +0800:
> The patchset can also be fetched from GitHub, in case the mailing list
> blocks the last patch, which contains a binary file.
> https://github.com/adam900710/btrfs-progs.git fsck_clear_cache
>
> As described in the first patch, the btrfs kernel "clear_cache" mount
> option can't rebuild all free space cache, due to the design of the free
> space cache. (Although I quite like the idea of delaying the load of the
> free space cache, as it hugely reduces the mount time of a large fs.)
>
> This leads users to report errors on the mailing list, complaining that
> "clear_cache" doesn't get rid of the annoying kernel warnings about a
> corrupted free space cache.
>
> Since the kernel can't handle it, and users consider it an error, we'd
> better treat it as one and fix it in btrfs check.
>
> So this patchset adds the ability to clear the free space cache, and
> adds a test case for it.
>
> The clearing procedure differs from the kernel's: it removes all free
> space cache inodes and their contents (with backrefs), and sets the
> superblock's cache_generation to -1. So there will be no free space
> cache at all, and the kernel will be happy with that.
>
> The patchset also enhances btrfs_previous_item() to use min_objectid so
> it can return as early as possible.
> Lu Fengqi (1):
>   btrfs-progs: tests: add 020-bad-free-space-cache
>
> Qu Wenruo (3):
>   btrfs-progs: corrupt-block: Add ability to corrupt free space cache file
>   btrfs-progs: ctree: return earlier for btrfs_previous_item
>   btrfs-progs: fsck: Add support to clear free space cache
>
>  Documentation/btrfs-check.asciidoc                 |   8 ++
>  btrfs-corrupt-block.c                              | 124 +++++++++-
>  cmds-check.c                                       |  58 +++++-
>  ctree.c                                            |   2 +
>  free-space-cache.c                                 | 124 ++++++++++
>  free-space-cache.h                                 |   4 +
>  .../020-bad-free-space-cache/default_case.raw.xz   | Bin 0 -> 164068 bytes
>  tests/fsck-tests/020-bad-free-space-cache/test.sh  |  16 +++
>  8 files changed, 334 insertions(+), 2 deletions(-)
>  create mode 100644 tests/fsck-tests/020-bad-free-space-cache/default_case.raw.xz
>  create mode 100755 tests/fsck-tests/020-bad-free-space-cache/test.sh

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v1.1 3/4] btrfs-progs: fsck: Add support to clear free space cache
Add a new option "--clear-space-cache" for btrfs check.

Unlike what many may assume, the kernel's "clear_cache" will only rebuild
*SOME* of the free space cache, not *ALL*.

More specifically, it will only rebuild the free space cache for block
groups whose cache was read during the "clear_cache" mount. And since the
kernel does not read the free space cache at mount time, but only when it
needs to allocate space from a block group, a bad free space cache can
stay untouched for a long time.

Such a kernel design is quite good, as it dramatically reduces the mount
time for a large fs, so I prefer not to change it.

Instead, since a lot of users consider the kernel's free space warnings to
be errors, we should handle them like errors, which means using btrfsck to
fix them, even though they are harmless most of the time.

This patch adds "--clear-space-cache" to clear all free space cache and
set cache_generation to -1.

Reported-by: Ivan P
Signed-off-by: Qu Wenruo
---
v1.1:
  Fix an error where, when the free space cache inode is not found, we
  still removed one item from the tree root.
---
 Documentation/btrfs-check.asciidoc |   8 +++
 cmds-check.c                       |  58 +++++++-
 free-space-cache.c                 | 126 +++++++++++++
 free-space-cache.h                 |   4 ++
 4 files changed, 195 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-check.asciidoc b/Documentation/btrfs-check.asciidoc
index 74a2ad2..ea25582 100644
--- a/Documentation/btrfs-check.asciidoc
+++ b/Documentation/btrfs-check.asciidoc
@@ -78,6 +78,14 @@ respective superblock offset is within the device size
 This can be used to use a different starting point if some of the primary
 superblock is damaged.
 
+--clear-space-cache::
+clear all free space cache
++
+NOTE:
+Kernel mount option 'clear_cache' is only designed to rebuild free space cache
+which is modified during the lifetime of that mount option.
+It doesn't rebuild all free space cache, nor clear them out.
+
 DANGEROUS OPTIONS
 -----------------

diff --git a/cmds-check.c b/cmds-check.c
index ec0bbfd..1f6aefb 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -9524,10 +9524,41 @@ const char * const cmd_check_usage[] = {
 	"print subvolume extents and sharing state",
 	"-r|--tree-root use the given bytenr for the tree root",
 	"--chunk-root use the given bytenr for the chunk tree root",
+	"--clear-space-cache clear all free space cache(v1)",
 	"-p|--progress indicate progress",
 	NULL
 };
 
+static int clear_free_space_cache(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_block_group_cache *bg_cache;
+	u64 current = 0;
+	int ret = 0;
+
+	/* Clear all free space cache inodes and its extent data */
+	while (1) {
+		bg_cache = btrfs_lookup_first_block_group(fs_info, current);
+		if (!bg_cache)
+			break;
+		ret = btrfs_clear_free_space_cache(fs_info, bg_cache);
+		if (ret < 0)
+			return ret;
+		current = bg_cache->key.objectid + bg_cache->key.offset;
+	}
+
+	/* Don't forget to set cache_generation to -1 */
+	trans = btrfs_start_transaction(fs_info->tree_root, 0);
+	if (IS_ERR(trans)) {
+		error("failed to update super block cache generation");
+		return PTR_ERR(trans);
+	}
+	btrfs_set_super_cache_generation(fs_info->super_copy, (u64)-1);
+	btrfs_commit_transaction(trans, fs_info->tree_root);
+
+	return ret;
+}
+
 int cmd_check(int argc, char **argv)
 {
 	struct cache_tree root_cache;
@@ -9543,13 +9574,15 @@ int cmd_check(int argc, char **argv)
 	int init_csum_tree = 0;
 	int readonly = 0;
 	int qgroup_report = 0;
+	int clear_space_cache = 0;
 	enum btrfs_open_ctree_flags ctree_flags = OPEN_CTREE_EXCLUSIVE;
 
 	while(1) {
 		int c;
 		enum { GETOPT_VAL_REPAIR = 257, GETOPT_VAL_INIT_CSUM,
 			GETOPT_VAL_INIT_EXTENT, GETOPT_VAL_CHECK_CSUM,
-			GETOPT_VAL_READONLY, GETOPT_VAL_CHUNK_TREE };
+			GETOPT_VAL_READONLY, GETOPT_VAL_CHUNK_TREE,
+			GETOPT_VAL_CLEAR_SPACE_CACHE};
 		static const struct option long_options[] = {
 			{ "super", required_argument, NULL, 's' },
 			{ "repair", no_argument, NULL, GETOPT_VAL_REPAIR },
@@ -9566,6 +9599,8 @@ int cmd_check(int argc, char **argv)
 			{ "tree-root", required_argument, NULL, 'r' },
 			{ "chunk-root", required_argument, NULL,
 				GETOPT_VAL_CHUNK_TREE },
+			{ "clear-space-cache", no_argument, NULL,
+
[PATCH 2/4] btrfs-progs: ctree: return earlier for btrfs_previous_item
btrfs_previous_item() has a parameter specifying the minimal objectid to
return, but surprisingly it doesn't use the parameter at all.

Although the result is still correct, the search could take very long on a
large tree, so return earlier.

Signed-off-by: Qu Wenruo
---
 ctree.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/ctree.c b/ctree.c
index 079696e..3a9f417 100644
--- a/ctree.c
+++ b/ctree.c
@@ -2894,6 +2894,8 @@ int btrfs_previous_item(struct btrfs_root *root,
 		btrfs_item_key_to_cpu(leaf, &found_key, path->slots[0]);
 		if (found_key.type == type)
 			return 0;
+		if (found_key.objectid < min_objectid)
+			break;
 	}
 	return 1;
 }
-- 
2.8.2
[PATCH 4/4] btrfs-progs: tests: add 020-bad-free-space-cache
From: Lu Fengqi

Test that fsck can detect and clear a free space cache error.

Signed-off-by: Lu Fengqi
---
 .../020-bad-free-space-cache/default_case.raw.xz  | Bin 0 -> 164068 bytes
 tests/fsck-tests/020-bad-free-space-cache/test.sh | 16 ++++++++++++++++
 2 files changed, 16 insertions(+)
 create mode 100644 tests/fsck-tests/020-bad-free-space-cache/default_case.raw.xz
 create mode 100755 tests/fsck-tests/020-bad-free-space-cache/test.sh

diff --git a/tests/fsck-tests/020-bad-free-space-cache/default_case.raw.xz b/tests/fsck-tests/020-bad-free-space-cache/default_case.raw.xz
new file mode 100644
index ..5e08de426c5cc5faa9ccd430432b9c37f4e072ff
GIT binary patch
literal 164068
[164068 bytes of binary patch data not reproduced here]
[PATCH 0/4] Add support to clear v1 free space cache for btrfs check
The patchset can also be fetched from GitHub, in case the mailing list
blocks the last patch, which contains a binary file.
https://github.com/adam900710/btrfs-progs.git fsck_clear_cache

As described in the first patch, the btrfs kernel "clear_cache" mount
option can't rebuild all free space cache, due to the design of the free
space cache. (Although I quite like the idea of delaying the load of the
free space cache, as it hugely reduces the mount time of a large fs.)

This leads users to report errors on the mailing list, complaining that
"clear_cache" doesn't get rid of the annoying kernel warnings about a
corrupted free space cache.

Since the kernel can't handle it, and users consider it an error, we'd
better treat it as one and fix it in btrfs check.

So this patchset adds the ability to clear the free space cache, and adds
a test case for it.

The clearing procedure differs from the kernel's: it removes all free
space cache inodes and their contents (with backrefs), and sets the
superblock's cache_generation to -1. So there will be no free space cache
at all, and the kernel will be happy with that.

The patchset also enhances btrfs_previous_item() to use min_objectid so it
can return as early as possible.
Lu Fengqi (1):
  btrfs-progs: tests: add 020-bad-free-space-cache

Qu Wenruo (3):
  btrfs-progs: corrupt-block: Add ability to corrupt free space cache file
  btrfs-progs: ctree: return earlier for btrfs_previous_item
  btrfs-progs: fsck: Add support to clear free space cache

 Documentation/btrfs-check.asciidoc                 |   8 ++
 btrfs-corrupt-block.c                              | 124 +++++++++-
 cmds-check.c                                       |  58 +++++-
 ctree.c                                            |   2 +
 free-space-cache.c                                 | 124 ++++++++++
 free-space-cache.h                                 |   4 +
 .../020-bad-free-space-cache/default_case.raw.xz   | Bin 0 -> 164068 bytes
 tests/fsck-tests/020-bad-free-space-cache/test.sh  |  16 +++
 8 files changed, 334 insertions(+), 2 deletions(-)
 create mode 100644 tests/fsck-tests/020-bad-free-space-cache/default_case.raw.xz
 create mode 100755 tests/fsck-tests/020-bad-free-space-cache/test.sh

-- 
2.8.2
[PATCH 1/4] btrfs-progs: corrupt-block: Add ability to corrupt free space cache file
One user on the mailing list reported a free space cache error which the
clear_cache mount option fails to fix.

To reproduce the problem, add support for corrupting the free space cache
file (the old free space cache implementation, space_cache=v1).

This is not suitable for a btrfs-progs or xfstests test case, as btrfsck
won't treat it as an error.

With this patch, we can reproduce the problem quite easily:
==
$ fallocate test.img -l 2G
$ mkfs.btrfs test.img
$ sudo mount test.img /mnt/btrfs/
$ sudo xfs_io -f -c "pwrite 0 16M" /mnt/btrfs/tmp
$ sudo umount /mnt/btrfs
$ ./btrfs-corrupt-block -s content -l 136708096 ../test.img

Here 136708096 is the second data chunk bytenr. (The first data chunk is
the 8M one from mkfs, which doesn't have a free space cache.)

$ ./btrfsck test.img
Checking filesystem on ../test.img
UUID: 9542051b-8150-42e5-a9e6-95829656485c
checking extents
checking free space cache
btrfs: csum mismatch on free space cache    <<< Found space cache problem
failed to load free space cache for block group 136708096
checking fs roots
checking csums
checking root refs
found 16941058 bytes used err is 0
total csum bytes: 16384
total tree bytes: 163840
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 139567
file data blocks allocated: 16908288
 referenced 16908288

$ sudo mount -o clear_cache test.img /mnt/btrfs
$ sudo umount /mnt/btrfs
$ ./btrfsck test.img
Checking filesystem on ../test.img
UUID: 9542051b-8150-42e5-a9e6-95829656485c
checking extents
checking free space cache
btrfs: csum mismatch on free space cache    <<< Problem not fixed by kernel
failed to load free space cache for block group 136708096
checking fs roots
checking csums
checking root refs
found 16941058 bytes used err is 0
total csum bytes: 16384
total tree bytes: 163840
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 139567
file data blocks allocated: 16908288
 referenced 16908288
==

The btrfs-progs fix will follow soon.
Reported-by: Ivan P
Signed-off-by: Qu Wenruo
---
 btrfs-corrupt-block.c | 124 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 123 insertions(+), 1 deletion(-)

diff --git a/btrfs-corrupt-block.c b/btrfs-corrupt-block.c
index d331f96..eb27265 100644
--- a/btrfs-corrupt-block.c
+++ b/btrfs-corrupt-block.c
@@ -115,6 +115,9 @@ static void print_usage(int ret)
 	fprintf(stderr, "\t-C Delete a csum for the specified bytenr. When "
 		"used with -b it'll delete that many bytes, otherwise it's "
 		"just sectorsize\n");
+	fprintf(stderr, "\t-s \n");
+	fprintf(stderr, "\t Corrupt free space cache file(space_cache v1), must also specify -l for blockgroup bytenr\n");
+	fprintf(stderr, "\tcan be 'zero_gen','rand_gen' or 'content'\n");
 	exit(ret);
 }
 
@@ -1012,6 +1015,100 @@ out:
 	return ret;
 }
 
+
+u64 rand_u64(void)
+{
+	u64 ret = 0;
+	int n;
+
+	for (n = 0; n < sizeof(ret) / sizeof(RAND_MAX); n++)
+		ret += rand() << (n * sizeof(RAND_MAX) * 8);
+	return ret;
+}
+
+#define CORRUPT_SC_FILE_ZERO_GEN	1
+#define CORRUPT_SC_FILE_RAND_GEN	2
+#define CORRUPT_SC_FILE_CONTENT		3
+int corrupt_space_cache_file(struct btrfs_fs_info *fs_info, u64 block_group,
+			     int method)
+{
+	struct btrfs_root *tree_root = fs_info->tree_root;
+	struct btrfs_path path;
+	struct btrfs_key key;
+	struct btrfs_disk_key location;
+	struct btrfs_free_space_header *sc_header;
+	struct btrfs_file_extent_item *fi;
+	struct extent_buffer *node;
+	struct extent_buffer *corrupted_node;
+	u64 disk_bytenr;
+	int slot;
+	int ret;
+
+	key.objectid = BTRFS_FREE_SPACE_OBJECTID;
+	key.type = 0;
+	key.offset = block_group;
+
+	btrfs_init_path(&path);
+
+	/* Don't start trans, as this will cause generation different */
+	ret = btrfs_search_slot(NULL, tree_root, &key, &path, 0, 0);
+	if (ret) {
+		error("failed to find free space cache file for block group %llu, ret: %d\n",
+		      block_group, ret);
+		goto out;
+	}
+	slot = path.slots[0];
+	node = path.nodes[0];
+	sc_header = btrfs_item_ptr(node, slot, struct btrfs_free_space_header);
+	if (method == CORRUPT_SC_FILE_ZERO_GEN ||
+	    method == CORRUPT_SC_FILE_RAND_GEN) {
+		u64 dst_gen;
+
+		if (method == CORRUPT_SC_FILE_ZERO_GEN)
+			dst_gen = 0;
+		else
+			dst_gen = rand_u64();
+
+		btrfs_set_free_space_generation(node, sc_header, dst_gen);
+		/* Manually re-calc csum and write to disk */
+		ret = write_tree_block(NULL, tree_root, node);
+		goto out;
+	}
+
+	btrfs_free_space_key(node, sc_header, &location);
+	btrfs_disk_key_to_cpu(&key, &location);
[PATCH 3/4] btrfs-progs: fsck: Add support to clear free space cache
Add a new option "--clear-space-cache" for btrfs check.

Unlike what many may assume, the kernel's "clear_cache" will only rebuild
*SOME* of the free space cache, not *ALL*.

More specifically, it will only rebuild the free space cache for block
groups whose cache was read during the "clear_cache" mount. And since the
kernel does not read the free space cache at mount time, but only when it
needs to allocate space from a block group, a bad free space cache can
stay untouched for a long time.

Such a kernel design is quite good, as it dramatically reduces the mount
time for a large fs, so I prefer not to change it.

Instead, since a lot of users consider the kernel's free space warnings to
be errors, we should handle them like errors, which means using btrfsck to
fix them, even though they are harmless most of the time.

This patch adds "--clear-space-cache" to clear all free space cache and
set cache_generation to -1.

Reported-by: Ivan P
Signed-off-by: Qu Wenruo
---
 Documentation/btrfs-check.asciidoc |   8 +++
 cmds-check.c                       |  58 +++++++-
 free-space-cache.c                 | 124 ++++++++++++
 free-space-cache.h                 |   4 ++
 4 files changed, 193 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-check.asciidoc b/Documentation/btrfs-check.asciidoc
index 74a2ad2..ea25582 100644
--- a/Documentation/btrfs-check.asciidoc
+++ b/Documentation/btrfs-check.asciidoc
@@ -78,6 +78,14 @@ respective superblock offset is within the device size
 This can be used to use a different starting point if some of the primary
 superblock is damaged.
 
+--clear-space-cache::
+clear all free space cache
++
+NOTE:
+Kernel mount option 'clear_cache' is only designed to rebuild free space cache
+which is modified during the lifetime of that mount option.
+It doesn't rebuild all free space cache, nor clear them out.
+
 DANGEROUS OPTIONS
 -----------------

diff --git a/cmds-check.c b/cmds-check.c
index ec0bbfd..1f6aefb 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -9524,10 +9524,41 @@ const char * const cmd_check_usage[] = {
 	"print subvolume extents and sharing state",
 	"-r|--tree-root use the given bytenr for the tree root",
 	"--chunk-root use the given bytenr for the chunk tree root",
+	"--clear-space-cache clear all free space cache(v1)",
 	"-p|--progress indicate progress",
 	NULL
 };
 
+static int clear_free_space_cache(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_trans_handle *trans;
+	struct btrfs_block_group_cache *bg_cache;
+	u64 current = 0;
+	int ret = 0;
+
+	/* Clear all free space cache inodes and its extent data */
+	while (1) {
+		bg_cache = btrfs_lookup_first_block_group(fs_info, current);
+		if (!bg_cache)
+			break;
+		ret = btrfs_clear_free_space_cache(fs_info, bg_cache);
+		if (ret < 0)
+			return ret;
+		current = bg_cache->key.objectid + bg_cache->key.offset;
+	}
+
+	/* Don't forget to set cache_generation to -1 */
+	trans = btrfs_start_transaction(fs_info->tree_root, 0);
+	if (IS_ERR(trans)) {
+		error("failed to update super block cache generation");
+		return PTR_ERR(trans);
+	}
+	btrfs_set_super_cache_generation(fs_info->super_copy, (u64)-1);
+	btrfs_commit_transaction(trans, fs_info->tree_root);
+
+	return ret;
+}
+
 int cmd_check(int argc, char **argv)
 {
 	struct cache_tree root_cache;
@@ -9543,13 +9574,15 @@ int cmd_check(int argc, char **argv)
 	int init_csum_tree = 0;
 	int readonly = 0;
 	int qgroup_report = 0;
+	int clear_space_cache = 0;
 	enum btrfs_open_ctree_flags ctree_flags = OPEN_CTREE_EXCLUSIVE;
 
 	while(1) {
 		int c;
 		enum { GETOPT_VAL_REPAIR = 257, GETOPT_VAL_INIT_CSUM,
 			GETOPT_VAL_INIT_EXTENT, GETOPT_VAL_CHECK_CSUM,
-			GETOPT_VAL_READONLY, GETOPT_VAL_CHUNK_TREE };
+			GETOPT_VAL_READONLY, GETOPT_VAL_CHUNK_TREE,
+			GETOPT_VAL_CLEAR_SPACE_CACHE};
 		static const struct option long_options[] = {
 			{ "super", required_argument, NULL, 's' },
 			{ "repair", no_argument, NULL, GETOPT_VAL_REPAIR },
@@ -9566,6 +9599,8 @@ int cmd_check(int argc, char **argv)
 			{ "tree-root", required_argument, NULL, 'r' },
 			{ "chunk-root", required_argument, NULL,
 				GETOPT_VAL_CHUNK_TREE },
+			{ "clear-space-cache", no_argument, NULL,
+				GETOPT_VAL_CLEAR_SPACE_CACHE},
 			{ "progress", no_argument, NULL, 'p' },
Re: BTRFS Data at Rest File Corruption
On Mon, May 16, 2016 at 5:44 PM, Richard A. Lochner wrote:
> Chris,
>
> It has actually happened to me three times that I know of in ~7mos.,
> but your point about the "larger footprint" for data corruption is a
> good one. No doubt I have silently experienced that too.

I dunno, three is a lot to have the exact same corruption only in memory,
then written out into two copies with valid node checksums; and yet not
have other problems, like a node item, or uuid, or xattr or any number of
other item or object types, all of which get checksummed. I suppose if the
file system contains large files, the % of metadata that's csums could be
the 2nd largest footprint. But still. Three times in 7 months, if it's
really the same vector, is just short of almost reproducible. Ha.

It seems like if you merely balanced this file system a few times, you'd
eventually stumble on this. And if that's true, then it's time for debug
options to see if it can be caught in action, and whether there's a
hardware or software explanation for it.

> And, as you suggest, there is no way to prevent those errors. If the
> memory to be written to disk gets corrupted before its checksum is
> calculated, the data will be silently corrupted, period.

Well, no way in the present design, maybe.

> Clearly, I won't rely on this machine to produce any data directly that
> I would consider important at this point.
>
> One odd thing to me is that if this is really due to undetected memory
> errors, I'd think this system would crash fairly often due to detected
> "parity errors." This system rarely crashes. It often runs for
> several months without an indication of problems.

I think you'd have other problems. Only data csums are being corrupted
after they're read in, but before the node csum is computed? Three times?
Pretty wonky.
--
Chris Murphy
Scrub status regression (repost kinda)
Hi All,

I'm still getting this regression on scrub. Scrub status no longer reports
the progress of the scrub. It always used to. Anyone else experiencing the
same regression?

The only quirky part of my setup which may throw btrfs off is that I
create and mount the first subvol as the root of the install. To repeat:
it is only recently (3m) that it has stopped reporting. The scrub itself
proceeds normally, and using '-B' returns the usual and complete output.

# btrfs scrub start /
scrub started on /, fsid fdd6a335-6edf-4a74-a909-03c8102cc8f5 (pid=4111)

# btrfs scrub status /
scrub status for fdd6a335-6edf-4a74-a909-03c8102cc8f5
	no stats available
	total bytes scrubbed: 0.00B with 0 errors
Re: [PULL] Btrfs for 4.7
On Mon, May 16, 2016 at 04:14:23PM +0200, David Sterba wrote:
> Hi,
>
> please queue the following branch to 4.7.
>
> New features or user-visible changes:
> * device delete by id, a v2 ioctl for device deletion (this was held
>   back from 4.6 pull due to possibly related crashes that haven't
>   appeared for a long time)
> * DUP allowed on multiple-device filesystem (help in case we want to go
>   to SINGLE from RAID1/.. but do not want to lose all redundancy while
>   converting)
> * GETFLAGS/SETFLAGS/GETVERSION ioctls work on 32bit
>
> Developer-visible changes:
> * definitions that are shared with userspace are moved to the uapi
>   directory
>
> Other:
> * assorted bugfixes and cleanups
>
> The branch is merged from several topic branches, grouped by
> functionality or purpose. The patches have been in for-next, most of
> them for a few weeks at least, tested together with Filipe's branch.

Great, thanks Dave! I've got things merged up and I'm running more tests.

-chris
Re: BTRFS Data at Rest File Corruption
Chris,

It has actually happened to me three times that I know of in ~7mos., but
your point about the "larger footprint" for data corruption is a good one.
No doubt I have silently experienced that too.

And, as you suggest, there is no way to prevent those errors. If the
memory to be written to disk gets corrupted before its checksum is
calculated, the data will be silently corrupted, period.

Clearly, I won't rely on this machine to produce any data directly that I
would consider important at this point.

One odd thing to me is that if this is really due to undetected memory
errors, I'd think this system would crash fairly often due to detected
"parity errors." This system rarely crashes. It often runs for several
months without an indication of problems.

Rick Lochner

On Mon, 2016-05-16 at 16:43 -0600, Chris Murphy wrote:
> On Mon, May 16, 2016 at 5:33 AM, Austin S. Hemmelgarn wrote:
> >
> > I would think this would be perfectly possible if some other file
> > that had a checksum in that node changed, thus forcing the node's
> > checksum to be updated. Theoretical sequence of events:
> > 1. Some file which has a checksum in node A gets written to.
> > 2. Node A is loaded into memory to update the checksum.
> > 3. The new checksum for the changed extent in the file gets updated
> >    in the in-memory copy of node A.
> > 4. Node A has it's own checksum recomputed based on the new data,
> >    and then gets saved to disk.
> > If something happened after 2 but before 4 that caused one of the
> > other checksums to go bad, then the checksum computed in 4 will have
> > been with the corrupted data.
>
> I'm pretty sure Qu had a suggestion that would mitigate this sort of
> problem, where there'd be a CRC32C checksum for each data extent (?)
> something like that anyway. There's enough room to stuff in more than
> just a checksum per 4096 byte block. That way there's three checks,
> and thus there's a way to break a tie.
> But this has now happened to Richard twice. What are the chances of
> this manifesting exactly the same way a second time? If the chance of
> corruption is equal, I'd think the much much larger footprint for
> in-memory corruption is data itself. Problem is, if the corruption
> happens before the checksum is computed, the checksum would say the
> data is valid. So the only way to test this would be passing all files
> from this volume and a reference volume through a hash function and
> comparing hashes, e.g. the rsync -c option.
Re: corrupt node with extent csum item results in bogus data being read
I realized this could be a much bigger problem for the single-copy
metadata on SSD case, rather than the example using raid1, where it's
unlikely for both copies of the node to be corrupt. So I filed a bug.

https://bugzilla.kernel.org/show_bug.cgi?id=118321
Re: BTRFS Data at Rest File Corruption
On Mon, May 16, 2016 at 5:33 AM, Austin S. Hemmelgarn wrote:
>
> I would think this would be perfectly possible if some other file that
> had a checksum in that node changed, thus forcing the node's checksum
> to be updated. Theoretical sequence of events:
> 1. Some file which has a checksum in node A gets written to.
> 2. Node A is loaded into memory to update the checksum.
> 3. The new checksum for the changed extent in the file gets updated in
>    the in-memory copy of node A.
> 4. Node A has it's own checksum recomputed based on the new data, and
>    then gets saved to disk.
> If something happened after 2 but before 4 that caused one of the other
> checksums to go bad, then the checksum computed in 4 will have been
> with the corrupted data.

I'm pretty sure Qu had a suggestion that would mitigate this sort of
problem, where there'd be a CRC32C checksum for each data extent (?)
something like that anyway. There's enough room to stuff in more than just
a checksum per 4096 byte block. That way there's three checks, and thus
there's a way to break a tie.

But this has now happened to Richard twice. What are the chances of this
manifesting exactly the same way a second time? If the chance of
corruption is equal, I'd think the much much larger footprint for
in-memory corruption is data itself. Problem is, if the corruption happens
before the checksum is computed, the checksum would say the data is valid.
So the only way to test this would be passing all files from this volume
and a reference volume through a hash function and comparing hashes, e.g.
the rsync -c option.

--
Chris Murphy
Re: BTRFS Data at Rest File Corruption
Chris/Austin, Thank you both for your help. The sequence of events described by Austin is the only sequence that seems to be plausible, given what I have been seen in the data (other than an outright bug which I think extremely unlikely). I will be moving these drives soon to a new system with ECC memory. I will definitely let you both know if I encounter this problem again after that. I do not expect to. If I was really adventurous, I would modify the code to attempt to detect this and run the patched version on my system to see if it is possible to detect (and maybe even correct) it as it happens. Unfortunately, that does not appear to be a trivial exercise. Rick Lochner On Mon, 2016-05-16 at 07:33 -0400, Austin S. Hemmelgarn wrote: > On 2016-05-16 02:07, Chris Murphy wrote: > > > > Current hypothesis > > "I suspected, and I still suspect that the error occurred upon a > > metadata update that corrupted the checksum for the file, probably > > due > > to silent memory corruption. If the checksum was silently > > corrupted, > > it would be simply written to both drives causing this type of > > error." > > > > A metadata update alone will not change the data checksums. > > > > But let's ignore that. If there's corrupt extent csum in a node > > that > > itself has a valid csum, this is functionally identical to e.g. > > nerfing 100 bytes of a file's extent data (both copies, > > identically). > > The fs doesn't know the difference. All it knows is the node csum > > is > > valid, therefore the data extent csum is valid, and that's why it > > assumes the data is wrong and hence you get an I/O error. And I can > > reproduce most of your results by nerfing file data. 
> > > > The entire dmesg for scrub looks like this: > > > > > > May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm- > > 6): > > checksum error at logical 5566889984 on dev /dev/dm-6, sector > > 8540160, > > root 5, inode 258, offset 0, length 4096, links 1 (path: > > openSUSE-Tumbleweed-NET-x86_64-Current.iso) > > May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6): > > bdev /dev/dm-6 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 > > May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6): > > unable to fixup (regular) error at logical 5566889984 on dev > > /dev/dm-6 > > May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm- > > 6): > > checksum error at logical 5566889984 on dev /dev/mapper/VG-b1, > > sector > > 8579072, root 5, inode 258, offset 0, length 4096, links 1 (path: > > openSUSE-Tumbleweed-NET-x86_64-Current.iso) > > May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6): > > bdev /dev/mapper/VG-b1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 > > May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6): > > unable to fixup (regular) error at logical 5566889984 on dev > > /dev/mapper/VG-b1 > > > > And the entire dmesg for running sha256sum on the file is > > > > May 15 23:33:41 f23s.localdomain kernel: __readpage_endio_check: 22 > > callbacks suppressed > > May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm- > > 6): > > csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141 > > May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm- > > 6): > > csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141 > > May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm- > > 6): > > csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141 > > May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm- > > 6): > > csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141 > > May 15 23:33:41 f23s.localdomain 
kernel: BTRFS warning (device dm- > > 6): > > csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141 > > > > > > And I do get an i/o error for sha256sum and no hash is computed. > > > > But there are two important differences: > > 1. I have two unable to fixup messages, one for each device, at the > > exact same time. > > 2. I altered both copies of extent data. > > > > It's a mystery to me how your file data has not changed, but > > somehow > > the extent csum was changed but also the node csum was recomputed > > correctly. That's a bit odd. > I would think this would be perfectly possible if some other file > that > had a checksum in that node changed, thus forcing the node's checksum > to > be updated. Theoretical sequence of events: > 1. Some file which has a checksum in node A gets written to. > 2. Node A is loaded into memory to update the checksum. > 3. The new checksum for the changed extent in the file gets updated > in > the in-memory copy of node A. > 4. Node A has its own checksum recomputed based on the new data, > and > then gets saved to disk. > If something happened after 2 but before 4 that caused one of the > other > checksums to go bad, then the checksum computed in 4 will have been > with > the corrupted data.
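Austin's four-step sequence can be sketched in a few lines of C. Everything here is a made-up stand-in for illustration (toy_csum, struct toy_node and the field layout are not btrfs's real crc32c or node format); it only shows why the node validates while the extent csum it carries is wrong:

```c
#include <stdint.h>
#include <stddef.h>

/* Toy stand-in for the node/data checksum; the real kernel uses crc32c. */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
	uint32_t c = 0;
	for (size_t i = 0; i < len; i++)
		c = c * 31 + buf[i];
	return c;
}

/* A "node": its own csum followed by an array of data-extent csums. */
struct toy_node {
	uint32_t node_csum;
	uint32_t extent_csum[4];
};

/* Step 4: the node csum covers everything after the csum field itself. */
static void node_seal(struct toy_node *n)
{
	n->node_csum = toy_csum((const uint8_t *)n->extent_csum,
				sizeof(n->extent_csum));
}

static int node_csum_valid(const struct toy_node *n)
{
	return n->node_csum == toy_csum((const uint8_t *)n->extent_csum,
					sizeof(n->extent_csum));
}

/* Corrupt one extent csum in memory *between* steps 3 and 4, then seal.
 * The node validates, but the extent csum no longer matches its data. */
static int simulate_silent_corruption(void)
{
	uint8_t data[16] = "hello extent";
	struct toy_node n = {0};

	n.extent_csum[0] = toy_csum(data, sizeof(data)); /* step 3 */
	n.extent_csum[0] ^= 0x4;                         /* bit flip in RAM */
	node_seal(&n);                                   /* step 4 */

	/* Node looks fine, so the fs trusts the (bad) extent csum. */
	return node_csum_valid(&n) &&
	       n.extent_csum[0] != toy_csum(data, sizeof(data));
}
```

This is exactly the scrub symptom above: the node csum verifies, so scrub blames the data and reports "unable to fixup" on both mirrors.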
[PATCH] Btrfs: fix handling of faults from btrfs_copy_from_user
Hi everyone, I've tried to cc most of the recent reports of this warning. It was pretty hard to track down, but I've got a test program for it now. My goal is to change xfs_io to add the madvise loop under --do-horrible-things, instead of adding yet another special doio.c function to xfstests. I've run this through both my test case, and Dave's trinity code. And now for the patch: When btrfs_copy_from_user isn't able to copy all of the pages, we need to adjust our accounting to reflect the work that was actually done. Commit 2e78c927d79 changed around the decisions a little and we ended up skipping the accounting adjustments some of the time. This commit makes sure that when we don't copy anything at all, we still hop into the adjustments, and switches to release_bytes instead of write_bytes, since write_bytes isn't aligned. The accounting errors led to warnings during btrfs_destroy_inode: [ 70.847532] WARNING: CPU: 10 PID: 514 at fs/btrfs/inode.c:9350 btrfs_destroy_inode+0x2b3/0x2c0 [ 70.847536] Modules linked in: i2c_piix4 virtio_net i2c_core input_leds button led_class serio_raw acpi_cpufreq sch_fq_codel autofs4 virtio_blk [ 70.847538] CPU: 10 PID: 514 Comm: umount Tainted: GW 4.6.0-rc6_00062_g2997da1-dirty #23 [ 70.847539] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.0-1.fc24 04/01/2014 [ 70.847542] 880ff5cafab8 8149d5e9 0202 [ 70.847543] 880ff5cafb08 [ 70.847547] 8107bdfd 880ff5cafaf8 24868120013d 880ff5cafb28 [ 70.847547] Call Trace: [ 70.847550] [] dump_stack+0x51/0x78 [ 70.847551] [] __warn+0xfd/0x120 [ 70.847553] [] warn_slowpath_null+0x1d/0x20 [ 70.847555] [] btrfs_destroy_inode+0x2b3/0x2c0 [ 70.847556] [] ? __destroy_inode+0x71/0x140 [ 70.847558] [] destroy_inode+0x43/0x70 [ 70.847559] [] ? wake_up_bit+0x2f/0x40 [ 70.847560] [] evict+0x148/0x1d0 [ 70.847562] [] ? start_transaction+0x3de/0x460 [ 70.847564] [] dispose_list+0x59/0x80 [ 70.847565] [] evict_inodes+0x180/0x190 [ 70.847566] [] ? 
__sync_filesystem+0x3f/0x50
[ 70.847568] [] generic_shutdown_super+0x48/0x100
[ 70.847569] [] ? woken_wake_function+0x20/0x20
[ 70.847571] [] kill_anon_super+0x16/0x30
[ 70.847573] [] btrfs_kill_super+0x1e/0x130
[ 70.847574] [] deactivate_locked_super+0x4e/0x90
[ 70.847576] [] deactivate_super+0x51/0x70
[ 70.847577] [] cleanup_mnt+0x3f/0x80
[ 70.847579] [] __cleanup_mnt+0x12/0x20
[ 70.847581] [] task_work_run+0x68/0xa0
[ 70.847582] [] exit_to_usermode_loop+0xd6/0xe0
[ 70.847583] [] do_syscall_64+0xbd/0x170
[ 70.847586] [] entry_SYSCALL64_slow_path+0x25/0x25

This is the test program I used to force short returns from btrfs_copy_from_user:

void *dontneed(void *arg)
{
	char *p = arg;
	int ret;

	while (1) {
		ret = madvise(p, BUFSIZE/4, MADV_DONTNEED);
		if (ret) {
			perror("madvise");
			exit(1);
		}
	}
}

int main(int ac, char **av)
{
	int ret;
	int fd;
	char *filename;
	unsigned long offset;
	char *buf;
	int i;
	pthread_t tid;

	if (ac != 2) {
		fprintf(stderr, "usage: dammitdave filename\n");
		exit(1);
	}

	buf = mmap(NULL, BUFSIZE, PROT_READ|PROT_WRITE,
		   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	memset(buf, 'a', BUFSIZE);
	filename = av[1];

	ret = pthread_create(&tid, NULL, dontneed, buf);
	if (ret) {
		fprintf(stderr, "error %d from pthread_create\n", ret);
		exit(1);
	}
	ret = pthread_detach(tid);
	if (ret) {
		fprintf(stderr, "pthread detach failed %d\n", ret);
		exit(1);
	}

	while (1) {
		fd = open(filename, O_RDWR | O_CREAT, 0600);
		if (fd < 0) {
			perror("open");
			exit(1);
		}
		for (i = 0; i < ROUNDS; i++) {
			int this_write = BUFSIZE;

			offset = rand() % MAXSIZE;
			ret = pwrite(fd, buf, this_write, offset);
			if (ret < 0) {
				perror("pwrite");
				exit(1);
			} else if (ret != this_write) {
				fprintf(stderr, "short write to %s offset %lu ret %d\n",
					filename, offset, ret);
				exit(1);
			}
			if (i == ROUNDS - 1) {
				ret = sync_file_range(fd, offset, 4096,
Re: [PATCH 3/7] Btrfs: check if extent buffer is aligned to sectorsize
On Sat, May 14, 2016 at 06:30:52PM +0800, Qu Wenruo wrote: > Hi Liu, > > Thanks for your patch first. > > On 05/14/2016 08:06 AM, Liu Bo wrote: > > Thanks to fuzz testing, we can pass an invalid bytenr to extent buffer > > via alloc_extent_buffer(). An unaligned eb can have more pages than it > > should have, which ends up in an extent buffer leak or corrupted content > > in the extent buffer. > > > > This adds a warning to let us quickly know what was happening. > > > > Signed-off-by: Liu Bo> > --- > > fs/btrfs/extent_io.c | 4 > > 1 file changed, 4 insertions(+) > > > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > > index d247fc0..e601e0f 100644 > > --- a/fs/btrfs/extent_io.c > > +++ b/fs/btrfs/extent_io.c > > @@ -4868,6 +4868,10 @@ struct extent_buffer *alloc_extent_buffer(struct > > btrfs_fs_info *fs_info, > > int uptodate = 1; > > int ret; > > > > + WARN_ONCE(!IS_ALIGNED(start, fs_info->tree_root->sectorsize), > > + KERN_WARNING "eb->start(%llu) is not aligned to > > root->sectorsize(%u)\n", > > + start, fs_info->tree_root->sectorsize); > > + > > IMHO this is quite a big problem, as almost all other things rely on the > assumption that extent buffers are at least sectorsize aligned. > It won't cause too much trouble as reading eb's page can prevent btrfs using this eb. > What about warning and returning NULL? A WARN_ONCE() alone won't inform the user > quickly enough. I'm OK with warning, but I just realized that the warning doesn't show which filesystem has problems, so btrfs_crit and -EINVAL is preferable. > > BTW, after a quick glance into __alloc_extent_buffer(), it seems that we > didn't check the return pointer of kmem_cache_zalloc(), since you're fixing > things around that code, would you mind fixing it too? It's not necessary to do that since it's using __GFP_NOFAIL. 
Thanks, -liubo > > Thanks, > Qu > > > eb = find_extent_buffer(fs_info, start); > > if (eb) > > return eb; > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
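The "more pages than it should have" effect is easy to demonstrate with the page-index arithmetic. This is a toy model, not the kernel's code (TOY_PAGE_SIZE and eb_num_pages are invented names, though TOY_IS_ALIGNED performs the same power-of-two test as the kernel's IS_ALIGNED() macro): a 16KiB extent buffer at an aligned start covers exactly 4 pages, while the same buffer shifted by one byte spills into a fifth.

```c
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL
/* Same test the kernel's IS_ALIGNED() does for power-of-two alignments. */
#define TOY_IS_ALIGNED(x, a) (((x) & ((a) - 1)) == 0)

/* Number of pages an extent buffer at 'start' spanning 'len' bytes
 * touches: the index of the last byte's page minus the first's, plus one. */
static unsigned long eb_num_pages(uint64_t start, uint64_t len)
{
	uint64_t first = start / TOY_PAGE_SIZE;
	uint64_t last = (start + len - 1) / TOY_PAGE_SIZE;

	return (unsigned long)(last - first + 1);
}
```

With a 16KiB nodesize and 4KiB pages, eb_num_pages(16384, 16384) is 4 but eb_num_pages(16385, 16384) is 5 — the extra page is what leads to the leak or corruption described in the commit message.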
Re: [PATCH 5/7] Btrfs: replace BUG_ON with WARN in merge_bio
On Mon, May 16, 2016 at 10:44:38AM +0200, David Sterba wrote: > On Fri, May 13, 2016 at 05:07:00PM -0700, Liu Bo wrote: > > We have two BUG_ONs in merge_bio, but since it can gracefully return errors > > to callers, use WARN_ONCE to give error information and don't leave a > > possible panic. > > > > Signed-off-by: Liu Bo> > --- > > fs/btrfs/extent_io.c | 1 - > > fs/btrfs/inode.c | 6 -- > > 2 files changed, 4 insertions(+), 3 deletions(-) > > > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > > index e601e0f..99286d1 100644 > > --- a/fs/btrfs/extent_io.c > > +++ b/fs/btrfs/extent_io.c > > @@ -2746,7 +2746,6 @@ static int merge_bio(int rw, struct extent_io_tree > > *tree, struct page *page, > > if (tree->ops && tree->ops->merge_bio_hook) > > ret = tree->ops->merge_bio_hook(rw, page, offset, size, bio, > > bio_flags); > > - BUG_ON(ret < 0); > > return ret; > > > > } > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > > index 5874562..3a989e3 100644 > > --- a/fs/btrfs/inode.c > > +++ b/fs/btrfs/inode.c > > @@ -1827,8 +1827,10 @@ int btrfs_merge_bio_hook(int rw, struct page *page, > > unsigned long offset, > > map_length = length; > > ret = btrfs_map_block(root->fs_info, rw, logical, > > &map_length, NULL, 0); > > - /* Will always return 0 with map_multi == NULL */ > > - BUG_ON(ret < 0); > > + if (ret < 0) { > > + WARN_ONCE(ret < 0, KERN_ERR "ret = %d\n", ret); > > btrfs_map_block is quite verbose about the errors so it's not needed to > print it here. OK. > > Otherwise I'm not sure if all paths that go through the merge hook > handle errors, eg. in btrfs_submit_compressed_read or _write. Some code > is skipped if merge hook returns nonzero. But, the code expects either 0 > or 1, and when the BUG_ON(ret < 0) is removed suddenly the 'ret < 0' can > be returned. Unexpected. Right now btrfs_merge_bio_hook() only returns 1 or 0 and marks (ret < 0) as BUG_ON. But compress code is ready to handle the error, btrfs_submit_compressed_read/write() { ... 
ret = merge_bio_hook() if (ret || bio_add_page()) { ret = btrfs_map_bio(); BUG_ON(ret); ... } ... } So ret < 0 is handled, if there's any errors from merge_bio_hook(), it submits the current bio, so the way is sane to me. The other consumer of merge_bio is submit_extent_page(), where it's OK to return errors and callers are ready to handle them. > > It's IMO better to push up the BUG_ON error handling only one caller at > a time. That way it's easier to review the callgraph and call paths. merge_bio() is a wrapper for tree->ops->merge_bio_hook(), which is the same thing with btrfs_merge_bio_hook(). It makes no sense if we just look at BUG_ON in btrfs_merge_bio_hook but keep BUG_ON() in merge_bio(). Thanks, -liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: About in-band dedupe for v4.7
On Fri, May 13, 2016 at 11:13:18AM +0800, Wang Shilong wrote: > I am commenting not because of Qu's patch, of course, Qu and Mark > Fasheh > Do a really good thing for Btrfs contributions. > > Just my two cents! > > 1) I think Currently, we should really focus on Btrfs stability, there > are still many > bugs hidden inside Btrfs, please see Filipe flighting patches here and > there. Unfortunately, > I have not seen Btrfs's Bug is reducing for these two years even we > have frozen new features here. So this means that the number of bugs is increasing despite all the bugfixes that get merged? I wonder if you have numbers to back that or if it's just handwaving. Complex code has bugs and it's not easy to discover them, that's not surprising. > we are adopting Btrfs internally sometime before, but it is really > unstable unfortunately. > > So Why not sit down and make it stable firstly? And do you think this does not happen? Most of the patches that are merged are fixes or updates related to fixes. The number of new big features is at most one per release. > 2)I am not against new feature, but for a new feature, I think we > should be really > careful now, Especially if a new feature affect normal write/read > path, I think following things can > be done to make things better: > > ->Give your design and documents(maybe put it in wiki or somewhere > else) So that > other guys can really review your design instead of looking a bunch of > codes firstly. And we really > need understand pros and cons of design, also if there is TODO, please > do clarity out how we > can solve this problem(or it is possible?). The design document should come as the 0-th patch in a series. This patchset had the overall idea described in 1438072250-2871-1-git-send-email-quwen...@cn.fujitsu.com, further comments about details are scattered under the patchset revisions, so this could be improved so everybody knows what's the consensus. 
>->We need really a lot of testing and benchmarking if it affects > normal write/read path. Sure, new tests and test results are welcome but they do not come in hundreds. >->I think Chris is nice to merge patches, but I really argue next > time if we want to merge new features > Please make sure at least two other guys review for patches. Thanks for the advice, but that's nothing new. People willing to do reviews are scarce and overloaded. More reviews are not a problem, everybody does it in a different style, the more the better coverage we'll have in the end. Doing reviews cannot be forced otherwise the quality would drop, announcing 'reviewer of the month' would not help either. Everybody who considers themselves a frequent contributor and knows some area should be also self-motivated to review patches that touch it. Reviewing has a positive educational effect as one has to make sure he understands what the old code does and how it's going to be changed by the patch. There's usually overlap with other areas so this slowly extends the expertise which has a good impact on the developer community etc. Maintainers do reviews, but not necessarily as deep as another developer could do, because sometimes it's difficult, or too lengthy to get familiar (again) with some particular code or code path. This leads to slow merging pace or, worse, merging buggy patches if the review misses some dark corners. I think I'm stating the obvious but it should not hurt to repeat this from time to time.
Re: About in-band dedupe for v4.7
On Thu, May 12, 2016 at 01:54:26PM -0700, Mark Fasheh wrote: > On Wed, May 11, 2016 at 07:36:59PM +0200, David Sterba wrote: > > On Tue, May 10, 2016 at 07:52:11PM -0700, Mark Fasheh wrote: > > > Taking your history with qgroups out of this btw, my opinion does not > > > change. > > > > > > With respect to in-memory only dedupe, it is my honest opinion that such a > > > limited feature is not worth the extra maintenance work. In particular > > > there's about 800 lines of code in the userspace patches which I'm sure > > > you'd want merged, because how could we test this then? > > > > I like the in-memory dedup backend. It's lightweight, only a heuristic, > > does not need any IO or persistent storage. OTOH I consider it a subpart > > of the in-band deduplication that does all the persistency etc. So I > > treat the ioctl interface from a broader aspect. > > Those are all nice qualities, but what do they all get us? > > For example, my 'large' duperemove test involves about 750 gigabytes of > general purpose data - quite literally /home off my workstation. Home directory is IMO a typical anti-usecase for in-band deduplication. > After the run I'm usually seeing between 65-75 gigabytes saved for a total > of only 10% duplicated data. I would expect this to be fairly 'average' - > /home on my machine has the usual stuff - documents, source code, media, > etc. > > So if you were writing your whole fs out you could expect about the same > from inline dedupe - 10%-ish. Let's be generous and go with that number > though as a general 'this is how much dedupe we get'. > > What the memory backend is doing then is providing a cache of sha256/block > calculations. This cache is very expensive to fill, and every written block > must go through it. On top of that, the cache does not persist between > mounts, and has items regularly removed from it when we run low on memory. > All of this will drive down the amount of duplicated data we can find. 
> > So our best case savings is probably way below 10% - let's be _really_ nice > and say 5%. > > Now ask yourself the question - would you accept a write cache which is > expensive to fill and would only have a hit rate of less than 5%? I'd ask myself a different question: would I turn on a write cache if the utilization is that low? No. The in-memory cache is not going to be on by default, it's going to help in specific scenarios, as described in other replies. > Oh and there's 800 lines of userspace we'd merge to manage this cache too, > kernel ioctls which would have to be finalized, etc. Most of the userspace code is going to be shared with the generic dedup. I haven't reviewed the userspace because the ioctl interfaces are not settled so the code can change, but now it should be enough to test the feature. > > A usecase I find interesting is to keep the in-memory dedup cache and > then flush it to disk on demand, compared to automatically synced dedup > (eg. at commit time). > What's the benefit here? We're still going to be hashing blocks on the way > in, and if we're not deduping them at write time then we'll just have to > remove the extents and dedupe them later. > > > > A couple of example sore points in my review so far: > > > > > > - Internally you're using a mutex (instead of a spinlock) to lock out > > > queries > > > to the in-memory hash, which I can see becoming a performance problem in > > > the > > > write path. > > > > > > - Also, we're doing SHA256 in the write path which I expect will > > > slow it down even more dramatically. Given that all the work done gets > > > thrown out every time we fill the hash (or remount), I just don't see > > > much > > > benefit to the user with this. > > > > I had some ideas to use faster hashes and do sha256 when it's going to > > be stored on disk, but there were some concerns. The objection against > > speed and performance hit at write time is valid. 
But we'll need to > > verify that in real performance tests, which haven't happened yet up to > > my knowledge. > > This is the type of thing that IMHO absolutely must be provided with each > code drop of the feature. Dedupe is nice but _nobody_ will use it if it's > slow. I know this from experience. I personally feel that btrfs has had > enough of 'cute' and 'almost working' features. If we want inline dedupe we > should do it correctly and with the right metrics from the beginning. That's why this feature is on a slow-merge track and I'm glad that so many people are interested. And I hope it's not just because Qu asked to merge it. I'm pushing back on the interface side, so far I think the feature will address real usecases. To evaluate the feature we have to put it into some more visible git tree, not only to see how it plays together with other changes, but because the number of people who would apply the patches from the mailing list is low. Leaving out the patchset from a for-next is easy if we encounter problems. > This is slightly
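The "faster hashes, sha256 only when stored on disk" idea can be sketched as a two-step lookup. This is a hypothetical toy, not the proposed patchset's code: fnv1a stands in for the cheap first-pass hash, and the memcmp stands in for the strong-hash verification that would have to happen before any extents are actually shared.

```c
#include <stdint.h>
#include <string.h>

#define TOY_BLOCK 64	/* stand-in block size; btrfs would use sectorsize */
#define TOY_SLOTS 256

/* FNV-1a: one example of a cheap first-pass hash for the in-memory cache. */
static uint64_t fnv1a(const uint8_t *p, size_t n)
{
	uint64_t h = 1469598103934665603ULL;

	while (n--) {
		h ^= *p++;
		h *= 1099511628211ULL;
	}
	return h;
}

struct toy_dedupe_cache {
	struct {
		uint64_t hash;
		uint8_t block[TOY_BLOCK];
		int used;
	} slot[TOY_SLOTS];
};

/* Returns 1 on a verified duplicate, 0 after remembering a new block.
 * The memcmp is the "verify with the strong hash" step: the fast hash
 * alone can collide, so it is only a filter, never the final answer. */
static int toy_dedupe(struct toy_dedupe_cache *c, const uint8_t *block)
{
	uint64_t h = fnv1a(block, TOY_BLOCK);
	unsigned int i = (unsigned int)(h % TOY_SLOTS);

	if (c->slot[i].used && c->slot[i].hash == h &&
	    memcmp(c->slot[i].block, block, TOY_BLOCK) == 0)
		return 1;	/* duplicate: extents could be shared here */

	c->slot[i].used = 1;	/* miss (or eviction by collision) */
	c->slot[i].hash = h;
	memcpy(c->slot[i].block, block, TOY_BLOCK);
	return 0;
}

static int toy_dedupe_selftest(void)
{
	static struct toy_dedupe_cache cache;	/* static: zeroed, off-stack */
	uint8_t a[TOY_BLOCK], b[TOY_BLOCK];

	memset(a, 'a', TOY_BLOCK);
	memset(b, 'b', TOY_BLOCK);

	return toy_dedupe(&cache, a) == 0 &&	/* first sight: inserted */
	       toy_dedupe(&cache, a) == 1 &&	/* duplicate detected */
	       toy_dedupe(&cache, b) == 0;	/* different data: miss */
}
```

The fixed slot array also mirrors Mark's eviction complaint: a colliding new block silently replaces an old entry, so duplicates of the evicted block are no longer found.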
[PULL] Btrfs for 4.7
Hi, please queue the following branch to 4.7. New features or user-visible changes: * device delete by id, a v2 ioctl for device deletion (this was held back from 4.6 pull due to possibly related crashes that haven't appeared for a long time) * DUP allowed on multiple-device filesystem (help in case we want to go to SINGLE from RAID1/.. but do not want to lose all redundancy while converting) * GETFLAGS/SETFLAGS/GETVERSION ioctls work on 32bit Developer-visible changes: * definitions that are shared with userspace are moved to the uapi directory Other: * assorted bugfixes and cleanups The branch is merged from several topic branches, grouped by functionality or purpose. The patches have been in for-next, most of them for a few weeks at least, tested together with Filipe's branch. Thanks. The following changes since commit 02da2d72174c61988eb4456b53f405e3ebdebce4: Linux 4.6-rc5 (2016-04-24 16:17:05 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-chris-4.7 for you to fetch changes up to 680834ca0ad4e9827048d4bda1e38db69c3dd1e4: Merge branch 'foreign/jeffm/uapi' into for-chris-4.7-20160516 (2016-05-16 15:46:29 +0200) Adam Borowski (1): btrfs: fix int32 overflow in shrink_delalloc(). 
Anand Jain (20): btrfs: rename btrfs_std_error to btrfs_handle_fs_error btrfs: remove unused function btrfs_assert() btrfs: move error handling code together in ctree.h btrfs: remove save_error_info() btrfs: create a helper function to read the disk super btrfs: create helper function __check_raid_min_devices() btrfs: clean up and optimize __check_raid_min_device() btrfs: create helper btrfs_find_device_by_user_input() btrfs: make use of btrfs_find_device_by_user_input() btrfs: enhance btrfs_find_device_by_user_input() to check device path btrfs: make use of btrfs_scratch_superblocks() in btrfs_rm_device() btrfs: introduce device delete by devid btrfs: optimize check for stale device btrfs: use fs_info directly btrfs: refactor btrfs_dev_replace_start for reuse btrfs: pass the right error code to the btrfs_std_error btrfs: s_bdev is not null after missing replace btrfs: cleanup assigning next active device with a check btrfs: fix lock dep warning, move scratch dev out of device_list_mutex and uuid_mutex btrfs: fix lock dep warning move scratch super outside of chunk_mutex Ashish Samant (1): btrfs: Fix BUG_ON condition in scrub_setup_recheck_block() Austin S. 
Hemmelgarn (1): btrfs: allow balancing to dup with multi-device Chandan Rajendra (1): Btrfs: __btrfs_buffered_write: Pass valid file offset when releasing delalloc space Dan Carpenter (1): btrfs: send: silence an integer overflow warning David Sterba (32): btrfs: rename __check_raid_min_devices btrfs: pass number of devices to btrfs_check_raid_min_devices btrfs: introduce raid-type to error-code table, for minimum device constraint btrfs: use existing device constraints table btrfs_raid_array btrfs: rename btrfs_find_device_by_user_input btrfs: rename flags for vol args v2 btrfs: kill unused writepage_io_hook callback btrfs: ioctl: reorder exclusive op check in RM_DEV btrfs: send: use vmalloc only as fallback for send_buf btrfs: send: use vmalloc only as fallback for read_buf btrfs: send: use temporary variable to store allocation size btrfs: send: use vmalloc only as fallback for clone_roots btrfs: send: use vmalloc only as fallback for clone_sources_tmp btrfs: clone: use vmalloc only as fallback for nodesize bufer btrfs: use dynamic allocation for root item in create_subvol btrfs: reuse existing variable in scrub_stripe, reduce stack usage btrfs: add read-only check to sysfs handler of features btrfs: add check to sysfs handler of label btrfs: sysfs: protect reading label by lock btrfs: add write protection to SET_FEATURES ioctl btrfs: ioctl: reorder exclusive op check in RM_DEV btrfs: switch to common message helpers in open_ctree, adjust messages btrfs: GFP_NOFS does not GFP_HIGHMEM btrfs: rename and document compression workspace members btrfs: preallocate compression workspaces btrfs: make find_workspace always succeed btrfs: make find_workspace warn if there are no workspaces btrfs: build fixup for qgroup_account_snapshot Merge branch 'misc-4.7' into for-chris-4.7-20160516 Merge branch 'cleanups-4.7' into for-chris-4.7-20160516 Merge branch 'foreign/anand/dev-del-by-id-ext' into for-chris-4.7-20160516 Merge branch 'foreign/jeffm/uapi' into 
for-chris-4.7-20160516 Geert Uytterhoeven (1): Btrfs: Refactor
Re: [PATCH v2 0/3] btrfs-progs: autogen: Some compatibility fixs
On Fri, May 13, 2016 at 11:44:50AM +0800, Zhao Lei wrote: > Changelog v1->v2: > 1: btrfs-progs: autogen: Make build success in CentOS 6 and 7 >Add AC_SUBST(UDEVDIR), suggested by: Jeff Mahoney> > Zhao Lei (3): > btrfs-progs: autogen: Avoid chdir fail on dirname with blank > btrfs-progs: autogen: Make build success in CentOS 6 and 7 > btrfs-progs: autogen: Don't show success message on fail Applied, thanks.
Re: RFE: 'btrfs' tools machine readable output
On Monday 16 May 2016 14:21:07 Martin Steigerwald wrote: > Hello Richard, > > On Montag, 16. Mai 2016 13:14:56 CEST Richard W.M. Jones wrote: > > I don't have time to implement this right now, so I'm just posting > > this as a suggestion/request ... > > > > It would be really helpful if the btrfs tools had a machine-readable > > output. > > > > Libguestfs parses btrfs tools output in a number of places, eg: > > https://github.com/libguestfs/libguestfs/blob/master/daemon/btrfs.c > > This is a massive PITA because each time a new release of btrfs-progs > > comes along it changes the output slightly, and we end up having > > to add all sorts of hacks. > > > > With machine-readable output, there'd be a flag which would > > change the output. eg: > > I wonder whether parsing a text based output is really the most elegant > method > here. > > How about a libbtrfs so that other tools can benefit from btrfs tools > functionality? btrfs-progs is GPL v2 only, and that may be an issue for consumers (the libguestfs daemon, the component interacting with devices and filesystems inside the libguestfs appliance, is GPL v2+ for example). > Of course it would likely be more effort than to implement structured output. There's this of course, which is not entirely negligible. -- Pino Toscano
Re: RFE: 'btrfs' tools machine readable output
On Mon, May 16, 2016 at 02:21:07PM +0200, Martin Steigerwald wrote: > Hello Richard, > > On Montag, 16. Mai 2016 13:14:56 CEST Richard W.M. Jones wrote: > > I don't have time to implement this right now, so I'm just posting > > this as a suggestion/request ... > > > > It would be really helpful if the btrfs tools had a machine-readable > > output. > > > > Libguestfs parses btrfs tools output in a number of places, eg: > > https://github.com/libguestfs/libguestfs/blob/master/daemon/btrfs.c > > This is a massive PITA because each time a new release of btrfs-progs > > comes along it changes the output slightly, and we end up having > > to add all sorts of hacks. > > > > With machine-readable output, there'd be a flag which would > > change the output. eg: > > I wonder whether parsing a text based output is really the most elegant > method > here. > > How about a libbtrfs so that other tools can benefit from btrfs tools > functionality? This was also desktop environments wishing to make use of > snapshot functionality or advanced disk usage reporting for example can > easily > make use of it without calling external commands. > > Of course it would likely me more effort than to implement structured output. Yes, a similar situation happened with qemu actually. We've been talking for over half a decade about putting the qemu block layer into a library, and it's still not happened. But we got `qemu-img info' to have JSON output in a few weeks, and that is now how we query the properties of VM disk images: $ qemu-img info --output=json ./builder/fedora.qcow2 { "virtual-size": 1073741824, "filename": "./builder/fedora.qcow2", "cluster-size": 65536, "format": "qcow2", "actual-size": 7933952, "format-specific": { "type": "qcow2", "data": { "compat": "1.1", "lazy-refcounts": false, "refcount-bits": 16, "corrupt": false } }, "dirty-flag": false } Rich. 
-- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-builder quickly builds VMs from scratch http://libguestfs.org/virt-builder.1.html
Re: RFE: 'btrfs' tools machine readable output
On 2016-05-16 08:14, Richard W.M. Jones wrote: I don't have time to implement this right now, so I'm just posting this as a suggestion/request ... It would be really helpful if the btrfs tools had a machine-readable output. Libguestfs parses btrfs tools output in a number of places, eg: https://github.com/libguestfs/libguestfs/blob/master/daemon/btrfs.c This is a massive PITA because each time a new release of btrfs-progs comes along it changes the output slightly, and we end up having to add all sorts of hacks. With machine-readable output, there'd be a flag which would change the output. eg:

$ btrfs filesystem show
Label: 'ROOT' uuid: af471cfc-421a-4d51-8811-ce969f76828a
	Total devices 1 FS bytes used 5.29MiB
	devid 1 size 767.97MiB used 92.00MiB path /dev/sda2

would become:

$ btrfs --json filesystem show
{
  "devices": {
    "Label": "ROOT",
    "uuid": "af471cfc-421a-4d51-8811-ce969f76828a",
    /// etc
  }
}

By this example I don't mean that JSON has to be the format -- in fact it's a terrible format with all sorts of problems -- any format which is parseable with C libraries would do for us.
I would love to see something like this myself, as it would make integration with monitoring tools so much easier. I'd vote for YAML as the output format though, as it's easily human readable as well as machine parseable while still being space efficient. Output from your example above might look something like this:

---
- filesystems
  - label: 'ROOT'
    uuid: af471cfc-421a-4d51-8811-ce969f76828a
    devices: 1
    - devid: 1
      size: 767.97 MB
      used: 92.00 MB
      path: /dev/sda2
    used: 5.29 MB
...

Although of course the numbers would probably just be in bytes instead of 'human' values.
RE: [PATCH] btrfs: fix memory leak during RAID 5/6 device replacement
Hi, Scott Talbert > From: Zhao Lei [mailto:zhao...@cn.fujitsu.com] > Sent: Friday, May 13, 2016 6:31 PM > To: 'dste...@suse.cz'; 'Scott Talbert' > > Cc: 'linux-btrfs@vger.kernel.org' > Subject: RE: [PATCH] btrfs: fix memory leak during RAID 5/6 device replacement > > Hi, Scott Talbert > > * From: David Sterba [mailto:dste...@suse.cz] > > Sent: Tuesday, May 10, 2016 1:32 AM > > To: Scott Talbert > > Cc: linux-btrfs@vger.kernel.org; Zhao Lei > > Subject: Re: [PATCH] btrfs: fix memory leak during RAID 5/6 device > replacement > > > > CC Zhao Lei > > > > On Mon, May 09, 2016 at 09:14:28AM -0400, Scott Talbert wrote: > > > A 'struct bio' is allocated in scrub_missing_raid56_pages(), but it was > > > never > > > freed anywhere. > > > > > > Signed-off-by: Scott Talbert > > > --- > > > fs/btrfs/scrub.c | 2 ++ > > > 1 file changed, 2 insertions(+) > > > > > > diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c > > > index 82bedf9..607cc6e 100644 > > > --- a/fs/btrfs/scrub.c > > > +++ b/fs/btrfs/scrub.c > > > @@ -2130,6 +2130,8 @@ static void scrub_missing_raid56_end_io(struct > > bio *bio) > > > if (bio->bi_error) > > > sblock->no_io_error_seen = 0; > > > > > > + bio_put(bio); > > > + > Seems good by reviewing. > I'll add some debug code to test it more detailedly. > Signed-off-by: Zhao Lei Test-by: Zhao Lei Thanks Zhaolei > Thanks > Zhaolei > > > > btrfs_queue_work(fs_info->scrub_workers, >work); > > > } > > > > > > -- > > > 1.9.1 > > > > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > > > the body of a message to majord...@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
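The leak and the fix come down to the bio lifetime rule: bio allocation hands the caller one reference, and whoever holds the last reference must put it, which for a submitted bio is naturally the end_io handler. A toy model of that rule (toy_bio and friends are invented names, not the kernel's struct bio or bio_put):

```c
#include <stdlib.h>

struct toy_bio {
	int refcount;
	int *freed_flag;	/* lets the self-test observe the free */
};

/* Like bio_alloc(): the caller starts out owning one reference. */
static struct toy_bio *toy_bio_alloc(int *freed_flag)
{
	struct toy_bio *b = calloc(1, sizeof(*b));

	b->refcount = 1;
	b->freed_flag = freed_flag;
	return b;
}

/* Like bio_put(): drop a reference, free on the last one. */
static void toy_bio_put(struct toy_bio *b)
{
	if (--b->refcount == 0) {
		*b->freed_flag = 1;
		free(b);
	}
}

/* Mimics scrub_missing_raid56_end_io() after the fix: do the completion
 * work, then drop the allocation's reference so the bio is not leaked. */
static void toy_end_io(struct toy_bio *b)
{
	/* ... inspect bi_error, update sblock, queue work ... */
	toy_bio_put(b);
}

static int toy_leak_selftest(void)
{
	int freed = 0;
	struct toy_bio *b = toy_bio_alloc(&freed);

	toy_end_io(b);	/* without the put in end_io, 'freed' stays 0 */
	return freed;
}
```

Before the patch, scrub_missing_raid56_end_io returned without the put, so every missing-RAID56 page scrubbed during a replace leaked one bio.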
Re: RFE: 'btrfs' tools machine readable output
Hello Richard, On Montag, 16. Mai 2016 13:14:56 CEST Richard W.M. Jones wrote: > I don't have time to implement this right now, so I'm just posting > this as a suggestion/request ... > > It would be really helpful if the btrfs tools had a machine-readable > output. > > Libguestfs parses btrfs tools output in a number of places, eg: > https://github.com/libguestfs/libguestfs/blob/master/daemon/btrfs.c > This is a massive PITA because each time a new release of btrfs-progs > comes along it changes the output slightly, and we end up having > to add all sorts of hacks. > > With machine-readable output, there'd be a flag which would > change the output. eg: I wonder whether parsing a text based output is really the most elegant method here. How about a libbtrfs so that other tools can benefit from btrfs tools functionality? This way also desktop environments wishing to make use of snapshot functionality, or advanced disk usage reporting for example, can easily make use of it without calling external commands. Of course it would likely be more effort than to implement structured output. Thanks, -- Martin
RFE: 'btrfs' tools machine readable output
I don't have time to implement this right now, so I'm just posting this as a suggestion/request ... It would be really helpful if the btrfs tools had a machine-readable output. Libguestfs parses btrfs tools output in a number of places, eg: https://github.com/libguestfs/libguestfs/blob/master/daemon/btrfs.c This is a massive PITA because each time a new release of btrfs-progs comes along it changes the output slightly, and we end up having to add all sorts of hacks. With machine-readable output, there'd be a flag which would change the output. eg: $ btrfs filesystem show Label: 'ROOT' uuid: af471cfc-421a-4d51-8811-ce969f76828a Total devices 1 FS bytes used 5.29MiB devid 1 size 767.97MiB used 92.00MiB path /dev/sda2 would become: $ btrfs --json filesystem show { "devices": { "Label": "ROOT", "uuid": "af471cfc-421a-4d51-8811-ce969f76828a", /// etc } } By this example I don't mean that JSON has to be the format -- in fact it's a terrible format with all sorts of problems -- any format which is parseable with C libraries would do for us. Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top
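Until a structured mode exists, consumers are stuck scraping text like the `btrfs filesystem show` sample above. A minimal C sketch of that kind of scraping, and why it is fragile (the field layout is taken from the sample output in this mail; this is not libguestfs code):

```c
#include <stdio.h>
#include <string.h>

/* Extract the uuid field from a line of `btrfs filesystem show` output.
 * This mirrors the brittle scraping libguestfs must do today: any change
 * to the label text or spacing silently breaks the parser, which is the
 * motivation for a machine-readable output mode.
 * Returns 1 on success, 0 if the "uuid: " marker is absent. */
static int parse_show_uuid(const char *line, char *uuid, size_t len)
{
    const char *p = strstr(line, "uuid: ");

    if (!p)
        return 0;
    p += strlen("uuid: ");
    snprintf(uuid, len, "%s", p);
    uuid[strcspn(uuid, " \n")] = '\0';  /* the uuid ends at whitespace */
    return 1;
}
```

One renamed label or extra space in a future btrfs-progs release and the function returns 0 — exactly the class of breakage the RFE wants to eliminate.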
Re: fsck: to repair or not to repair
On 2016-05-16 07:34, Andrei Borzenkov wrote: 16.05.2016 14:17, Austin S. Hemmelgarn wrote: On 2016-05-13 17:35, Chris Murphy wrote: On Fri, May 13, 2016 at 9:28 AM, Nikolaus Rath wrote: On May 13 2016, Duncan <1i5t5.dun...@cox.net> wrote: Because btrfs can be multi-device, it needs some way to track which devices belong to each filesystem, and it uses filesystem UUID for this purpose. If you clone a filesystem (for instance using dd or lvm snapshotting, doesn't matter how) and then trigger a btrfs device scan, say by plugging in some other device with btrfs on it so udev triggers a scan, and the kernel sees multiple devices with the same filesystem UUID as a result, and one of those happens to be mounted, you can corrupt both copies as the kernel btrfs won't be able to tell them apart and may write updates to the wrong one. That seems like a rather odd design. Why isn't btrfs refusing to mount in this situation? In the face of ambiguity, guessing is generally bad idea (at least for a computer program). The logic you describe requires code. It's the absence of code rather than an intentional design that's the cause of the current behavior. And yes, it'd be nice if Btrfs weren't stepping on its own tail in this situation. It could be as simple as refusing to mount anytime there's an ambiguity, but that's sorta user hostile if there isn't a message that goes along with it to help the user figure out a way to resolve the problem. And that too could be fraught with peril if the user makes a mistake. So, really what's the right way to do this is part of the problem but I agree it's better to be hostile and refuse to mount a given volume UUID at all when too many devices are found, than corrupt the file system. FWIW, the behavior I'd expect from a sysadmin perspective would be: 1. If and only if a correct number of device= options have been passed to mount, use those devices (and only those devices), and log a warning if extra devices are detected.
First, how do you know that devices, passed as device= options, are correct? Is it possible to detect stale copy? You don't. As much as it pains me to say it, there's no way to protect against this reliably. The intent is that if you have specified the correct number of devices according to the number the filesystem says should be there (and that number is the same on all devices specified), it's assumed you know what you're doing. Second, today udev rules will run equivalent of "btrfs device ready" for each device that is part of btrfs. That's part of the rules shipped by systemd, and is not by any means on every system in existence. That is an inherent design flaw in systemd resulting from them thinking they're smarter than the kernel, and it has on multiple occasions bit people. So you still need to handle the situation when device(s) appear and disappear after initial mount and have some way to distinguish between two copies. Yes, you need to account for devices appearing and disappearing, but at least until we add proper support for off-line devices, that's easy. Third, what exactly "extra devices detected" means? Who is responsible for detection? Where this information is kept? How can mount query this information? If there are more devices with the filesystem's UUID than are passed in via device= options, and the above stated condition regarding device= options is met, then those are extra devices. 2. Otherwise, refuse to mount and log a warning. So no way to mount degraded redundant filesystem? I know a large number of people who routinely use degraded as part of their fstab options. Degraded is supposed to mean reduced data safety, not 'may cause random corruption just by being used'. I have no issue with a mount option to force mounting it anyway, but I absolutely do not want that to be part of the degraded mount option. 
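Austin's two rules can be sketched as a small decision function (an illustration of the proposed policy as discussed above, not actual kernel mount code):

```c
/* Sketch of the proposed mount policy: proceed only when the device=
 * count matches the device count the filesystem records; extra scanned
 * devices with the same UUID produce a warning rather than silent use
 * of an arbitrary copy. */
enum mount_decision {
    MOUNT_OK,           /* rule 1: counts match, no stray devices       */
    MOUNT_OK_WARN,      /* rule 1: counts match, log "extra devices"    */
    MOUNT_REFUSE        /* rule 2: refuse to mount and log a warning    */
};

static enum mount_decision decide_mount(int fs_num_devices, /* superblock    */
                                        int devopt_count,   /* device= opts  */
                                        int scanned_count)  /* same-UUID devs */
{
    if (devopt_count != fs_num_devices)
        return MOUNT_REFUSE;
    if (scanned_count > fs_num_devices)
        return MOUNT_OK_WARN;   /* e.g. a dd/lvm clone was scanned */
    return MOUNT_OK;
}
```

As Andrei's questions point out, the hard part is not this check but feeding it: knowing which scanned copy is stale, and tracking appearances and disappearances after the initial mount.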
Re: problems with free space cache
On 2016-05-16 02:20, Qu Wenruo wrote: Duncan wrote on 2016/05/16 05:59 +: Qu Wenruo posted on Mon, 16 May 2016 10:24:23 +0800 as excerpted: IIRC clear_cache option is fs level option. So the first mount with clear_cache, then all subvolume will have clear_cache. Question: Does clear_cache work with a read-only mount? Good question. But easy to check. I just checked it and found that even though it's possible, it doesn't work. Free space cache inode bytenr doesn't change and no generation change. While without ro, it did rebuild free space cache for *SOME* chunks, not *ALL* chunks. And that's the problem I'm just chasing today. Short conclusion: the clear_cache mount option will only rebuild the free space cache for chunks which we allocated space from during the mount with clear_cache. (Maybe I'm just out-of-date and some other devs may already know that) And a kernel fix for that is a little tricky, as we don't really read out all free space cache, but only when we are going to allocate space from them. Any free space cache we didn't read out won't be rebuilt. You can find my reproducer in my just submitted patch (https://patchwork.kernel.org/patch/9098431/). That behavior makes things confusing: users may keep hitting the annoying free space cache warnings from the kernel even if they try to use the clear_cache mount option. Anyway, I'll add the ability to manually wipe out all (or a given) free space cache to btrfsck, to at least create a solution that really rebuilds all v1 free space caches. FWIW, I think it's possible to do this by mounting with clear_cache and then running a full balance on the filesystem. Having an option to do this on an unmounted FS would be preferred of course, as that would almost certainly be more efficient on any reasonably sized filesystem.
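Qu's observation — clear_cache only rebuilds caches for chunks the allocator touched during that mount — can be modeled in a few lines. This is a toy model, not btrfs code; it also illustrates why Austin's clear_cache-plus-full-balance workaround helps, on the assumption that a balance makes the allocator touch every chunk:

```c
/* Toy model of lazy v1 free space cache rebuilding: mounting with
 * clear_cache rebuilds the cache only for the chunks the allocator
 * actually reads during that mount, so untouched chunks keep their
 * stale cache and the kernel warning persists on the next mount. */
struct chunk {
    int cache_ok;   /* 0 = stale/corrupted cache, 1 = rebuilt */
};

/* Simulate one mount with clear_cache: caches are rebuilt only for the
 * chunks listed in alloc_from[]. Returns how many chunks still carry a
 * stale cache afterwards. */
static int mount_clear_cache(struct chunk *chunks, int nchunks,
                             const int *alloc_from, int nalloc)
{
    int stale = 0;
    int i;

    for (i = 0; i < nalloc; i++)
        chunks[alloc_from[i]].cache_ok = 1; /* rebuilt on allocation */
    for (i = 0; i < nchunks; i++)
        if (!chunks[i].cache_ok)
            stale++;                        /* never read, never rebuilt */
    return stale;
}
```

In this model a normal workload that allocates from only some chunks leaves the rest stale, while a balance-like pass over every chunk leaves none — which matches both Qu's reproducer and the proposed btrfsck-based full wipe.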
Re: fsck: to repair or not to repair
16.05.2016 14:17, Austin S. Hemmelgarn wrote: > On 2016-05-13 17:35, Chris Murphy wrote: >> On Fri, May 13, 2016 at 9:28 AM, Nikolaus Rath wrote: >>> On May 13 2016, Duncan <1i5t5.dun...@cox.net> wrote: Because btrfs can be multi-device, it needs some way to track which devices belong to each filesystem, and it uses filesystem UUID for this purpose. If you clone a filesystem (for instance using dd or lvm snapshotting, doesn't matter how) and then trigger a btrfs device scan, say by plugging in some other device with btrfs on it so udev triggers a scan, and the kernel sees multiple devices with the same filesystem UUID as a result, and one of those happens to be mounted, you can corrupt both copies as the kernel btrfs won't be able to tell them apart and may write updates to the wrong one. >>> >>> That seems like a rather odd design. Why isn't btrfs refusing to mount >>> in this situation? In the face of ambiguity, guessing is generally bad >>> idea (at least for a computer program). >> >> The logic you describe requires code. It's the absence of code rather >> than an intentional design that's the cause of the current behavior. >> And yes, it'd be nice if Btrfs weren't stepping on its own tail in >> this situation. It could be as simple as refusing to mount anytime >> there's an ambiguity, but that's sorta user hostile if there isn't a >> message that goes along with it to help the user figure out a way to >> resolve the problem. And that too could be fraught with peril if the >> user makes a mistake. So, really what's the right way to do this is >> part of the problem but I agree it's better to be hostile and refuse >> to mount a given volume UUID at all when too many devices are found, >> than corrupt the file system. >> > FWIW, the behavior I'd expect from a sysadmin perspective would be: > 1. 
If and only if a correct number of device= options have been passed > to mount, use those devices (and only those devices), and log a warning > if extra devices are detected. First, how do you know that devices, passed as device= options, are correct? Is it possible to detect stale copy? Second, today udev rules will run equivalent of "btrfs device ready" for each device that is part of btrfs. So you still need to handle the situation when device(s) appear and disappear after initial mount and have some way to distinguish between two copies. Third, what exactly "extra devices detected" means? Who is responsible for detection? Where this information is kept? How can mount query this information? > 2. Otherwise, refuse to mount and log a warning. So no way to mount degraded redundant filesystem?
Re: BTRFS Data at Rest File Corruption
On 2016-05-16 02:07, Chris Murphy wrote: Current hypothesis "I suspected, and I still suspect that the error occurred upon a metadata update that corrupted the checksum for the file, probably due to silent memory corruption. If the checksum was silently corrupted, it would be simply written to both drives causing this type of error." A metadata update alone will not change the data checksums. But let's ignore that. If there's corrupt extent csum in a node that itself has a valid csum, this is functionally identical to e.g. nerfing 100 bytes of a file's extent data (both copies, identically). The fs doesn't know the difference. All it knows is the node csum is valid, therefore the data extent csum is valid, and that's why it assumes the data is wrong and hence you get an I/O error. And I can reproduce most of your results by nerfing file data. The entire dmesg for scrub looks like this: May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-6): checksum error at logical 5566889984 on dev /dev/dm-6, sector 8540160, root 5, inode 258, offset 0, length 4096, links 1 (path: openSUSE-Tumbleweed-NET-x86_64-Current.iso) May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6): bdev /dev/dm-6 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6): unable to fixup (regular) error at logical 5566889984 on dev /dev/dm-6 May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-6): checksum error at logical 5566889984 on dev /dev/mapper/VG-b1, sector 8579072, root 5, inode 258, offset 0, length 4096, links 1 (path: openSUSE-Tumbleweed-NET-x86_64-Current.iso) May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6): bdev /dev/mapper/VG-b1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6): unable to fixup (regular) error at logical 5566889984 on dev /dev/mapper/VG-b1 And the entire dmesg for running sha256sum on the file is 
May 15 23:33:41 f23s.localdomain kernel: __readpage_endio_check: 22 callbacks suppressed May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6): csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141 May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6): csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141 May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6): csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141 May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6): csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141 May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6): csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141 And I do get an i/o error for sha256sum and no hash is computed. But there are two important differences: 1. I have two unable to fixup messages, one for each device, at the exact same time. 2. I altered both copies of extent data. It's a mystery to me how your file data has not changed, but somehow the extent csum was changed but also the node csum was recomputed correctly. That's a bit odd. I would think this would be perfectly possible if some other file that had a checksum in that node changed, thus forcing the node's checksum to be updated. Theoretical sequence of events: 1. Some file which has a checksum in node A gets written to. 2. Node A is loaded into memory to update the checksum. 3. The new checksum for the changed extent in the file gets updated in the in-memory copy of node A. 4. Node A has its own checksum recomputed based on the new data, and then gets saved to disk. If something happened after 2 but before 4 that caused one of the other checksums to go bad, then the checksum computed in 4 will have been computed with the corrupted data. 
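Austin's four-step sequence can be simulated directly: if the corruption hits the in-memory extent csum between the load (step 2) and the node-csum recompute (step 4), the node checksum is valid over already-bad contents. A toy model follows — btrfs really uses crc32c; a simple rolling sum stands in for it here:

```c
#include <stdint.h>

/* Toy model of the sequence of events above: a "node" holds per-extent
 * checksums plus a checksum over its own contents. */
#define NSUMS 4

struct node {
    uint32_t extent_csum[NSUMS];
    uint32_t node_csum;
};

static uint32_t toy_csum(const uint32_t *data, int n)
{
    uint32_t c = 0;

    while (n--)
        c = c * 31 + *data++;   /* stand-in for crc32c */
    return c;
}

/* Steps 2-4: update one extent csum in the in-memory node, then recompute
 * the node csum before writeback. If corrupt_slot >= 0, silent memory
 * corruption flips that slot between the update and the recompute, so the
 * node csum is computed over already-corrupted contents. */
static void update_and_writeback(struct node *n, int slot, uint32_t new_csum,
                                 int corrupt_slot)
{
    n->extent_csum[slot] = new_csum;                /* step 3 */
    if (corrupt_slot >= 0)
        n->extent_csum[corrupt_slot] ^= 0xdead;     /* bit flips in memory */
    n->node_csum = toy_csum(n->extent_csum, NSUMS); /* step 4 */
}

/* Reader's view: the node csum verifies even though one extent csum no
 * longer matches its (unchanged) file data. */
static int node_csum_valid(const struct node *n)
{
    return n->node_csum == toy_csum(n->extent_csum, NSUMS);
}
```

The result matches the report: scrub and reads see a data csum mismatch on both mirrors, while every metadata checksum verifies cleanly.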
Re: Hot data tracking / hybrid storage
On 2016-05-15 08:12, Ferry Toth wrote: Is there anything going on in this area? We have btrfs in RAID10 using 4 HDD's for many years now with a rotating scheme of snapshots for easy backup. <10% files (bytes) change between oldest snapshot and the current state. However, the filesystem seems to become very slow, probably due to the RAID10 and the snapshots. While it's not exactly what you're thinking of, have you tried running BTRFS in raid1 mode on top of two DM/MD RAID0 volumes? This provides the same degree of data integrity that BTRFS raid10 does, but gets measurably better performance. It would be fantastic if we could just add 4 SSD's to the pool and btrfs would just magically prefer to put often accessed files there and move older or less popular files to the HDD's. In my simple mind this can not be done easily using bcache as that would require completely rebuilding the file system on top of bcache (can not just add a few SSD's to the pool), while implementing a cache inside btrfs is probably a complex thing with lots of overhead. You may want to look into dm-cache, as that doesn't require reformatting the source device. It doesn't quite get the same performance as bcache, but for me at least, the lower performance is a reasonable trade-off for being able to easily convert a device to use it, and being able to easily convert away from it if need be. Simply telling the allocator to prefer new files to go to the ssd and move away unpopular stuff to hdd during balance should do the trick, or am I wrong? In theory this would work as a first implementation, but it would need to have automatic data migration as an option to be considered practical, and that's not as easy to do correctly.
Re: fsck: to repair or not to repair
On 2016-05-13 17:35, Chris Murphy wrote: On Fri, May 13, 2016 at 9:28 AM, Nikolaus Rath wrote: On May 13 2016, Duncan <1i5t5.dun...@cox.net> wrote: Because btrfs can be multi-device, it needs some way to track which devices belong to each filesystem, and it uses filesystem UUID for this purpose. If you clone a filesystem (for instance using dd or lvm snapshotting, doesn't matter how) and then trigger a btrfs device scan, say by plugging in some other device with btrfs on it so udev triggers a scan, and the kernel sees multiple devices with the same filesystem UUID as a result, and one of those happens to be mounted, you can corrupt both copies as the kernel btrfs won't be able to tell them apart and may write updates to the wrong one. That seems like a rather odd design. Why isn't btrfs refusing to mount in this situation? In the face of ambiguity, guessing is generally bad idea (at least for a computer program). The logic you describe requires code. It's the absence of code rather than an intentional design that's the cause of the current behavior. And yes, it'd be nice if Btrfs weren't stepping on its own tail in this situation. It could be as simple as refusing to mount anytime there's an ambiguity, but that's sorta user hostile if there isn't a message that goes along with it to help the user figure out a way to resolve the problem. And that too could be fraught with peril if the user makes a mistake. So, really what's the right way to do this is part of the problem but I agree it's better to be hostile and refuse to mount a given volume UUID at all when too many devices are found, than corrupt the file system. FWIW, the behavior I'd expect from a sysadmin perspective would be: 1. If and only if a correct number of device= options have been passed to mount, use those devices (and only those devices), and log a warning if extra devices are detected. 2. Otherwise, refuse to mount and log a warning. 
Re: [PATCH 4/7] Btrfs: free sys_array eb as soon as possible
On Fri, May 13, 2016 at 05:06:59PM -0700, Liu Bo wrote: > While reading sys_chunk_array in superblock, btrfs creates a temporary > extent buffer. Since we don't use it after finishing reading > sys_chunk_array, we don't need to keep it in memory. > > Signed-off-by: Liu Bo This patch is independent of the rest, I'll add it to next.
Re: [PATCH 5/7] Btrfs: replace BUG_ON with WARN in merge_bio
On Fri, May 13, 2016 at 05:07:00PM -0700, Liu Bo wrote: > We have two BUG_ON in merge_bio, but since it can gracefully return errors > to callers, use WARN_ONCE to give error information and don't leave a > possible panic. > > Signed-off-by: Liu Bo > --- > fs/btrfs/extent_io.c | 1 - > fs/btrfs/inode.c | 6 -- > 2 files changed, 4 insertions(+), 3 deletions(-) > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > index e601e0f..99286d1 100644 > --- a/fs/btrfs/extent_io.c > +++ b/fs/btrfs/extent_io.c > @@ -2746,7 +2746,6 @@ static int merge_bio(int rw, struct extent_io_tree > *tree, struct page *page, > if (tree->ops && tree->ops->merge_bio_hook) > ret = tree->ops->merge_bio_hook(rw, page, offset, size, bio, > bio_flags); > - BUG_ON(ret < 0); > return ret; > > } > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 5874562..3a989e3 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -1827,8 +1827,10 @@ int btrfs_merge_bio_hook(int rw, struct page *page, > unsigned long offset, > map_length = length; > ret = btrfs_map_block(root->fs_info, rw, logical, > &map_length, NULL, 0); > - /* Will always return 0 with map_multi == NULL */ > - BUG_ON(ret < 0); > + if (ret < 0) { > + WARN_ONCE(ret < 0, KERN_ERR "ret = %d\n", ret); btrfs_map_block is quite verbose about the errors so it's not needed to print it here. Otherwise I'm not sure if all paths that go through the merge hook handle errors, eg. in btrfs_submit_compressed_read or _write. Some code is skipped if merge hook returns nonzero. But, the code expects either 0 or 1, and when the BUG_ON(ret < 0) is removed suddenly the 'ret < 0' can be returned. Unexpected. It's IMO better to push up the BUG_ON error handling only one caller at a time. That way it's easier to review the callgraph and call paths.
Re: [PATCH 1/7] Btrfs: replace BUG() with WARN_ONCE in raid56
On Sun, May 15, 2016 at 04:19:28PM +0200, Holger Hoffstätte wrote: > On 05/14/16 02:06, Liu Bo wrote: > > This BUG() has been triggered by a fuzz testing image, but in fact > > btrfs can handle this gracefully by returning -EIO. > > > > Thus, use WARN_ONCE for warning purpose and don't leave a possible > > kernel panic. > > > > Signed-off-by: Liu Bo > > --- > > fs/btrfs/raid56.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c > > index 0b7792e..863f7fe 100644 > > --- a/fs/btrfs/raid56.c > > +++ b/fs/btrfs/raid56.c > > @@ -2139,7 +2139,7 @@ int raid56_parity_recover(struct btrfs_root *root, > > struct bio *bio, > > > > rbio->faila = find_logical_bio_stripe(rbio, bio); > > if (rbio->faila == -1) { > > - BUG(); > > + WARN_ONCE(1, KERN_WARNING "rbio->faila is -1\n"); > I'm generally in favor of not BUGing out for no good reason, but what > is e.g. an admin (or user) supposed to do when he sees this message? > Same for the other rather cryptic WARNs - they contain no actionable > information, and are most likely going to be ignored as "debug spam". > IMHO things that can be ignored can be deleted. Agreed, the way this patchset replaces BUG_ON is very confusing. WARN_ONCE is global state, the message does not even print on which filesystem the error happened. The only way to reset the state is to unload the module. This should be handled as a corruption, no matter if it's fuzzed or not, and report more details about what is corrupted or what was expected.
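David's point about WARN_ONCE being global state is easy to demonstrate in userspace: the once-flag is a static at the call site, so after the first report every later occurrence — including one on a different filesystem — is silent until the module is reloaded (here, until the process restarts). A sketch of the semantics, not the kernel macro:

```c
#include <stdio.h>

/* Userspace sketch of WARN_ONCE semantics. Returns 1 if the warning was
 * printed, 0 if suppressed by the once-flag. */
static int warn_once(const char *msg)
{
    static int warned;          /* global state, never reset */

    if (warned)
        return 0;               /* every later hit is silent */
    warned = 1;
    fprintf(stderr, "WARNING: %s\n", msg);
    return 1;
}
```

This is exactly why a per-filesystem corruption report is a better fit than WARN_ONCE for errors like this one.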
Re: problems with free space cache
Duncan wrote on 2016/05/16 05:59 +: Qu Wenruo posted on Mon, 16 May 2016 10:24:23 +0800 as excerpted: IIRC clear_cache option is fs level option. So the first mount with clear_cache, then all subvolume will have clear_cache. Question: Does clear_cache work with a read-only mount? Good question. But easy to check. I just checked it and found that even though it's possible, it doesn't work. Free space cache inode bytenr doesn't change and no generation change. While without ro, it did rebuild free space cache for *SOME* chunks, not *ALL* chunks. And that's the problem I'm just chasing today. Short conclusion: the clear_cache mount option will only rebuild the free space cache for chunks which we allocated space from during the mount with clear_cache. (Maybe I'm just out-of-date and some other devs may already know that) And a kernel fix for that is a little tricky, as we don't really read out all free space cache, but only when we are going to allocate space from them. Any free space cache we didn't read out won't be rebuilt. You can find my reproducer in my just submitted patch (https://patchwork.kernel.org/patch/9098431/). That behavior makes things confusing: users may keep hitting the annoying free space cache warnings from the kernel even if they try to use the clear_cache mount option. Anyway, I'll add the ability to manually wipe out all (or a given) free space cache to btrfsck, to at least create a solution that really rebuilds all v1 free space caches. Thanks, Qu I could see it being like the log replay and being done even on ro mounts, particularly since unlike the log replay, clear_cache is a specific mount option so could be seen as specifically requested even if otherwise ro. Or not. So I don't know. He mentioned root, which is normally mounted read-only first. That's what got me thinking about it. Does that complicate things? If root is mounted ro without clear_cache by the initr*, will a subsequent remount,rw,clear_cache do it? 
And for those like me who keep root mounted ro by default, would clear_cache work at all, or would it not work until such time as I mounted root, or some other subvolume (which I don't have here to worry about, but for those who do...) writable?
Re: BTRFS Data at Rest File Corruption
Current hypothesis "I suspected, and I still suspect that the error occurred upon a metadata update that corrupted the checksum for the file, probably due to silent memory corruption. If the checksum was silently corrupted, it would be simply written to both drives causing this type of error." A metadata update alone will not change the data checksums. But let's ignore that. If there's corrupt extent csum in a node that itself has a valid csum, this is functionally identical to e.g. nerfing 100 bytes of a file's extent data (both copies, identically). The fs doesn't know the difference. All it knows is the node csum is valid, therefore the data extent csum is valid, and that's why it assumes the data is wrong and hence you get an I/O error. And I can reproduce most of your results by nerfing file data. The entire dmesg for scrub looks like this: May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-6): checksum error at logical 5566889984 on dev /dev/dm-6, sector 8540160, root 5, inode 258, offset 0, length 4096, links 1 (path: openSUSE-Tumbleweed-NET-x86_64-Current.iso) May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6): bdev /dev/dm-6 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6): unable to fixup (regular) error at logical 5566889984 on dev /dev/dm-6 May 15 23:29:46 f23s.localdomain kernel: BTRFS warning (device dm-6): checksum error at logical 5566889984 on dev /dev/mapper/VG-b1, sector 8579072, root 5, inode 258, offset 0, length 4096, links 1 (path: openSUSE-Tumbleweed-NET-x86_64-Current.iso) May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6): bdev /dev/mapper/VG-b1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0 May 15 23:29:46 f23s.localdomain kernel: BTRFS error (device dm-6): unable to fixup (regular) error at logical 5566889984 on dev /dev/mapper/VG-b1 And the entire dmesg for running sha256sum on the file is May 15 23:33:41 f23s.localdomain kernel: 
__readpage_endio_check: 22 callbacks suppressed May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6): csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141 May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6): csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141 May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6): csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141 May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6): csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141 May 15 23:33:41 f23s.localdomain kernel: BTRFS warning (device dm-6): csum failed ino 258 off 0 csum 3634944209 expected csum 1334657141 And I do get an i/o error for sha256sum and no hash is computed. But there are two important differences: 1. I have two unable to fixup messages, one for each device, at the exact same time. 2. I altered both copies of extent data. It's a mystery to me how your file data has not changed, but somehow the extent csum was changed but also the node csum was recomputed correctly. That's a bit odd. Chris Murphy
[PATCH] btrfs-progs: corrupt-block: Add ability to corrupt free space cache file
One user on the mailing list reported a free space cache error that the clear_cache mount option fails to fix. To reproduce such a problem, add support for corrupting the free space cache file (the old free space cache implementation, space_cache=v1). With this patch, we can reproduce the problem quite easily. (Not suitable for a btrfs-progs or xfstests test case, as btrfsck won't treat it as an error.) == $ fallocate test.img -l 2G $ mkfs.btrfs test.img $ sudo mount test.img /mnt/btrfs/ $ sudo xfs_io -f -c "pwrite 0 16M" /mnt/btrfs/tmp $ sudo umount /mnt/btrfs $ ./btrfs-corrupt-block -s content -l 136708096 ../test.img Here 136708096 is the second data chunk bytenr. (The first data chunk is the 8M one from mkfs, which doesn't have free space cache) $ ./btrfsck test.img Checking filesystem on ../test.img UUID: 9542051b-8150-42e5-a9e6-95829656485c checking extents checking free space cache btrfs: csum mismatch on free space cache <<< Found space cache problem failed to load free space cache for block group 136708096 checking fs roots checking csums checking root refs found 16941058 bytes used err is 0 total csum bytes: 16384 total tree bytes: 163840 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 139567 file data blocks allocated: 16908288 referenced 16908288 $ sudo mount -o clear_cache test.img /mnt/btrfs $ sudo umount /mnt/btrfs $ ./btrfsck test.img Checking filesystem on ../test.img UUID: 9542051b-8150-42e5-a9e6-95829656485c checking extents checking free space cache btrfs: csum mismatch on free space cache <<< Problem not fixed by kernel failed to load free space cache for block group 136708096 checking fs roots checking csums checking root refs found 16941058 bytes used err is 0 total csum bytes: 16384 total tree bytes: 163840 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 139567 file data blocks allocated: 16908288 referenced 16908288 == Btrfs-progs fix will follow soon. 
Reported-by: Ivan PSigned-off-by: Qu Wenruo --- btrfs-corrupt-block.c | 124 +- 1 file changed, 123 insertions(+), 1 deletion(-) diff --git a/btrfs-corrupt-block.c b/btrfs-corrupt-block.c index d331f96..eb27265 100644 --- a/btrfs-corrupt-block.c +++ b/btrfs-corrupt-block.c @@ -115,6 +115,9 @@ static void print_usage(int ret) fprintf(stderr, "\t-C Delete a csum for the specified bytenr. When " "used with -b it'll delete that many bytes, otherwise it's " "just sectorsize\n"); + fprintf(stderr, "\t-s \n"); + fprintf(stderr, "\t Corrupt free space cache file(space_cache v1), must also specify -l for blockgroup bytenr\n"); + fprintf(stderr, "\tcan be 'zero_gen','rand_gen' or 'content'\n"); exit(ret); } @@ -1012,6 +1015,100 @@ out: return ret; } + +u64 rand_u64(void) +{ + u64 ret = 0; + int n; + + for (n = 0; n < sizeof(ret) / sizeof(RAND_MAX); n++) + ret += rand() << (n * sizeof(RAND_MAX) * 8); + return ret; +} + +#define CORRUPT_SC_FILE_ZERO_GEN 1 +#define CORRUPT_SC_FILE_RAND_GEN 2 +#define CORRUPT_SC_FILE_CONTENT3 +int corrupt_space_cache_file(struct btrfs_fs_info *fs_info, u64 block_group, +int method) +{ + struct btrfs_root *tree_root = fs_info->tree_root; + struct btrfs_path path; + struct btrfs_key key; + struct btrfs_disk_key location; + struct btrfs_free_space_header *sc_header; + struct btrfs_file_extent_item *fi; + struct extent_buffer *node; + struct extent_buffer *corrupted_node; + u64 disk_bytenr; + int slot; + int ret; + + key.objectid = BTRFS_FREE_SPACE_OBJECTID; + key.type = 0; + key.offset = block_group; + + btrfs_init_path(); + + /* Don't start trans, as this will cause generation different */ + ret = btrfs_search_slot(NULL, tree_root, , , 0, 0); + if (ret) { + error("failed to find free space cahce file for block group %llu, ret: %d\n", + block_group, ret); + goto out; + } + slot = path.slots[0]; + node = path.nodes[0]; + sc_header = btrfs_item_ptr(node, slot, struct btrfs_free_space_header); + if (method == CORRUPT_SC_FILE_ZERO_GEN || + method == 
CORRUPT_SC_FILE_RAND_GEN) { + u64 dst_gen; + + if (method == CORRUPT_SC_FILE_ZERO_GEN) + dst_gen = 0; + else + dst_gen = rand_u64(); + + btrfs_set_free_space_generation(node, sc_header, dst_gen); + /* Manually re-calc csum and write to disk */ + ret = write_tree_block(NULL, tree_root, node); + goto out; + } + + btrfs_free_space_key(node, sc_header, ); + btrfs_disk_key_to_cpu(, );
Re: problems with free space cache
Qu Wenruo posted on Mon, 16 May 2016 10:24:23 +0800 as excerpted: > IIRC clear_cache option is fs level option. > So the first mount with clear_cache, then all subvolume will have > clear_cache. Question: Does clear_cache work with a read-only mount? I could see it being like the log replay and being done even on ro mounts, particularly since unlike the log replay, clear_cache is a specific mount option so could be seen as specifically requested even if otherwise ro. Or not. So I don't know. He mentioned root, which is normally mounted read-only first. That's what got me thinking about it. Does that complicate things? If root is mounted ro without clear_cache by the initr*, will a subsequent remount,rw,clear_cache do it? And for those like me who keep root mounted ro by default, would clear_cache work at all, or would it not work until such time as I mounted root, or some other subvolume (which I don't have here to worry about, but for those who do...) writable? -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman