Re: [PATCH] fstests: btrfs/150 regression test for reading compressed data
On Wed, Sep 20, 2017 at 05:52:43PM -0600, Liu Bo wrote:
> We had a bug in btrfs compression code which could end up with a
> kernel panic.
>
> This is adding a regression test for the bug and I've also sent a
> kernel patch to fix the bug.
>
> The patch is "Btrfs: fix kernel oops while reading compressed data".
>
> Signed-off-by: Liu Bo
> ---
>  tests/btrfs/150     | 102
>  tests/btrfs/150.out |   3 ++
>  tests/btrfs/group   |   1 +
>  3 files changed, 106 insertions(+)
>  create mode 100755 tests/btrfs/150
>  create mode 100644 tests/btrfs/150.out
>
> diff --git a/tests/btrfs/150 b/tests/btrfs/150
> new file mode 100755
> index 000..834be51
> --- /dev/null
> +++ b/tests/btrfs/150
> @@ -0,0 +1,102 @@
> +#! /bin/bash
> +# FS QA Test btrfs/150
> +#
> +# This is a regression test which ends up with a kernel oops in btrfs.

group += dangerous

> +# It occurs when btrfs's read repair happens while reading a compressed
> +# extent.
> +# The patch for this is
> +# x

Incomplete?

> +#
> +#---
> +# Copyright (c) 2017 Liu Bo.  All Rights Reserved.

You're signing off this patch as an Oracle employee, but claiming
personal copyright. Please clarify who owns the copyright - if it's
your personal copyright then please sign off with a personal email
address, not your employer's...

Also, I note that these recently added tests from you:

tests/btrfs/140:# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
tests/btrfs/141:# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
tests/btrfs/142:# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
tests/btrfs/143:# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
tests/generic/406:# Copyright (c) 2017 Liu Bo.  All Rights Reserved.

all have this same ambiguity - personal copyright with employer
signoff in the commit. This definitely needs clarification and
fixing if it is wrong.

> +disable_io_failure()
> +{
> +	echo 0 > $SYSFS_BDEV/make-it-fail
> +	echo 0 > $DEBUGFS_MNT/fail_make_request/probability
> +	echo 0 > $DEBUGFS_MNT/fail_make_request/times
> +}
> +
> +_scratch_pool_mkfs "-d raid1 -b 1G" >> $seqres.full 2>&1
> +
> +# It doesn't matter which compression algorithm we use.
> +_scratch_mount -ocompress
> +
> +# Create a file with all data being compressed
> +$XFS_IO_PROG -f -c "pwrite -W 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io

needs an fsync to reach disk.

> +# Raid1 consists of two copies and btrfs decides which copy to read by reader's
> +# %pid. Now we inject errors to copy #1 and copy #0 is good. We want to read
> +# the bad copy to trigger read-repair.
> +while true; do
> +	disable_io_failure
> +	# invalidate the page cache
> +	$XFS_IO_PROG -f -c "fadvise -d 0 128K" $SCRATCH_MNT/foobar | _filter_xfs_io
> +
> +	enable_io_failure
> +	od -x $SCRATCH_MNT/foobar > /dev/null &

why are you using od to read the data when the output is piped to
/dev/null? why not just xfs_io -c "pread 0 8k"?

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH v2] btrfs: LOGICAL_INO enhancements (this time based on 4.14-rc1)
The previous patch series was based on v4.12.14, and this introductory
text was missing.

This patch series fixes some weaknesses in the btrfs LOGICAL_INO ioctl.

Background:

Suppose we have a file with one extent:

	root@tester:~# zcat /usr/share/doc/cpio/changelog.gz > /test/a
	root@tester:~# sync

Split the extent by overwriting it in the middle:

	root@tester:~# cat /dev/urandom | dd bs=4k seek=2 skip=2 count=1 conv=notrunc of=/test/a

We should now have 3 extent refs to 2 extents, with one block
unreachable.  The extent tree looks like:

	root@tester:~# btrfs-debug-tree /dev/vdc -t 2
	[...]
	item 9 key (1103101952 EXTENT_ITEM 73728) itemoff 15942 itemsize 53
		extent refs 2 gen 29 flags DATA
		extent data backref root 5 objectid 261 offset 0 count 2
	[...]
	item 11 key (1103175680 EXTENT_ITEM 4096) itemoff 15865 itemsize 53
		extent refs 1 gen 30 flags DATA
		extent data backref root 5 objectid 261 offset 8192 count 1
	[...]

and the ref tree looks like:

	root@tester:~# btrfs-debug-tree /dev/vdc -t 5
	[...]
	item 6 key (261 EXTENT_DATA 0) itemoff 15825 itemsize 53
		extent data disk byte 1103101952 nr 73728
		extent data offset 0 nr 8192 ram 73728
		extent compression(none)
	item 7 key (261 EXTENT_DATA 8192) itemoff 15772 itemsize 53
		extent data disk byte 1103175680 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression(none)
	item 8 key (261 EXTENT_DATA 12288) itemoff 15719 itemsize 53
		extent data disk byte 1103101952 nr 73728
		extent data offset 12288 nr 61440 ram 73728
		extent compression(none)
	[...]

There are two references to the same extent with different,
non-overlapping byte offsets:

	[---------- 72K extent at 1103101952 ----------]
	[--8K--|---4K unreachable---|--------60K-------]
	    ^                                ^
	    |                                |
	[--8K ref offset 0--][--4K ref offset 0--][--60K ref offset 12K--]
	                             |
	                             v
	                      [-4K extent-] at 1103175680

We now want to find all of the references to extent bytenr 1103101952.
Without the patch (and without running btrfs-debug-tree), we have to do
it with 18 LOGICAL_INO calls:

	root@tester:~# btrfs ins log 1103101952 -P /test/
	Using LOGICAL_INO
	inode 261 offset 0 root 5

	root@tester:~# for x in $(seq 0 17); do btrfs ins log $((1103101952 + x * 4096)) -P /test/; done 2>&1 | grep inode
	inode 261 offset 0 root 5
	inode 261 offset 4096 root 5   <- same extent ref as offset 0
	                                  (offset 8192 returns empty set, not reachable)
	inode 261 offset 12288 root 5
	inode 261 offset 16384 root 5  \
	inode 261 offset 20480 root 5  |
	inode 261 offset 24576 root 5  |
	inode 261 offset 28672 root 5  |
	inode 261 offset 32768 root 5  |
	inode 261 offset 36864 root 5  \  all the same extent ref as offset 12288.
	inode 261 offset 40960 root 5  >  More processing required in userspace
	inode 261 offset 45056 root 5  /  to figure out these are all duplicates.
	inode 261 offset 49152 root 5  |
	inode 261 offset 53248 root 5  |
	inode 261 offset 57344 root 5  |
	inode 261 offset 61440 root 5  |
	inode 261 offset 65536 root 5  |
	inode 261 offset 69632 root 5  /

In the worst case the extents are 128MB long, and we have to do 32768
iterations of the loop to find one 4K extent ref.

With the patch, we just use one call to map all refs to the extent at
once:

	root@tester:~# btrfs ins log 1103101952 -P /test/
	Using LOGICAL_INO_V2
	inode 261 offset 0 root 5
	inode 261 offset 12288 root 5

The TREE_SEARCH ioctl allows userspace to retrieve the offset and
extent bytenr fields easily once the root, inode and offset are known.
This is sufficient information to build a complete map of the extent
and all of its references.  Userspace can use this information to make
better choices to dedup or defrag.
[PATCH 3/3] btrfs: increase output size for LOGICAL_INO_V2 ioctl
Build-server workloads have hundreds of references per file after
dedup.  Multiply by a few snapshots and we quickly exhaust the limit
of 2730 references per extent that can fit into a 64K buffer.

Raise the limit to 16M to be consistent with other btrfs ioctls
(e.g. TREE_SEARCH_V2, FILE_EXTENT_SAME).

To minimize surprising userspace behavior, apply this change only to
the LOGICAL_INO_V2 ioctl.

Signed-off-by: Zygo Blaxell
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2bc3a9588d1d..4be9b1791f58 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4554,6 +4554,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 
 	if (version == 1) {
 		ignore_offset = false;
+		size = min_t(u32, loi->size, SZ_64K);
 	} else {
 		/* Only accept flags we have defined so far */
 		if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) {
@@ -4561,6 +4562,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 			goto out_loi;
 		}
 		ignore_offset = loi->flags & BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+		size = min_t(u32, loi->size, SZ_16M);
 	}
 
 	path = btrfs_alloc_path();
@@ -4569,7 +4571,6 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 		goto out;
 	}
 
-	size = min_t(u32, loi->size, SZ_64K);
 	inodes = init_data_container(size);
 	if (IS_ERR(inodes)) {
 		ret = PTR_ERR(inodes);
--
2.11.0
[PATCH 2/3] btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
Now that check_extent_in_eb()'s extent offset filter can be turned off,
we need a way to do it from userspace.

Add a 'flags' field to the btrfs_logical_ino_args structure to disable
extent offset filtering, taking the place of one of the reserved[]
fields.

Previous versions of LOGICAL_INO neglected to check whether any of the
reserved fields have non-zero values.  Assigning meaning to those
fields now may change the behavior of existing programs that left
these fields uninitialized.  To avoid any surprises, define a new
ioctl LOGICAL_INO_V2 which uses the same argument layout as
LOGICAL_INO, but uses one of the reserved fields for flags.  The V2
ioctl explicitly checks that unsupported flag bits are zero so that
userspace can probe for future feature bits as they are defined.  If
the other reserved fields are used in the future, one of the remaining
flag bits could specify that the other reserved fields are valid, so
we don't need to check those for now.

Since the memory layouts and behavior of the two ioctls' arguments are
almost identical, there is no need for a separate function for
logical_to_ino_v2 (contrast with tree_search_v2 vs tree_search).  A
version parameter and an 'if' statement will suffice.

Now that we have a flags field in logical_ino_args, add a flag
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want, and
pass it down the stack to iterate_inodes_from_logical.
Signed-off-by: Zygo Blaxell
---
 fs/btrfs/ioctl.c           | 21 ++++++++++++++++++---
 include/uapi/linux/btrfs.h |  8 +++++++-
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b7de32568082..2bc3a9588d1d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4536,13 +4536,14 @@ static int build_ino_list(u64 inum, u64 offset, u64 root, void *ctx)
 }
 
 static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
-					void __user *arg)
+					void __user *arg, int version)
 {
 	int ret = 0;
 	int size;
 	struct btrfs_ioctl_logical_ino_args *loi;
 	struct btrfs_data_container *inodes = NULL;
 	struct btrfs_path *path = NULL;
+	bool ignore_offset;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
@@ -4551,6 +4552,17 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 	if (IS_ERR(loi))
 		return PTR_ERR(loi);
 
+	if (version == 1) {
+		ignore_offset = false;
+	} else {
+		/* Only accept flags we have defined so far */
+		if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) {
+			ret = -EINVAL;
+			goto out_loi;
+		}
+		ignore_offset = loi->flags & BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+	}
+
 	path = btrfs_alloc_path();
 	if (!path) {
 		ret = -ENOMEM;
@@ -4566,7 +4578,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 	}
 
 	ret = iterate_inodes_from_logical(loi->logical, fs_info, path,
-					  build_ino_list, inodes, false);
+					  build_ino_list, inodes, ignore_offset);
 	if (ret == -EINVAL)
 		ret = -ENOENT;
 	if (ret < 0)
@@ -4580,6 +4592,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
 out:
 	btrfs_free_path(path);
 	kvfree(inodes);
+out_loi:
 	kfree(loi);
 
 	return ret;
@@ -5550,7 +5563,9 @@ long btrfs_ioctl(struct file *file, unsigned int
 	case BTRFS_IOC_INO_PATHS:
 		return btrfs_ioctl_ino_to_path(root, argp);
 	case BTRFS_IOC_LOGICAL_INO:
-		return btrfs_ioctl_logical_to_ino(fs_info, argp);
+		return btrfs_ioctl_logical_to_ino(fs_info, argp, 1);
+	case BTRFS_IOC_LOGICAL_INO_V2:
+		return btrfs_ioctl_logical_to_ino(fs_info, argp, 2);
 	case BTRFS_IOC_SPACE_INFO:
 		return btrfs_ioctl_space_info(fs_info, argp);
 	case BTRFS_IOC_SYNC: {
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 378230c163d5..0b3de597e04f 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -608,10 +608,14 @@ struct btrfs_ioctl_ino_path_args {
 struct btrfs_ioctl_logical_ino_args {
 	__u64				logical;	/* in */
 	__u64				size;		/* in */
-	__u64				reserved[4];
+	__u64				flags;		/* in, v2 only */
+	__u64				reserved[3];
 	/* struct btrfs_data_container	*inodes; out */
 	__u64				inodes;
 };
+
+/* Return every ref to the extent, not just those containing logical block.
+ * Requires logical ==
[PATCH 1/3] btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents
The LOGICAL_INO ioctl provides a backward mapping from extent bytenr
and offset (encoded as a single logical address) to a list of extent
refs.  LOGICAL_INO complements TREE_SEARCH, which provides the forward
mapping (extent ref -> extent bytenr and offset, or logical address).
These are useful capabilities for programs that manipulate extents and
extent references from userspace (e.g. dedup and defrag utilities).

When the extents are uncompressed (and not encrypted and not other),
check_extent_in_eb performs filtering of the extent refs to remove any
extent refs which do not contain the same extent offset as the
'logical' parameter's extent offset.  This prevents LOGICAL_INO from
returning references to more than a single block.

To find the set of extent references to an uncompressed extent from
[a, b), userspace has to run a loop like this pseudocode:

	for (i = a; i < b; ++i)
		extent_ref_set += LOGICAL_INO(i);

At each iteration of the loop (up to 32768 iterations for a 128M
extent), data we are interested in is collected in the kernel, then
deleted by the filter in check_extent_in_eb.

When the extents are compressed (or encrypted or other), the 'logical'
parameter must be an extent bytenr (the 'a' parameter in the loop).
No filtering by extent offset is done (or possible?) so the result is
the complete set of extent refs for the entire extent.  This removes
the need for the loop, since we get all the extent refs in one call.

Add an 'ignore_offset' argument to iterate_inodes_from_logical,
[...several levels of function call graph...], and check_extent_in_eb,
so that we can disable the extent offset filtering for uncompressed
extents.  This flag can be set by an improved version of the
LOGICAL_INO ioctl to get either behavior as desired.

There is no functional change in this patch.  The new flag is always
false.
Signed-off-by: Zygo Blaxell
---
 fs/btrfs/backref.c            | 63 ++-
 fs/btrfs/backref.h            |  8 +++---
 fs/btrfs/inode.c              |  2 +-
 fs/btrfs/ioctl.c              |  2 +-
 fs/btrfs/qgroup.c             |  8 +++---
 fs/btrfs/scrub.c              |  6 ++---
 fs/btrfs/send.c               |  2 +-
 fs/btrfs/tests/qgroup-tests.c | 20 +++---
 8 files changed, 63 insertions(+), 48 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index b517ef1477ea..a2609786cd86 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -40,12 +40,14 @@ static int check_extent_in_eb(const struct btrfs_key *key,
 			      const struct extent_buffer *eb,
 			      const struct btrfs_file_extent_item *fi,
 			      u64 extent_item_pos,
-			      struct extent_inode_elem **eie)
+			      struct extent_inode_elem **eie,
+			      bool ignore_offset)
 {
 	u64 offset = 0;
 	struct extent_inode_elem *e;
 
-	if (!btrfs_file_extent_compression(eb, fi) &&
+	if (!ignore_offset &&
+	    !btrfs_file_extent_compression(eb, fi) &&
 	    !btrfs_file_extent_encryption(eb, fi) &&
 	    !btrfs_file_extent_other_encoding(eb, fi)) {
 		u64 data_offset;
@@ -84,7 +86,8 @@ static void free_inode_elem_list(struct extent_inode_elem *eie)
 static int find_extent_in_eb(const struct extent_buffer *eb,
 			     u64 wanted_disk_byte, u64 extent_item_pos,
-			     struct extent_inode_elem **eie)
+			     struct extent_inode_elem **eie,
+			     bool ignore_offset)
 {
 	u64 disk_byte;
 	struct btrfs_key key;
@@ -113,7 +116,7 @@ static int find_extent_in_eb(const struct extent_buffer *eb,
 		if (disk_byte != wanted_disk_byte)
 			continue;
 
-		ret = check_extent_in_eb(&key, eb, fi, extent_item_pos, eie);
+		ret = check_extent_in_eb(&key, eb, fi, extent_item_pos, eie, ignore_offset);
 		if (ret < 0)
 			return ret;
 	}
@@ -419,7 +422,7 @@ static int add_indirect_ref(const struct btrfs_fs_info *fs_info,
 static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
 			   struct ulist *parents, struct prelim_ref *ref,
 			   int level, u64 time_seq, const u64 *extent_item_pos,
-			   u64 total_refs)
+			   u64 total_refs, bool ignore_offset)
 {
 	int ret = 0;
 	int slot;
@@ -472,7 +475,7 @@ static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
 		if (extent_item_pos) {
 			ret = check_extent_in_eb(&key, eb, fi,
 						 *extent_item_pos,
-						 &eie);
+						 &eie, ignore_offset);
Re: btrfs-progs: suggestion of removing --commit-after option of subvol delete
On 2017/09/21 10:49, Misono, Tomohiro wrote:
> On 2017/09/20 23:21, Qu Wenruo wrote:
>> On 2017/09/20 22:03, David Sterba wrote:
>>> On Wed, Sep 20, 2017 at 08:22:54AM +0800, Qu Wenruo wrote:
>>>> The costly part will be tracking the filesystems of subvolumes.
>>>> We must do it for each subvolume and batch them to address the final
>>>> transaction commit for each fs.
>>>>
>>>> I didn't see any easy ioctl to get the UUID from fd, meaning (if didn't
>>>> miss anything) we need to iterate the path until reaching the mount
>>>> boundary and then refer to mountinfo to find the fs.
>>>
>>> The FS_INFO ioctl will tell you the fsid from a fd. Iterating the paths
>>> will not work reliably, due to some potentially very creative use of
>>> mounts that could build up the path components.
>>
>> The biggest concern is about how to get fsid reliably.
>> And since FS_INFO ioctl solves this, I'm completely fine.
>> (I just search "UUID" for ioctls and get no hit, so I assume it may be a
>> problem)
>>
>>>> Not to mention that the possible softlink in the path may make things
>>>> more complex.
>>>>
>>>> Yes, this may be fixed with tons of code, but I don't think the
>>>> complexity worthy for a minor feature.
>>>
>>> So what if we fix it like that:
>>>
>>> - iterate over subvolumes
>>> - check if the fsid is same as for the previous subvolume
>>>   - if it is, continue
>>>   - if not, do the sync
>>
>> I prefer to maintain a subvolume <-> fs mapping, and only address
>> transaction commit after all subvolumes are iterated.
>>
>>> IOW, sync when we leave subvolumes of one filesystem. In the degenerate
>>> case, we can have subvolumes interleaved that we'd sync after each,
>>> effectively --commit-each.
>>
>> Since we have a subvolume <-> fs mapping, we can commit each fs after
>> iterating all subvolumes.
>>
>> AFAIK "--commit-after" user would like to introduce the minimal commits,
>> so each fs only get committed once seems better to me.
>>
>>> The simple tracking should avoid keeping all the filedescriptors open
>>> and I think this won't need tons of code to implement.
>>
>> Indeed, since there is reliable fs UUID ioctl.
>>
>> And above subvolume <-> fs mapping can also be used to resolve missing
>> fd problems.
>> Just adding a fds <-> fs mapping, and try to call commit ioctl on first
>> good fd.
>>
>>> The typical could be 'btrfs subvol delete -c path/*', ie. deleting from
>>> just one filesystem. If we happen to cross more filesystems, each of
>>> them will get the sync, but possibly more than one. I don't think this
>>> is too serious.
>>
>> Well, after we implemented above user case, some facility can be reused
>> for this case.
>>
>> Thanks,
>> Qu
>
> OK, I understood the points and will try to make a patch.
>
> btw, by saying "some facility can be reused", do you mean we can reuse
> fds <-> fs mapping function to other than delete?

Yes, I think there may be some other use cases that can benefit from such
mapping.

Thanks,
Qu

> Thanks,
> Tomohiro
Re: btrfs-progs: suggestion of removing --commit-after option of subvol delete
On 2017/09/20 23:21, Qu Wenruo wrote:
> On 2017/09/20 22:03, David Sterba wrote:
>> On Wed, Sep 20, 2017 at 08:22:54AM +0800, Qu Wenruo wrote:
>>> The costly part will be tracking the filesystems of subvolumes.
>>> We must do it for each subvolume and batch them to address the final
>>> transaction commit for each fs.
>>>
>>> I didn't see any easy ioctl to get the UUID from fd, meaning (if didn't
>>> miss anything) we need to iterate the path until reaching the mount
>>> boundary and then refer to mountinfo to find the fs.
>>
>> The FS_INFO ioctl will tell you the fsid from a fd. Iterating the paths
>> will not work reliably, due to some potentially very creative use of
>> mounts that could build up the path components.
>
> The biggest concern is about how to get fsid reliably.
> And since FS_INFO ioctl solves this, I'm completely fine.
> (I just search "UUID" for ioctls and get no hit, so I assume it may be a
> problem)
>
>>> Not to mention that the possible softlink in the path may make things
>>> more complex.
>>>
>>> Yes, this may be fixed with tons of code, but I don't think the
>>> complexity worthy for a minor feature.
>>
>> So what if we fix it like that:
>>
>> - iterate over subvolumes
>> - check if the fsid is same as for the previous subvolume
>>   - if it is, continue
>>   - if not, do the sync
>
> I prefer to maintain a subvolume <-> fs mapping, and only address
> transaction commit after all subvolumes are iterated.
>
>> IOW, sync when we leave subvolumes of one filesystem. In the degenerate
>> case, we can have subvolumes interleaved that we'd sync after each,
>> effectively --commit-each.
>
> Since we have a subvolume <-> fs mapping, we can commit each fs after
> iterating all subvolumes.
>
> AFAIK "--commit-after" user would like to introduce the minimal commits,
> so each fs only get committed once seems better to me.
>
>> The simple tracking should avoid keeping all the filedescriptors open
>> and I think this won't need tons of code to implement.
>
> Indeed, since there is reliable fs UUID ioctl.
>
> And above subvolume <-> fs mapping can also be used to resolve missing
> fd problems.
> Just adding a fds <-> fs mapping, and try to call commit ioctl on first
> good fd.
>
>> The typical could be 'btrfs subvol delete -c path/*', ie. deleting from
>> just one filesystem. If we happen to cross more filesystems, each of
>> them will get the sync, but possibly more than one. I don't think this
>> is too serious.
>
> Well, after we implemented above user case, some facility can be reused
> for this case.
>
> Thanks,
> Qu

OK, I understood the points and will try to make a patch.

btw, by saying "some facility can be reused", do you mean we can reuse
fds <-> fs mapping function to other than delete?

Thanks,
Tomohiro
[PATCH] fstests: btrfs/150 regression test for reading compressed data
We had a bug in btrfs compression code which could end up with a
kernel panic.

This is adding a regression test for the bug and I've also sent a
kernel patch to fix the bug.

The patch is "Btrfs: fix kernel oops while reading compressed data".

Signed-off-by: Liu Bo
---
 tests/btrfs/150     | 102
 tests/btrfs/150.out |   3 ++
 tests/btrfs/group   |   1 +
 3 files changed, 106 insertions(+)
 create mode 100755 tests/btrfs/150
 create mode 100644 tests/btrfs/150.out

diff --git a/tests/btrfs/150 b/tests/btrfs/150
new file mode 100755
index 000..834be51
--- /dev/null
+++ b/tests/btrfs/150
@@ -0,0 +1,102 @@
+#! /bin/bash
+# FS QA Test btrfs/150
+#
+# This is a regression test which ends up with a kernel oops in btrfs.
+# It occurs when btrfs's read repair happens while reading a compressed
+# extent.
+# The patch for this is
+# x
+#
+#---
+# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_fail_make_request
+_require_scratch_dev_pool 2
+
+SYSFS_BDEV=`_sysfs_dev $SCRATCH_DEV`
+enable_io_failure()
+{
+	echo 100 > $DEBUGFS_MNT/fail_make_request/probability
+	echo 1000 > $DEBUGFS_MNT/fail_make_request/times
+	echo 0 > $DEBUGFS_MNT/fail_make_request/verbose
+	echo 1 > $SYSFS_BDEV/make-it-fail
+}
+
+disable_io_failure()
+{
+	echo 0 > $SYSFS_BDEV/make-it-fail
+	echo 0 > $DEBUGFS_MNT/fail_make_request/probability
+	echo 0 > $DEBUGFS_MNT/fail_make_request/times
+}
+
+_scratch_pool_mkfs "-d raid1 -b 1G" >> $seqres.full 2>&1
+
+# It doesn't matter which compression algorithm we use.
+_scratch_mount -ocompress
+
+# Create a file with all data being compressed
+$XFS_IO_PROG -f -c "pwrite -W 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io
+
+# Raid1 consists of two copies and btrfs decides which copy to read by reader's
+# %pid. Now we inject errors to copy #1 and copy #0 is good. We want to read
+# the bad copy to trigger read-repair.
+while true; do
+	disable_io_failure
+	# invalidate the page cache
+	$XFS_IO_PROG -f -c "fadvise -d 0 128K" $SCRATCH_MNT/foobar | _filter_xfs_io
+
+	enable_io_failure
+	od -x $SCRATCH_MNT/foobar > /dev/null &
+	pid=$!
+	wait
+	[ $((pid % 2)) == 1 ] && break
+done
+
+disable_io_failure
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/150.out b/tests/btrfs/150.out
new file mode 100644
index 000..c492c24
--- /dev/null
+++ b/tests/btrfs/150.out
@@ -0,0 +1,3 @@
+QA output created by 150
+wrote 8192/8192 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 70c3f05..b70a122 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -152,3 +152,4 @@
 147 auto quick send
 148 auto quick rw
 149 auto quick send compress
+150 auto quick
--
2.5.0
[PATCH 2/2] Btrfs: skip checksum when reading compressed data if some IO have failed
Currently even if the underlying disk reports failure on IO, compressed
read endio still gets to verify checksum and reports it as a checksum
error.

In fact, if some IO have failed during reading a compressed data
extent, there's no way the checksum could match, therefore, we can skip
that in order to return error quickly to the upper layer.

Please note that we need to do this after recording the failed mirror
index so that read-repair in the upper layer's endio can work properly.

Signed-off-by: Liu Bo
---
 fs/btrfs/compression.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 9a4ccdf..d8d4678 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -108,7 +108,7 @@ static void end_compressed_bio_read(struct bio *bio)
 	struct page *page;
 	unsigned long index;
 	unsigned int mirror = btrfs_io_bio(bio)->mirror_num;
-	int ret;
+	int ret = 0;
 
 	if (bio->bi_status)
 		cb->errors = 1;
@@ -127,6 +127,13 @@ static void end_compressed_bio_read(struct bio *bio)
 	btrfs_io_bio(cb->orig_bio)->mirror_num = mirror;
 	cb->mirror_num = mirror;
 
+	/*
+	 * Some IO in this cb have failed, just skip checksum as there
+	 * is no way it could be correct.
+	 */
+	if (cb->errors == 1)
+		goto csum_failed;
+
 	inode = cb->inode;
 	ret = check_compressed_csum(BTRFS_I(inode), cb,
 				    (u64)bio->bi_iter.bi_sector << 9);
--
2.9.4
[PATCH 1/2] Btrfs: fix kernel oops while reading compressed data
The kernel oops happens at

	kernel BUG at fs/btrfs/extent_io.c:2104!
	...
	RIP: clean_io_failure+0x263/0x2a0 [btrfs]

It's showing that read-repair code is using an improper mirror index.
This is due to the fact that compression read's endio hasn't recorded
the failed mirror index in %cb->orig_bio.

With this, btrfs's read-repair can work properly on reading compressed
data.

Signed-off-by: Liu Bo
Reported-by: Paul Jones
---
 fs/btrfs/compression.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index d2ef9ac..9a4ccdf 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -107,6 +107,7 @@ static void end_compressed_bio_read(struct bio *bio)
 	struct inode *inode;
 	struct page *page;
 	unsigned long index;
+	unsigned int mirror = btrfs_io_bio(bio)->mirror_num;
 	int ret;
 
 	if (bio->bi_status)
@@ -118,6 +119,14 @@ static void end_compressed_bio_read(struct bio *bio)
 	if (!refcount_dec_and_test(&cb->pending_bios))
 		goto out;
 
+	/*
+	 * Record the correct mirror_num in cb->orig_bio so that
+	 * read-repair can work properly.
+	 */
+	ASSERT(btrfs_io_bio(cb->orig_bio));
+	btrfs_io_bio(cb->orig_bio)->mirror_num = mirror;
+	cb->mirror_num = mirror;
+
 	inode = cb->inode;
 	ret = check_compressed_csum(BTRFS_I(inode), cb,
 				    (u64)bio->bi_iter.bi_sector << 9);
--
2.9.4
[PATCH 1/3] btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents
The LOGICAL_INO ioctl provides a backward mapping from extent bytenr
and offset (encoded as a single logical address) to a list of extent
refs.  LOGICAL_INO complements TREE_SEARCH, which provides the forward
mapping (extent ref -> extent bytenr and offset, or logical address).
These are useful capabilities for programs that manipulate extents and
extent references from userspace (e.g. dedup and defrag utilities).

When the extents are uncompressed (and not encrypted and not other),
check_extent_in_eb performs filtering of the extent refs to remove any
extent refs which do not contain the same extent offset as the
'logical' parameter's extent offset.  This prevents LOGICAL_INO from
returning references to more than a single block.

To find the set of extent references to an uncompressed extent from
[a, b), userspace has to run a loop like this pseudocode:

	for (i = a; i < b; ++i)
		extent_ref_set += LOGICAL_INO(i);

At each iteration of the loop (up to 32768 iterations for a 128M
extent), data we are interested in is collected in the kernel, then
deleted by the filter in check_extent_in_eb.

When the extents are compressed (or encrypted or other), the 'logical'
parameter must be an extent bytenr (the 'a' parameter in the loop).
No filtering by extent offset is done (or possible?) so the result is
the complete set of extent refs for the entire extent.  This removes
the need for the loop, since we get all the extent refs in one call.

Add an 'ignore_offset' argument to iterate_inodes_from_logical,
[...several levels of function call graph...], and check_extent_in_eb,
so that we can disable the extent offset filtering for uncompressed
extents.  This flag can be set by an improved version of the
LOGICAL_INO ioctl to get either behavior as desired.

There is no functional change in this patch.  The new flag is always
false.
Signed-off-by: Zygo Blaxell --- fs/btrfs/backref.c | 62 -- fs/btrfs/backref.h | 8 --- fs/btrfs/inode.c | 2 +- fs/btrfs/ioctl.c | 2 +- fs/btrfs/qgroup.c | 8 +++ fs/btrfs/scrub.c | 6 +++--- fs/btrfs/send.c | 2 +- 7 files changed, 52 insertions(+), 38 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index 1d71a5a4b1b9..3bffd36c6897 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -302,12 +302,14 @@ static int ref_tree_add(struct ref_root *ref_tree, u64 root_id, u64 object_id, static int check_extent_in_eb(struct btrfs_key *key, struct extent_buffer *eb, struct btrfs_file_extent_item *fi, u64 extent_item_pos, - struct extent_inode_elem **eie) + struct extent_inode_elem **eie, + bool ignore_offset) { u64 offset = 0; struct extent_inode_elem *e; - if (!btrfs_file_extent_compression(eb, fi) && + if (!ignore_offset && + !btrfs_file_extent_compression(eb, fi) && !btrfs_file_extent_encryption(eb, fi) && !btrfs_file_extent_other_encoding(eb, fi)) { u64 data_offset; @@ -346,7 +348,8 @@ static void free_inode_elem_list(struct extent_inode_elem *eie) static int find_extent_in_eb(struct extent_buffer *eb, u64 wanted_disk_byte, u64 extent_item_pos, - struct extent_inode_elem **eie) + struct extent_inode_elem **eie, + bool ignore_offset) { u64 disk_byte; struct btrfs_key key; @@ -375,7 +378,7 @@ static int find_extent_in_eb(struct extent_buffer *eb, u64 wanted_disk_byte, if (disk_byte != wanted_disk_byte) continue; - ret = check_extent_in_eb(&key, eb, fi, extent_item_pos, eie); + ret = check_extent_in_eb(&key, eb, fi, extent_item_pos, eie, ignore_offset); if (ret < 0) return ret; } @@ -511,7 +514,7 @@ static int __add_prelim_ref(struct list_head *head, u64 root_id, static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path, struct ulist *parents, struct __prelim_ref *ref, int level, u64 time_seq, const u64 *extent_item_pos, - u64 total_refs) + u64 total_refs, bool ignore_offset) { int ret = 0; int slot; @@ -564,7 +567,7 @@ static int
add_all_parents(struct btrfs_root *root, struct btrfs_path *path, if (extent_item_pos) { ret = check_extent_in_eb(&key, eb, fi, *extent_item_pos, - &eie); + &eie, ignore_offset);
[PATCH 2/3] btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2
Now that check_extent_in_eb()'s extent offset filter can be turned off, we need a way to do it from userspace. Add a 'flags' field to the btrfs_logical_ino_args structure to disable extent offset filtering, taking the place of one of the reserved[] fields. Previous versions of LOGICAL_INO neglected to check whether any of the reserved fields have non-zero values. Assigning meaning to those fields now may change the behavior of existing programs that left these fields uninitialized. To avoid any surprises, define a new ioctl LOGICAL_INO_V2 which uses the same argument layout as LOGICAL_INO, but uses one of the reserved fields for flags. The V2 ioctl explicitly checks that unsupported flag bits are zero so that userspace can probe for future feature bits as they are defined. If the other reserved fields are used in the future, one of the remaining flag bits could specify that the other reserved fields are valid, so we don't need to check those for now. Since the memory layouts and behavior of the two ioctls' arguments are almost identical, there is no need for a separate function for logical_to_ino_v2 (contrast with tree_search_v2 vs tree_search). A version parameter and an 'if' statement will suffice. Now that we have a flags field in logical_ino_args, add a flag BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want, and pass it down the stack to iterate_inodes_from_logical. 
Signed-off-by: Zygo Blaxell --- fs/btrfs/ioctl.c | 21 ++--- include/uapi/linux/btrfs.h | 8 +++- 2 files changed, 25 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index c6787660d91f..def0ab85134a 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4542,13 +4542,14 @@ static int build_ino_list(u64 inum, u64 offset, u64 root, void *ctx) } static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info, - void __user *arg) + void __user *arg, int version) { int ret = 0; int size; struct btrfs_ioctl_logical_ino_args *loi; struct btrfs_data_container *inodes = NULL; struct btrfs_path *path = NULL; + bool ignore_offset; if (!capable(CAP_SYS_ADMIN)) return -EPERM; @@ -4557,6 +4558,17 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info, if (IS_ERR(loi)) return PTR_ERR(loi); + if (version == 1) { + ignore_offset = false; + } else { + /* Only accept flags we have defined so far */ + if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) { + ret = -EINVAL; + goto out_loi; + } + ignore_offset = loi->flags & BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET; + } + path = btrfs_alloc_path(); if (!path) { ret = -ENOMEM; @@ -4572,7 +4584,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info, } ret = iterate_inodes_from_logical(loi->logical, fs_info, path, - build_ino_list, inodes, false); + build_ino_list, inodes, ignore_offset); if (ret == -EINVAL) ret = -ENOENT; if (ret < 0) @@ -4586,6 +4598,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info, out: btrfs_free_path(path); vfree(inodes); +out_loi: kfree(loi); return ret; @@ -5559,7 +5572,9 @@ long btrfs_ioctl(struct file *file, unsigned int case BTRFS_IOC_INO_PATHS: return btrfs_ioctl_ino_to_path(root, argp); case BTRFS_IOC_LOGICAL_INO: - return btrfs_ioctl_logical_to_ino(fs_info, argp, 1); + return btrfs_ioctl_logical_to_ino(fs_info, argp, 1); + case BTRFS_IOC_LOGICAL_INO_V2: + return btrfs_ioctl_logical_to_ino(fs_info, argp, 2); case
BTRFS_IOC_SPACE_INFO: return btrfs_ioctl_space_info(fs_info, argp); case BTRFS_IOC_SYNC: { diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h index a456e5309238..a23555026994 100644 --- a/include/uapi/linux/btrfs.h +++ b/include/uapi/linux/btrfs.h @@ -591,10 +591,14 @@ struct btrfs_ioctl_ino_path_args { struct btrfs_ioctl_logical_ino_args { __u64 logical;/* in */ __u64 size; /* in */ - __u64 reserved[4]; + __u64 flags; /* in, v2 only */ + __u64 reserved[3]; /* struct btrfs_data_container *inodes;out */ __u64 inodes; }; +/* Return every ref to the extent, not just those containing logical block. + * Requires logical ==
[PATCH 3/3] btrfs: increase output size for LOGICAL_INO_V2 ioctl
Build-server workloads have hundreds of references per file after dedup. Multiply by a few snapshots and we quickly exhaust the limit of 2730 references per extent that can fit into a 64K buffer. Raise the limit to 16M to be consistent with other btrfs ioctls (e.g. TREE_SEARCH_V2, FILE_EXTENT_SAME). To minimize surprising userspace behavior, apply this change only to the LOGICAL_INO_V2 ioctl. Signed-off-by: Zygo Blaxell --- fs/btrfs/ioctl.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index def0ab85134a..e13fea25ecb8 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -4560,6 +4560,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info, if (version == 1) { ignore_offset = false; + size = min_t(u32, loi->size, SZ_64K); } else { /* Only accept flags we have defined so far */ if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) { @@ -4567,6 +4568,7 @@ goto out_loi; } ignore_offset = loi->flags & BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET; + size = min_t(u32, loi->size, SZ_16M); } path = btrfs_alloc_path(); @@ -4575,7 +4577,6 @@ goto out; } - size = min_t(u32, loi->size, SZ_64K); inodes = init_data_container(size); if (IS_ERR(inodes)) { ret = PTR_ERR(inodes); -- 2.11.0
Re: ERROR: parent determination failed (btrfs send-receive)
On Tue, Sep 19, 2017 at 3:37 PM, Andrei Borzenkov wrote: > 18.09.2017 09:10, Dave wrote: >> I use snap-sync to create and send snapshots. >> >> GitHub - wesbarnett/snap-sync: Use snapper snapshots to backup to external >> drive >> https://github.com/wesbarnett/snap-sync >> > > Are you trying to backup top-level subvolume? No. I never mount the top-level subvolume. > I just reproduced this > behavior with this tool. The problem is, snapshots > of top-level > subvolume do not have parent UUID (I am not even sure if UUID exists at > all TBH). If you mount any other subvolume, it will work. Well, that's not exactly my issue but it still helps me gain additional understanding. It may help me reproduce my original problem too. Thank you. > > > As I told you in the first reply, showing output of "btrfs su li -qu > /path/to/src" would explain your problem much earlier. > > Actually if snap-sync used "btrfs send -p" instead of "btrfs send -c" it > would work as well, as then no parent search would be needed (and as I > mentioned in another mail both commands are functionally equivalent). > But this becomes really off-topic on this list. As already suggested, > open issue for snap-sync. Thank you for this also. I will simply edit my current copy of the snap-sync script to use "-p" now. That's a simple change I can implement immediately.
Re: RFC: Btrfs does not store checksum type? Only size? O_o
On 2017年09月21日 04:31, Timofey Titovets wrote: I was reading the kernel code and docs and didn't find anything about some info like "checksum type" as btrfs have for compression code. (I only found some silent "hack" in btrfs-progs, that print "checksum type", but get "checksum size") Checksum type is global, so one btrfs fs uses the same checksum type for all its meta and data. So checksum type is in superblock. Check btrfs_super_block->csum_type. If i understood all correctly btrfs only store checksum size, i.e. 4 for current CRC32C. Yes. May be that can be useful to rewrite some part of code and reuse checksum size, as checksum type. For backward compatibility current CRC32C size can be used for identify CRC32C. Since we have superblock csum type, why introducing new one? (Unless you're going to allow multi csum algorithm to co-exist in one btrfs fs) Thanks, Qu As that will not add any new checksum, that allow make that with zero "blood". And later if we add new checksum type, that possible to add compat flag for that. What you think? Thanks.
Re: WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]
On 2017年09月21日 02:10, Rich Rauenzahn wrote: On 9/20/2017 9:58 AM, Rich Rauenzahn wrote: What's the most direct way to do that? (Was about to risk breaking the mirror and repartitioning! I'd rather not!) Hmm -- maybe this worked: $ sudo btrfs filesystem resize -1m /.MEDIA/ Resize '/.MEDIA/' of '-1m' No, doesn't seem to have worked. I used btrfs fi show --raw /.MEDIA to see the size in bytes, and then ... sudo btrfs filesystem resize N:SIZE /.MEDIA/ Yes, this is what I did to my fs. While in fact you can do it better by using: btrfs fi resize N:max Or btrfs fi resize N:-1 The first one will resize the devid N to its max size. The 2nd one will resize the devid N to current size - 1 (byte). Both will go through shrink routine, which will do the round_down. Thanks, Qu Where N was the disk number in the show output, and SIZE was the number of bytes to set it to that was a multiple of 4k. Rich
Re: Does btrfs use crc32 for error correction?
2017-09-19 19:51 GMT+03:00 Eric Sandeen: > On 9/19/17 10:35 AM, Timofey Titovets wrote: >> Stupid question: >> Does btrfs use crc32 for error correction? >> If no, why? >> >> (AFAIK if using CRC that possible to fix 1 bit flip) >> >> P.S. I try check that (i create image, create text file, flip bit, try >> read and btrfs show IO-error) >> >> Thanks! > > I wasn't aware that crc32 could (in general) be used for single bit > correction; I've read up on that, and it seems pretty cool. > > However, I don't think that the generator polynomial used in crc32c /can/ be > used for error correction. I just skimmed some reading, but that seems to be > the case if I understand it correctly. > > -Eric That possible, but if i understood doc about CRC16 bit correction correctly, that need some tricks (As example preprocess all possible bit flips and have a map of that) for fast find bit error. Also in theory for CRC32C that possible with some state machine, to move algo backward and forward over data. I also check [1], so CRC32C (for our case) have hamming distance about 4. So CRC32 in theory can fix up to HD-1 ~= 3 bit (If i understand all things correctly); So that possible to just brute force over data and flip 1-3 bit, and try check if CRC fixed %) In theory that safe for 1-2 bit error (Because again HD for input data) (Proof of concept[2]). Thanks. 1. https://users.ece.cmu.edu/~koopman/networks/dsn02/dsn02_koopman.pdf 2. https://github.com/Nefelim4ag/CRC32C_AND_8Byte_parity/commit/cb0fe76eac2ff036468bbc5ea75ebc06cb1e9928 -- Have a nice day, Timofey.
Re: SSD caching an existing btrfs raid1
Am Wed, 20 Sep 2017 17:51:15 +0200 schrieb Psalle: > On 19/09/17 17:47, Austin S. Hemmelgarn wrote: > (...) > > > > A better option if you can afford to remove a single device from > > that array temporarily is to use bcache. Bcache has one specific > > advantage in this case, multiple backend devices can share the same > > cache device. This means you don't have to carve out dedicated > > cache space for each disk on the SSD and leave some unused space so > > that you can add new devices if needed. The downside is that you > > can't convert each device in-place, but because you're using BTRFS, > > you can still convert the volume as a whole in-place. The > > procedure for doing so looks like this: > > > > 1. Format the SSD as a bcache cache. > > 2. Use `btrfs device delete` to remove a single hard drive from the > > array. > > 3. Set up the drive you just removed as a bcache backing device > > bound to the cache you created in step 1. > > 4. Add the new bcache device to the array. > > 5. Repeat from step 2 until the whole array is converted. > > > > A similar procedure can actually be used to do almost any > > underlying storage conversion (for example, switching to whole disk > > encryption, or adding LVM underneath BTRFS) provided all your data > > can fit on one less disk than you have. > > Thanks Austin, that's just great. For some reason I had discarded > bcache thinking that it would force me to rebuild from scratch, but > this kind of incremental migration is exactly why I hoped was > possible. I have plenty of space to replace the devices one by one. > > I will report back my experience in a few days, I hope. I've done it exactly that way in the past and it worked flawlessly (but it took 24+ hours). But it was easy for me because I was also adding a third disk to the pool, so existing stuff could easily move. I suggest to initialize bcache to writearound mode while converting, so your maybe terabytes of disk don't go through the SSD. 
If you later decide to remove bcache or are not sure about future bcache usage, you can wrap any partition into a bcache container - just don't connect it to a cache and it will work like a normal partition. -- Regards, Kai Replies to list-only preferred.
Re: RFC: Btrfs does not store checksum type? Only size? O_o
On 09/20/2017 10:31 PM, Timofey Titovets wrote: > I was reading the kernel code and docs and didn't find anything about > some info like "checksum type" as btrfs have for compression code. > (I only found some silent "hack" in btrfs-progs, that print "checksum > type", but get "checksum size") > > If i understood all correctly btrfs only store checksum size, i.e. 4 > for current CRC32C. The checksum type is set in the superblock: csum_type 0 (crc32c) csum_size 4 [...] So in the current design, you probably could have different checksum types, but you will have to choose one of them when creating a filesystem. -- Hans van Kranenburg
RFC: Btrfs does not store checksum type? Only size? O_o
I was reading the kernel code and docs and didn't find anything about some info like "checksum type" as btrfs have for compression code. (I only found some silent "hack" in btrfs-progs, that print "checksum type", but get "checksum size") If i understood all correctly btrfs only store checksum size, i.e. 4 for current CRC32C. May be that can be useful to rewrite some part of code and reuse checksum size, as checksum type. For backward compatibility current CRC32C size can be used for identify CRC32C. As that will not add any new checksum, that allow make that with zero "blood". And later if we add new checksum type, that possible to add compat flag for that. What you think? Thanks. -- Have a nice day, Timofey.
Re: kernel BUG at fs/btrfs/extent_io.c:1989
On Wed, Sep 20, 2017 at 02:53:57PM +0200, David Sterba wrote: > On Tue, Sep 19, 2017 at 10:12:39AM -0600, Liu Bo wrote: > > On Tue, Sep 19, 2017 at 05:07:25PM +0200, David Sterba wrote: > > > On Tue, Sep 19, 2017 at 11:32:46AM +, Paul Jones wrote: > > > > > This 'mirror 0' looks fishy, (as mirror comes from > > > > > btrfs_io_bio->mirror_num, > > > > > which should be at least 1 if raid1 setup is in use.) > > > > > > > > > > Not sure if 4.13.2-gentoo made any changes on btrfs, but can you > > > > > please > > > > > verify with the upstream kernel, say, v4.13? > > > > > > > > It's basically a vanilla kernel with a handful of unrelated patches. > > > > The filesystem fell apart overnight, there were a few thousand > > > > checksum errors and eventually it went read-only. I tried to remount > > > > it, but got open_ctree failed. Btrfs check segfaulted, lowmem mode > > > > completed with so many errors I gave up and will restore from the > > > > backup. > > > > > > > > I think I know the problem now - the lvm cache was in writeback mode > > > > (by accident) so during a defrag there would be gigabytes of unwritten > > > > data in memory only, which was all lost when the system crashed > > > > (motherboard failure). No wonder the filesystem didn't quite survive. > > > > > > Yeah, the caching layer was my first suspicion, and lack of propagating > > > of the barriers. Good that you were able to confirm that as the root > > > cause. > > > > > > > I must say though, I'm seriously impressed at the data integrity of > > > > BTRFS - there were near 10,000 checksum errors, 4 which were > > > > uncorrectable, and from what I could tell nearly all of the data was > > > > still intact according to rsync checksums. > > > > > > Yay! > > > > But still don't get why mirror_num is 0, do you have an idea on how > > does writeback cache make that? > > My first idea was that the cached blocks were zeroed, so we'd see the ino > and mirror as 0. 
But this is not correct as the blocks would not pass > the checksum tests, so the blocks must be from some previous generation. > Ie. the transid verify failure. And all the error reports appear after > that so I'm slightly suspicious about the way it's actually reported. > > btrfs_print_data_csum_error takes mirror from either io_bio or > compressed_bio structures, so there might be a case when the structures > are initialized. If the transid check is ok, then the structures are > updated. If the check fails we'd see the initial mirror number. All of > that is just a hypothesis, I haven't checked with the code. > Thanks a lot for the input, you're right, mirror_num 0 should come from compressed read where it doesn't record the bbio->mirror_num but the mirror passing from the upper layer, and it's not metadata as we don't yet compress metadata, so this all makes sense. I think it also disables the ability of read-repair from raid1 for compressed data, and that's what caused the bug where it hits BUG_ON(mirror_num == 0) in cleanup_io_failure(). The good news is that I can reproduce it, will send a patch and a testcase. > I don't have a theoretical explanation for the ino 0. The inode pointer > that goes to btrfs_print_data_csum_error should be from a properly > initialized inode and we print the number using btrfs_ino. That will use > the vfs i_ino value and we should never get 0 out of that. ino 0 comes from metadata read-repair, some cleanup may be needed to make it less confusing. thanks, -liubo
Re: difference between -c and -p for send-receive?
20.09.2017 22:05, Antoine Belvire пишет: > Hello, > >> All snapshots listed in -c options and snapshot that we want to >> transfer must have the same parent uuid, unless -p is explicitly >> provided. > > It's rather the same mount point than the same parent uuid, like cp > --reflink, isn't it? Sorry, I do not understand this sentence. Could you rephrase? > > ~# btrfs subvolume create /test2/ > Create subvolume '//test2' > ~# btrfs subvolume create /test2/foo > Create subvolume '/test2/foo' > ~# cd /test2 > ~# btrfs subvolume snapshot -r . .1 > Create a readonly snapshot of '.' in './.1' > ~# > ~# # a: 40 MiB in /test2/ > ~# dd if=/dev/urandom of=a bs=4k count=10k > 10240+0 records in > 10240+0 records out > 41943040 bytes (42 MB, 40 MiB) copied, 0.198961 s, 211 MB/s > ~# > ~# # b: 80 MiB in /test2/foo > ~# dd if=/dev/urandom of=foo/b bs=4k count=20k > 0480+0 records in > 20480+0 records out > 83886080 bytes (84 MB, 80 MiB) copied, 0.393823 s, 213 MB/s > ~# > ~# # copy-clone /test2/foo/b to /test2/b > ~# cp --reflink foo/b . > ~# > ~# btrfs subvolume -s . .2 > Create a readonly snapshot of '.' in './.2' > ~# > ~# # Sending .2 with only .1 as parent (.1 already sent) > ~# btrfs send -p .1 .2 | wc -c > At subvol .2 > 125909258 # 120 Mio = 'a' + 'b' > ~# > ~# # Sending .2 with .1 and foo as clone sources (.1 and foo already > ~# # sent), .1 is automatically picked as parent > ~# btrfs property set foo ro true > ~# btrfs send -c .1 -c foo .2 | wc -c > At subvol .2 > 41970349 # 40 Mio, only 'a' > ~# > > UUIDs on the sending side: > > ~# btrfs subvolume list -uq / | grep test2 > ID 6141 gen 454658 top level 6049 parent_uuid - uuid > bbf936dd-ca84-f749-9b9b-09f7081879a2 path test2 > ID 6142 gen 454658 top level 6141 parent_uuid - uuid > 54a7cdea-6198-424a-9349-8116172d0c17 path test2/foo Yes, sorry, I misread the code. We need at least one snapshot from -c option that has the same parent uuid (i.e. 
is snapshot of the same subvolume) as the snapshot we want to transfer, not all of them. In your case btrfs send -c .1 -c foo .2 it will select .1 as base snapshot and additionally try to clone from foo if possible. In wiki example the only two snapshots are from completely different subvolumes which will fail. In discussion that led to this mail snapshots probably did not have any parent uuid at all. > ID 6143 gen 454655 top level 6141 parent_uuid > bbf936dd-ca84-f749-9b9b-09f7081879a2 uuid > 28f1d7db-7341-f545-a2ac-d8819d22a5b5 path test2/.1 > ID 6144 gen 454658 top level 6141 parent_uuid > bbf936dd-ca84-f749-9b9b-09f7081879a2 uuid > db9ad2b1-aee1-544c-b368-d698b4a05119 path test2/.2 > ~# > > On the receiving side, .1 is used as parent: > > ~# btrfs subvolume list -uq /var/run/media/antoine/backups/ | grep dest > ID 298 gen 443 top level 5 parent_uuid - uuid > 7695cba7-dfbf-2f44-bd79-18c9820fdb2f path dest/.1 > ID 299 gen 443 top level 5 parent_uuid - uuid > c32c06ec-0a17-cf42-9b04-3804ad72f836 path dest/foo > ID 300 gen 446 top level 5 parent_uuid > 7695cba7-dfbf-2f44-bd79-18c9820fdb2f uuid > 552b1d51-38bf-d546-a47f-4bc667ec4128 path dest/.2 > ~# > > Regards, > > -- > Antoine
Re: difference between -c and -p for send-receive?
Hello, > All snapshots listed in -c options and snapshot that we want to > transfer must have the same parent uuid, unless -p is explicitly > provided. It's rather the same mount point than the same parent uuid, like cp --reflink, isn't it? ~# btrfs subvolume create /test2/ Create subvolume '//test2' ~# btrfs subvolume create /test2/foo Create subvolume '/test2/foo' ~# cd /test2 ~# btrfs subvolume snapshot -r . .1 Create a readonly snapshot of '.' in './.1' ~# ~# # a: 40 MiB in /test2/ ~# dd if=/dev/urandom of=a bs=4k count=10k 10240+0 records in 10240+0 records out 41943040 bytes (42 MB, 40 MiB) copied, 0.198961 s, 211 MB/s ~# ~# # b: 80 MiB in /test2/foo ~# dd if=/dev/urandom of=foo/b bs=4k count=20k 20480+0 records in 20480+0 records out 83886080 bytes (84 MB, 80 MiB) copied, 0.393823 s, 213 MB/s ~# ~# # copy-clone /test2/foo/b to /test2/b ~# cp --reflink foo/b . ~# ~# btrfs subvolume snapshot -r . .2 Create a readonly snapshot of '.' in './.2' ~# ~# # Sending .2 with only .1 as parent (.1 already sent) ~# btrfs send -p .1 .2 | wc -c At subvol .2 125909258 # 120 Mio = 'a' + 'b' ~# ~# # Sending .2 with .1 and foo as clone sources (.1 and foo already ~# # sent), .1 is automatically picked as parent ~# btrfs property set foo ro true ~# btrfs send -c .1 -c foo .2 | wc -c At subvol .2 41970349 # 40 Mio, only 'a' ~# UUIDs on the sending side: ~# btrfs subvolume list -uq / | grep test2 ID 6141 gen 454658 top level 6049 parent_uuid - uuid bbf936dd-ca84-f749-9b9b-09f7081879a2 path test2 ID 6142 gen 454658 top level 6141 parent_uuid - uuid 54a7cdea-6198-424a-9349-8116172d0c17 path test2/foo ID 6143 gen 454655 top level 6141 parent_uuid bbf936dd-ca84-f749-9b9b-09f7081879a2 uuid 28f1d7db-7341-f545-a2ac-d8819d22a5b5 path test2/.1 ID 6144 gen 454658 top level 6141 parent_uuid bbf936dd-ca84-f749-9b9b-09f7081879a2 uuid db9ad2b1-aee1-544c-b368-d698b4a05119 path test2/.2 ~# On the receiving side, .1 is used as parent: ~# btrfs subvolume list -uq /var/run/media/antoine/backups/ | grep dest
ID 298 gen 443 top level 5 parent_uuid - uuid 7695cba7-dfbf-2f44-bd79-18c9820fdb2f path dest/.1 ID 299 gen 443 top level 5 parent_uuid - uuid c32c06ec-0a17-cf42-9b04-3804ad72f836 path dest/foo ID 300 gen 446 top level 5 parent_uuid 7695cba7-dfbf-2f44-bd79-18c9820fdb2f uuid 552b1d51-38bf-d546-a47f-4bc667ec4128 path dest/.2 ~# Regards, -- Antoine
Re: WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]
On 9/20/2017 9:58 AM, Rich Rauenzahn wrote: What's the most direct way to do that? (Was about to risk breaking the mirror and repartitioning! I'd rather not!) Hmm -- maybe this worked: $ sudo btrfs filesystem resize -1m /.MEDIA/ Resize '/.MEDIA/' of '-1m' No, doesn't seem to have worked. I used btrfs fi show --raw /.MEDIA to see the size in bytes, and then ... sudo btrfs filesystem resize N:SIZE /.MEDIA/ Where N was the disk number in the show output, and SIZE was the number of bytes to set it to that was a multiple of 4k. Rich
Re: WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]
On 9/19/2017 10:39 PM, Qu Wenruo wrote: In v4.13 kernel, the newly added devices are in fact rounded down. But existing device doesn't get the round down. So it's recommended to resize (shrink) your fs for very small size to fix it if you don't want to wait for the kernel fix. What's the most direct way to do that? (Was about to risk breaking the mirror and repartitioning! I'd rather not!) Hmm -- maybe this worked: $ sudo btrfs filesystem resize -1m /.MEDIA/ Resize '/.MEDIA/' of '-1m' Thanks, Qu So I'm wondering if it's caused by added new btrfs device. Possibly... I've added and removed disks to some of the btrfs fs's.
Re: WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]
For the warning -- is there any way to add in the filesystem/disk causing the issue? I didn't see any identifier in the message that told me which it was.
Re: SSD caching an existing btrfs raid1
On 19/09/17 17:47, Austin S. Hemmelgarn wrote: (...) A better option if you can afford to remove a single device from that array temporarily is to use bcache. Bcache has one specific advantage in this case, multiple backend devices can share the same cache device. This means you don't have to carve out dedicated cache space for each disk on the SSD and leave some unused space so that you can add new devices if needed. The downside is that you can't convert each device in-place, but because you're using BTRFS, you can still convert the volume as a whole in-place. The procedure for doing so looks like this: 1. Format the SSD as a bcache cache. 2. Use `btrfs device delete` to remove a single hard drive from the array. 3. Set up the drive you just removed as a bcache backing device bound to the cache you created in step 1. 4. Add the new bcache device to the array. 5. Repeat from step 2 until the whole array is converted. A similar procedure can actually be used to do almost any underlying storage conversion (for example, switching to whole disk encryption, or adding LVM underneath BTRFS) provided all your data can fit on one less disk than you have. Thanks Austin, that's just great. For some reason I had discarded bcache thinking that it would force me to rebuild from scratch, but this kind of incremental migration is exactly what I hoped was possible. I have plenty of space to replace the devices one by one. I will report back my experience in a few days, I hope.
Re: WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]
On 2017年09月20日 22:14, nborisov wrote: On 2017-09-20 08:23, Qu Wenruo wrote: On 2017年09月20日 14:11, nborisov wrote: On 2017-09-20 07:39, Qu Wenruo wrote: On 2017年09月20日 13:10, Qu Wenruo wrote: On 2017年09月20日 12:59, Qu Wenruo wrote: On 2017年09月20日 12:49, Rich Rauenzahn wrote: On 9/19/2017 5:31 PM, Qu Wenruo wrote: On 2017年09月19日 23:56, Rich Rauenzahn wrote: [ 4.747377] WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs] Is that line the following WARN_ON()? --- static inline void btrfs_set_device_total_bytes(struct extent_buffer *eb, struct btrfs_dev_item *s, u64 val) { BUILD_BUG_ON(sizeof(u64) != sizeof(((struct btrfs_dev_item *)0))->total_bytes); WARN_ON(!IS_ALIGNED(val, eb->fs_info->sectorsize)); <<< btrfs_set_64(eb, s, offsetof(struct btrfs_dev_item, total_bytes), val); } --- If so, that means your devices size is not aligned to 4K. Is your block device still using old 512 block size? AFAIK nowadays most HDDs are using 4K blocksize and it's recommended to use it. It's not a big problem and one can easily remove the WARN_ON(). But I think we'd better fix the caller to do round_down() before calling this function. That's interesting! I believe I made an effort to align them when I set it up years ago, but never knew how to verify. Well, best verifying if that's the line causing the warning, since I don't have the source of RedHat kernel. I have three mirrored filesystems: [snip] Number Start (sector) End (sector) Size Code Name 1 40 3907029134 1.8 TiB 8300 BTRFS MEDIA GPT fdisk (gdisk) version 0.8.6 At least this size is not aligned to 4K. Partition table scan: [snip] .and one is aligned differently! Could it be /dev/sdd that's the issue? But it's aligned at 4096 -- so I'm not sure that's the issue after all. Its start sector is aligned, but end point is not, so the size is not aligned either. BTW, is /dev/sdd added to btrfs using "btrfs device add"? 
In my test, if making btrfs on a unaligned file, it will round down to its sectorsize boundary. Confirmed that "btrfs device add" won't round down the size. Check the btrfs-debug-tree output: -- item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98 devid 1 total_bytes 10737418240 bytes_used 2172649472 io_align 4096 io_width 4096 sector_size 4096 type 0 generation 0 start_offset 0 dev_group 0 seek_speed 0 bandwidth 0 uuid 243a1117-ca31-4d87-8656-81c5630aafb2 fsid 6452cde7-14d5-4541-aa07-b265a400bad0 item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98 devid 2 total_bytes 1073742336 bytes_used 0 io_align 4096 io_width 4096 sector_size 4096 type 0 generation 0 start_offset 0 dev_group 0 seek_speed 0 bandwidth 0 uuid 6bb07260-d230-4e22-88b1-1eabb46622ed fsid 6452cde7-14d5-4541-aa07-b265a400bad0 -- Sorry, the output is from v4.12.x, so no kernel warning nor the patch rounding down the value. Where first device is completely aligned, the 2nd device which is just 1G + 512, definitely not aligned. So if you're using single device purely created by mkfs.btrfs, you're OK. But if any new device added, you're not OK and causing the false alert. Any way, it should not be hard to fix. Just remove the WARN_ON() and add extra round_down when adding device. In v4.13 kernel, the newly added devices are in fact rounded down. But existing device doesn't get the round down. We got a report internally at Suse of this problem and it prevented a filesystem from being mounted due to the following check failing: http://elixir.free-electrons.com/linux/latest/source/fs/btrfs/volumes.c#L6893. Hence I added the rounding down fixes. And this warning was put specifically to catch future offenders and see if I had missed a place to patch it. So removing the warning is the wrong solution to the problem. Then at least only enable it for BTRFS_DEBUG. 
No, the idea is that if a bug in the code causes such a "corruption" we ought to be able to catch it in the first instance and not post factum. If a bug in btrfs is introduced in a call path which invokes btrfs_update_device with misaligned values we won't see which was the culprit. If one then enables BTRFS_DEBUG and starts seeing the warnings it will likely yield no useful information since the value will already be corrupted. I totally understand the fix and the introduced WARN_ON(). But the truth is, the fix (along with Liu Bo's validation checker) doesn't consider existing filesystems with unaligned bytenr. No user would be happy to see tons of kernel backtraces just because they updated their kernel. For end users it's just confusing. I have submitted a
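The round-down being discussed is simple to illustrate. A minimal sketch (not the kernel code; the kernel's round_down() macro does the equivalent for power-of-two alignments), using the 1 GiB + 512 B device size from the btrfs-debug-tree output above:

```python
# Sketch of the alignment check and fix discussed above; the kernel's
# round_down() macro does the same thing for power-of-two alignments.
SECTORSIZE = 4096

def round_down(val, align):
    # equivalent to the kernel's round_down() for a power-of-two align
    return val & ~(align - 1)

size = 1073742336                    # 1 GiB + 512 bytes, as in the dev_item dump
print(size % SECTORSIZE)             # 512 -> IS_ALIGNED fails, WARN_ON fires
print(round_down(size, SECTORSIZE))  # 1073741824 -> aligned total_bytes
```

Any caller of btrfs_update_device that rounds the size down this way would pass the IS_ALIGNED check in btrfs_set_device_total_bytes.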
Re: btrfs-progs: suggestion of removing --commit-after option of subvol delete
On 2017-09-20 22:03, David Sterba wrote: On Wed, Sep 20, 2017 at 08:22:54AM +0800, Qu Wenruo wrote: The costly part will be tracking the filesystems of subvolumes. We must do it for each subvolume and batch them to address the final transaction commit for each fs. I didn't see any easy ioctl to get the UUID from fd, meaning (if I didn't miss anything) we need to iterate the path until reaching the mount boundary and then refer to mountinfo to find the fs. The FS_INFO ioctl will tell you the fsid from a fd. Iterating the paths will not work reliably, due to some potentially very creative use of mounts that could build up the path components. The biggest concern is about how to get the fsid reliably. And since the FS_INFO ioctl solves this, I'm completely fine. (I just searched "UUID" among the ioctls and got no hit, so I assumed it might be a problem.) Not to mention that a possible softlink in the path may make things more complex. Yes, this may be fixed with tons of code, but I don't think the complexity is worth it for a minor feature. So what if we fix it like that: - iterate over subvolumes - check if the fsid is same as for the previous subvolume - if it is, continue - if not, do the sync I prefer to maintain a subvolume <-> fs mapping, and only address transaction commit after all subvolumes are iterated. IOW, sync when we leave subvolumes of one filesystem. In the degenerate case, we can have subvolumes interleaved that we'd sync after each, effectively --commit-each. Since we have a subvolume <-> fs mapping, we can commit each fs after iterating all subvolumes. AFAIK a "--commit-after" user would like minimal commits, so committing each fs only once seems better to me. The simple tracking should avoid keeping all the filedescriptors open and I think this won't need tons of code to implement. Indeed, since there is a reliable fs UUID ioctl. And the above subvolume <-> fs mapping can also be used to resolve missing fd problems. 
Just add a fd <-> fs mapping, and try to call the commit ioctl on the first good fd. The typical case could be 'btrfs subvol delete -c path/*', ie. deleting from just one filesystem. If we happen to cross more filesystems, each of them will get the sync, but possibly more than one. I don't think this is too serious. Well, after we implement the above use case, some of that facility can be reused for this case. Thanks, Qu -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
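As a sketch of the batching Qu describes (one transaction commit per filesystem, after all subvolumes are iterated), the tracking is just a fsid -> subvolumes map. Here `get_fsid` is a hypothetical stand-in: real code would open the path and issue the FS_INFO ioctl (BTRFS_IOC_FS_INFO) that David mentions to read the fsid from the fd, rather than guessing from a path prefix.

```python
from collections import defaultdict

def get_fsid(path):
    # Hypothetical stub: real code would open(path) and issue the
    # BTRFS_IOC_FS_INFO ioctl to obtain the fsid from the fd.
    return "fs-A" if path.startswith("/mnt/a/") else "fs-B"

subvols = ["/mnt/a/sub1", "/mnt/b/sub1", "/mnt/a/sub2"]  # illustrative paths

by_fs = defaultdict(list)
for sub in subvols:
    by_fs[get_fsid(sub)].append(sub)   # the subvolume delete would happen here

# one commit per filesystem, even when subvolumes are interleaved
for fsid, subs in sorted(by_fs.items()):
    print(f"commit {fsid}: {len(subs)} subvolume(s)")
```

With interleaved subvolumes this still commits each filesystem exactly once, which is the behaviour a --commit-after user wants.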
Re: WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]
On 2017-09-20 08:23, Qu Wenruo wrote: On 2017年09月20日 14:11, nborisov wrote: On 2017-09-20 07:39, Qu Wenruo wrote: On 2017年09月20日 13:10, Qu Wenruo wrote: On 2017年09月20日 12:59, Qu Wenruo wrote: On 2017年09月20日 12:49, Rich Rauenzahn wrote: On 9/19/2017 5:31 PM, Qu Wenruo wrote: On 2017年09月19日 23:56, Rich Rauenzahn wrote: [ 4.747377] WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs] Is that line the following WARN_ON()? --- static inline void btrfs_set_device_total_bytes(struct extent_buffer *eb, struct btrfs_dev_item *s, u64 val) { BUILD_BUG_ON(sizeof(u64) != sizeof(((struct btrfs_dev_item *)0))->total_bytes); WARN_ON(!IS_ALIGNED(val, eb->fs_info->sectorsize)); <<< btrfs_set_64(eb, s, offsetof(struct btrfs_dev_item, total_bytes), val); } --- If so, that means your devices size is not aligned to 4K. Is your block device still using old 512 block size? AFAIK nowadays most HDDs are using 4K blocksize and it's recommended to use it. It's not a big problem and one can easily remove the WARN_ON(). But I think we'd better fix the caller to do round_down() before calling this function. That's interesting! I believe I made an effort to align them when I set it up years ago, but never knew how to verify. Well, best verifying if that's the line causing the warning, since I don't have the source of RedHat kernel. I have three mirrored filesystems: [snip] Number Start (sector) End (sector) Size Code Name 1 40 3907029134 1.8 TiB 8300 BTRFS MEDIA GPT fdisk (gdisk) version 0.8.6 At least this size is not aligned to 4K. Partition table scan: [snip] .and one is aligned differently! Could it be /dev/sdd that's the issue? But it's aligned at 4096 -- so I'm not sure that's the issue after all. Its start sector is aligned, but end point is not, so the size is not aligned either. BTW, is /dev/sdd added to btrfs using "btrfs device add"? In my test, if making btrfs on a unaligned file, it will round down to its sectorsize boundary. 
Confirmed that "btrfs device add" won't round down the size. Check the btrfs-debug-tree output: -- item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98 devid 1 total_bytes 10737418240 bytes_used 2172649472 io_align 4096 io_width 4096 sector_size 4096 type 0 generation 0 start_offset 0 dev_group 0 seek_speed 0 bandwidth 0 uuid 243a1117-ca31-4d87-8656-81c5630aafb2 fsid 6452cde7-14d5-4541-aa07-b265a400bad0 item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98 devid 2 total_bytes 1073742336 bytes_used 0 io_align 4096 io_width 4096 sector_size 4096 type 0 generation 0 start_offset 0 dev_group 0 seek_speed 0 bandwidth 0 uuid 6bb07260-d230-4e22-88b1-1eabb46622ed fsid 6452cde7-14d5-4541-aa07-b265a400bad0 -- Sorry, the output is from v4.12.x, so no kernel warning nor the patch rounding down the value. Where first device is completely aligned, the 2nd device which is just 1G + 512, definitely not aligned. So if you're using single device purely created by mkfs.btrfs, you're OK. But if any new device added, you're not OK and causing the false alert. Any way, it should not be hard to fix. Just remove the WARN_ON() and add extra round_down when adding device. In v4.13 kernel, the newly added devices are in fact rounded down. But existing device doesn't get the round down. We got a report internally at Suse of this problem and it prevented a filesystem from being mounted due to the following check failing: http://elixir.free-electrons.com/linux/latest/source/fs/btrfs/volumes.c#L6893. Hence I added the rounding down fixes. And this warning was put specifically to catch future offenders and see if I had missed a place to patch it. So removing the warning is the wrong solution to the problem. Then at least only enable it for BTRFS_DEBUG. No, the idea is that if a bug in the code causes such a "corruption" we ought to be able to catch it in the first instance and not post factum. 
If a bug in btrfs is introduced in a call path which invokes btrfs_update_device with misaligned values we won't see which was the culprit. If one then enables BTRFS_DEBUG and starts seeing the warnings it will likely yield no useful information since the value will already be corrupted. For end users it's just confusing. I have submitted a patch to do the check at mount time, warning the user to run a shrink to fix it. (Although it still removed the WARN_ON.) I think such a warning breaks backward compatibility; we should be as gentle as possible with end users. How exactly is it breaking backward compatibility? Thanks, Qu Generally if balancing kicked
Re: btrfs-progs: suggestion of removing --commit-after option of subvol delete
On Wed, Sep 20, 2017 at 08:22:54AM +0800, Qu Wenruo wrote: > The costly part will be tracking the filesystems of subvolumes. > We must do it for each subvolume and batch them to address the final > transaction commit for each fs. > > I didn't see any easy ioctl to get the UUID from fd, meaning (if didn't > miss anything) we need to iterate the path until reaching the mount > boundary and then refer to mountinfo to find the fs. The FS_INFO ioctl will tell you the fsid from a fd. Iterating the paths will not work reliably, due to some potentially very creative use of mounts that could build up the path components. > Not to mention that the possible softlink in the path may make things > more complex. > > Yes, this may be fixed with tons of code, but I don't think the > complexity worthy for a minor feature. So what if we fix it like that: - iterate over subvolumes - check if the fsid is same as for the previous subvolume - if it is, continue - if not, do the sync IOW, sync when we leave subvolumes of one filesystem. In the degenerate case, we can have subvolumes interleaved that we'd sync after each, effectively --commit-each. The simple tracking should avoid keeping all the filedescriptors open and I think this won't need tons of code to implement. The typical could be 'btrfs subvol delete -c path/*', ie. deleting from just one filesystem. If we happen to cross more filesystems, each of them will get the sync, but possibly more than one. I don't think this is too serious.
Re: kernel BUG at fs/btrfs/extent_io.c:1989
On Tue, Sep 19, 2017 at 10:12:39AM -0600, Liu Bo wrote: > On Tue, Sep 19, 2017 at 05:07:25PM +0200, David Sterba wrote: > > On Tue, Sep 19, 2017 at 11:32:46AM +, Paul Jones wrote: > > > > This 'mirror 0' looks fishy, (as mirror comes from > > > > btrfs_io_bio->mirror_num, > > > > which should be at least 1 if raid1 setup is in use.) > > > > > > > > Not sure if 4.13.2-gentoo made any changes on btrfs, but can you please > > > > verify with the upstream kernel, say, v4.13? > > > > > > It's basically a vanilla kernel with a handful of unrelated patches. > > > The filesystem fell apart overnight, there were a few thousand > > > checksum errors and eventually it went read-only. I tried to remount > > > it, but got open_ctree failed. Btrfs check segfaulted, lowmem mode > > > completed with so many errors I gave up and will restore from the > > > backup. > > > > > > I think I know the problem now - the lvm cache was in writeback mode > > > (by accident) so during a defrag there would be gigabytes of unwritten > > > data in memory only, which was all lost when the system crashed > > > (motherboard failure). No wonder the filesystem didn't quite survive. > > > > Yeah, the caching layer was my first suspicion, and lack of propagating > > of the barriers. Good that you were able to confirm that as the root cause. > > > > > I must say though, I'm seriously impressed at the data integrity of > > > BTRFS - there were near 10,000 checksum errors, 4 which were > > > uncorrectable, and from what I could tell nearly all of the data was > > > still intact according to rsync checksums. > > > > Yay! > > But still don't get why mirror_num is 0, do you have an idea on how > does writeback cache make that? My first idea was that the cached blocks were zeroed, so we'd see the ino and mirror as 0. But this is not correct as the blocks would not pass the checksum tests, so the blocks must be from some previous generation. Ie. the transid verify failure. 
And all the error reports appear after that so I'm slightly suspicious about the way it's actually reported. btrfs_print_data_csum_error takes mirror from either io_bio or compressed_bio structures, so there might be a case when the structures are initialized. If the transid check is ok, then the structures are updated. If the check fails we'd see the initial mirror number. All of that is just a hypothesis, I haven't checked with the code. I don't have a theoretical explanation for the ino 0. The inode pointer that goes to btrfs_print_data_csum_error should be from a properly initialized inode and we print the number using btrfs_ino. That will use the vfs i_ino value and we should never get 0 out of that.
Re: [PATCH v2] fs:btrfs: return -ENOMEM on allocation failure.
On 09/20/2017 02:17 PM, Allen Pais wrote: v2: return -ENOMEM from btrfsic_dev_state_alloc() too. Signed-off-by: Allen Pais Reviewed-by: Anand Jain --- fs/btrfs/check-integrity.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c index 7d5a9b5..9db1e76 100644 --- a/fs/btrfs/check-integrity.c +++ b/fs/btrfs/check-integrity.c @@ -2913,7 +2913,7 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info, state = kvzalloc(sizeof(*state), GFP_KERNEL); if (!state) { pr_info("btrfs check-integrity: allocation failed!\n"); - return -1; + return -ENOMEM; } if (!btrfsic_is_initialized) { @@ -2945,7 +2945,7 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info, if (NULL == ds) { pr_info("btrfs check-integrity: kmalloc() failed!\n"); mutex_unlock(&btrfsic_mutex); - return -1; + return -ENOMEM; } ds->bdev = device->bdev; ds->state = state;
Re: defragmenting best practice?
On 2017-09-20 02:38, Dave wrote: On Thu 2017-08-31 (09:05), Ulli Horlacher wrote: When I do a btrfs filesystem defragment -r /directory does it defragment really all files in this directory tree, even if it contains subvolumes? The man page does not mention subvolumes on this topic. No answer so far :-( But I found another problem in the man-page: Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4 will break up the ref-links of COW data (for example files copied with cp --reflink, snapshots or de-duplicated data). This may cause considerable increase of space usage depending on the broken up ref-links. I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several snapshots. Therefore, I better should avoid calling "btrfs filesystem defragment -r"? What is the defragmenting best practice? Avoid it completly? My question is the same as the OP in this thread, so I came here to read the answers before asking. However, it turns out that I still need to ask something. Should I ask here or start a new thread? (I'll assume here, since the topic is the same.) Based on the answers here, it sounds like I should not run defrag at all. However, I have a performance problem I need to solve, so if I don't defrag, I need to do something else. Here's my scenario. Some months ago I built an over-the-top powerful desktop computer / workstation and I was looking forward to really fantastic performance improvements over my 6 year old Ubuntu machine. I installed Arch Linux on BTRFS on the new computer (on an SSD). To my shock, it was no faster than my old machine. I focused a lot on Firefox performance because I use Firefox a lot and that was one of the applications in which I was most looking forward to better performance. I tried everything I could think of and everything recommended to me in various forums (except switching to Windows) and the performance remained very disappointing. 
Switching to Windows won't help any more than switching to ext4 would. If you were running Chrome, it might (Chrome actually has better performance on Windows than Linux by a small margin last time I checked), but Firefox gets pretty much the same performance on both platforms. Then today I read the following: Gotchas - btrfs Wiki https://btrfs.wiki.kernel.org/index.php/Gotchas Fragmentation: Files with a lot of random writes can become heavily fragmented (10000+ extents) causing excessive multi-second spikes of CPU load on systems with an SSD or large amount of RAM. On desktops this primarily affects application databases (including Firefox). Workarounds include manually defragmenting your home directory using btrfs fi defragment. Auto-defragment (mount option autodefrag) should solve this problem. Upon reading that I am wondering if fragmentation in the Firefox profile is part of my issue. That's one thing I never tested previously. (BTW, this system has 256 GB of RAM and 20 cores.) Almost certainly. Most modern web browsers are brain-dead and insist on using SQLite databases (or traditional DB files) for everything, including the cache, and the usage for the cache in particular kills performance when fragmentation is an issue. Furthermore, on the same BTRFS Wiki page, it mentions the performance penalties of many snapshots. I am keeping 30 to 50 snapshots of the volume that contains the Firefox profile. Would these two things be enough to turn top-of-the-line hardware into a mediocre-performing desktop system? (The system performs fine on benchmarks -- it's real life usage, particularly with Firefox, where it is disappointing.) Even ignoring fragmentation and reflink issues (it's reflinks, not snapshots, that are the issue; snapshots just have tons of reflinks), BTRFS is slower than ext4 or XFS simply because of the fact that it's doing way more work. The difference should have limited impact on an SSD if you get a handle on the other issues though. 
After reading the info here, I am wondering if I should make a new subvolume just for my Firefox profile(s) and not use COW and/or not keep snapshots on it and mount it with the autodefrag option. As part of this strategy, I could send snapshots to another disk using btrfs send-receive. That way I would have the benefits of snapshots (which are important to me), but by not keeping any snapshots on the live subvolume I could avoid the performance problems. What would you guys do in this situation? Personally? Use Chrome or Chromium and turn on the simple cache backend (chrome://flags/#enable-simple-cache-backend) which doesn't have issues with fragmentation because it doesn't use a database file to store the cache and lets the filesystem handle the allocations. The difference in performance in Chrome itself from flipping this switch is pretty amazing to be honest. They're also faster than Firefox in general in my experience, but
Re: using fio to test btrfs compression
On Wed, Sep 20, 2017 at 4:00 PM, Timofey Titovetswrote: > 2017-09-20 13:13 GMT+03:00 shally verma : >> On Wed, Sep 20, 2017 at 3:06 PM, Timofey Titovets >> wrote: >>> 2017-09-20 11:59 GMT+03:00 shally verma : On Wed, Sep 20, 2017 at 2:25 PM, Timofey Titovets wrote: > 2017-09-20 11:44 GMT+03:00 shally verma : >> One more catch... I am initiating fio from non-btrfs filesystem i.e. >> pwd is ext4 based fs where as mount point is btrfs. >> Could that make difference? >> >>> Thanks >>> Shally > > Only matter are where you store test file =) > If you store test file on btrfs, pwd does nothing. > Then steps listed in previous mail should work right? Am listing them again here: " > > 1. mount -t btrfs -o compress-force=zlib /dev/sdb1 mnt > > 2. fio --directory=mnt/ --numjobs=1 --direct=0 --buffered=1 --bs=64k > --rw=write --iodepth=128 --name=test --size=1G > --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer > --ioengine=libaio > > 1GN file written uncompressed. Here no compression invoked (though > compress-force=zlib) > > 3. cp mnt/test ./ --> copy back fio generated test file from btrfs > mount point to local drive > > 4. hex dump test file (all FFs) -- confirmed that data is compressible > no random data. > > 5. cp test mnt/ --> now, copy same test again back to mount point > (reverse of step 3) . Now, here I see during copying compression is > invoked. > > I am using kernel 4.9 and compress-foce is said to be working for > kernel > 2.13 from wiki ... so I wonder what's so special with cp > command which is not happening during fio writes??? "- Thanks Shally > -- > Have a nice day, > Timofey. >>> >>> I did try reproduce your problem on 4.14, i'm hard to explain that >>> happens and why file created by FIO will not use compression. >>> >>> Suggest workaround: >>> rm -v /mnt/test.0.0 >>> dd if=/dev/zero of=/mnt/test.0.0 bs=1M count=1024 >> >> You mean "compression" using "dd" write. 
>> >>> fio --directory=$HOME/test/ --numjobs=1 --direct=1 --buffered=0 >>> --bs=64k --rw=write --iodepth=128 --name=test --size=1G >>> --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer >>> --ioengine=libaio >> >> You mean "no compression" during fio writes and you see similar behavior as >> I? >> BTW here --direct=0 and --buffered = 1. >> >>> >>> You can check if data compressed by filefrag -v /mnt/test.0.0 (you >>> will see encoded extents) >>> >> if I do filefrag on mnt/test.0.0, then it shows up like this: >> ext: logical_offset:physical_offset: length: expected: flags: >>0:0.. 1:1329922.. 1329923: 2: >>1:2.. 3:1329923.. 1329924: 2:1329924: >>2:4.. 5:1329924.. 1329925: 2:1329925: >> >> An new to this, how does it says extents are encoded here? >> >> Thanks >> Shally >> >>> -- >>> Have a nice day, >>> Timofey. > > Yes, i was reproduce your problem, new file created by fio not compressed. > So i write you about workaround, because i don't want to deep dig in > that problem. > > Example of encoded: > ~ dd if=/dev/zero of=./test.0.0 bs=1M count=1M count=1; sync > ~ filefrag -v test.0.0 > Filesystem type is: 9123683e > File size of test.0.0 is 1048576 (256 blocks of 4096 bytes) > ext: logical_offset:physical_offset: length: expected: flags: > 0:0.. 31: 72779979.. 72780010: 32: encoded > 1: 32.. 63: 72751509.. 72751540: 32: 72780011: encoded > 2: 64.. 95: 72755246.. 72755277: 32: 72751541: encoded > 3: 96.. 127: 72755610.. 72755641: 32: 72755278: encoded > 4: 128.. 159: 72755631.. 72755662: 32: 72755642: encoded > 5: 160.. 191: 72755722.. 72755753: 32: 72755663: encoded > 6: 192.. 223: 72757862.. 72757893: 32: 72755754: encoded > 7: 224.. 255: 72769455.. 72769486: 32: 72757894: > last,encoded,eof > test.0.0: 8 extents found > Interesting part is I dont see "encoded" under flags. I couldn't understand if flags are retrieved from btrfs file metadata info. 
Since you are running on 4.14 and I am on 4.9, I am still in doubt whether files are getting compressed even with dd. What file size is shown if you run btrfs fi du /mnt/test.0.0 -- is it less than the actual size? Is there any command I can run to confirm the file has been compressed? So far, I had my prints enabled in kernel fs/btrfs/compression.c and checked in dmesg that the code jumped to the compress_page() func. Thanks Shally > -- > Have a nice day, > Timofey.
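On the side question of whether fio's data pattern is compressible at all: a buffer of repeated 0xFF (what --buffer_pattern=0xFF --buffer_compress_percentage=100 produces) compresses almost completely under zlib, so any missing compression is in the write path, not the data. A quick stdlib check:

```python
import zlib

# fio's --buffer_pattern=0xFF --buffer_compress_percentage=100 produces a
# buffer like this; zlib squeezes 64 KiB of it to well under 1 KiB.
buf = b"\xff" * (64 * 1024)
packed = zlib.compress(buf, 6)
print(len(buf), len(packed))   # compressed size is well under 1 KiB
```

This is only a sanity check on the data, of course; whether btrfs actually stored a file compressed is still best verified from the extent flags (encoded extents in filefrag -v output).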
Re: using fio to test btrfs compression
2017-09-20 13:13 GMT+03:00 shally verma: > On Wed, Sep 20, 2017 at 3:06 PM, Timofey Titovets > wrote: >> 2017-09-20 11:59 GMT+03:00 shally verma : >>> On Wed, Sep 20, 2017 at 2:25 PM, Timofey Titovets >>> wrote: 2017-09-20 11:44 GMT+03:00 shally verma : > One more catch... I am initiating fio from non-btrfs filesystem i.e. > pwd is ext4 based fs where as mount point is btrfs. > Could that make difference? > >> Thanks >> Shally Only matter are where you store test file =) If you store test file on btrfs, pwd does nothing. >>> >>> Then steps listed in previous mail should work right? Am listing them >>> again here: >>> >>> " 1. mount -t btrfs -o compress-force=zlib /dev/sdb1 mnt 2. fio --directory=mnt/ --numjobs=1 --direct=0 --buffered=1 --bs=64k --rw=write --iodepth=128 --name=test --size=1G --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer --ioengine=libaio 1GN file written uncompressed. Here no compression invoked (though compress-force=zlib) 3. cp mnt/test ./ --> copy back fio generated test file from btrfs mount point to local drive 4. hex dump test file (all FFs) -- confirmed that data is compressible no random data. 5. cp test mnt/ --> now, copy same test again back to mount point (reverse of step 3) . Now, here I see during copying compression is invoked. I am using kernel 4.9 and compress-foce is said to be working for kernel > 2.13 from wiki ... so I wonder what's so special with cp command which is not happening during fio writes??? >>> >>> "- >>> >>> Thanks >>> Shally >>> >>> -- Have a nice day, Timofey. >> >> I did try reproduce your problem on 4.14, i'm hard to explain that >> happens and why file created by FIO will not use compression. >> >> Suggest workaround: >> rm -v /mnt/test.0.0 >> dd if=/dev/zero of=/mnt/test.0.0 bs=1M count=1024 > > You mean "compression" using "dd" write. 
> >> fio --directory=$HOME/test/ --numjobs=1 --direct=1 --buffered=0 >> --bs=64k --rw=write --iodepth=128 --name=test --size=1G >> --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer >> --ioengine=libaio > > You mean "no compression" during fio writes and you see similar behavior as I? > BTW here --direct=0 and --buffered = 1. > >> >> You can check if data compressed by filefrag -v /mnt/test.0.0 (you >> will see encoded extents) >> > if I do filefrag on mnt/test.0.0, then it shows up like this: > ext: logical_offset:physical_offset: length: expected: flags: >0:0.. 1:1329922.. 1329923: 2: >1:2.. 3:1329923.. 1329924: 2:1329924: >2:4.. 5:1329924.. 1329925: 2:1329925: > > An new to this, how does it says extents are encoded here? > > Thanks > Shally > >> -- >> Have a nice day, >> Timofey. Yes, i was reproduce your problem, new file created by fio not compressed. So i write you about workaround, because i don't want to deep dig in that problem. Example of encoded: ~ dd if=/dev/zero of=./test.0.0 bs=1M count=1M count=1; sync ~ filefrag -v test.0.0 Filesystem type is: 9123683e File size of test.0.0 is 1048576 (256 blocks of 4096 bytes) ext: logical_offset:physical_offset: length: expected: flags: 0:0.. 31: 72779979.. 72780010: 32: encoded 1: 32.. 63: 72751509.. 72751540: 32: 72780011: encoded 2: 64.. 95: 72755246.. 72755277: 32: 72751541: encoded 3: 96.. 127: 72755610.. 72755641: 32: 72755278: encoded 4: 128.. 159: 72755631.. 72755662: 32: 72755642: encoded 5: 160.. 191: 72755722.. 72755753: 32: 72755663: encoded 6: 192.. 223: 72757862.. 72757893: 32: 72755754: encoded 7: 224.. 255: 72769455.. 72769486: 32: 72757894: last,encoded,eof test.0.0: 8 extents found -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: using fio to test btrfs compression
On Wed, Sep 20, 2017 at 3:06 PM, Timofey Titovetswrote: > 2017-09-20 11:59 GMT+03:00 shally verma : >> On Wed, Sep 20, 2017 at 2:25 PM, Timofey Titovets >> wrote: >>> 2017-09-20 11:44 GMT+03:00 shally verma : One more catch... I am initiating fio from non-btrfs filesystem i.e. pwd is ext4 based fs where as mount point is btrfs. Could that make difference? > Thanks > Shally >>> >>> Only matter are where you store test file =) >>> If you store test file on btrfs, pwd does nothing. >>> >> >> Then steps listed in previous mail should work right? Am listing them >> again here: >> >> " >>> >>> 1. mount -t btrfs -o compress-force=zlib /dev/sdb1 mnt >>> >>> 2. fio --directory=mnt/ --numjobs=1 --direct=0 --buffered=1 --bs=64k >>> --rw=write --iodepth=128 --name=test --size=1G >>> --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer >>> --ioengine=libaio >>> >>> 1GN file written uncompressed. Here no compression invoked (though >>> compress-force=zlib) >>> >>> 3. cp mnt/test ./ --> copy back fio generated test file from btrfs >>> mount point to local drive >>> >>> 4. hex dump test file (all FFs) -- confirmed that data is compressible >>> no random data. >>> >>> 5. cp test mnt/ --> now, copy same test again back to mount point >>> (reverse of step 3) . Now, here I see during copying compression is >>> invoked. >>> >>> I am using kernel 4.9 and compress-foce is said to be working for >>> kernel > 2.13 from wiki ... so I wonder what's so special with cp >>> command which is not happening during fio writes??? >> >> "- >> >> Thanks >> Shally >> >> >>> -- >>> Have a nice day, >>> Timofey. > > I did try reproduce your problem on 4.14, i'm hard to explain that > happens and why file created by FIO will not use compression. > > Suggest workaround: > rm -v /mnt/test.0.0 > dd if=/dev/zero of=/mnt/test.0.0 bs=1M count=1024 You mean "compression" using "dd" write. 
> fio --directory=$HOME/test/ --numjobs=1 --direct=1 --buffered=0 > --bs=64k --rw=write --iodepth=128 --name=test --size=1G > --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer > --ioengine=libaio You mean "no compression" during fio writes and you see similar behavior as I? BTW here --direct=0 and --buffered = 1. > > You can check if data compressed by filefrag -v /mnt/test.0.0 (you > will see encoded extents) > if I do filefrag on mnt/test.0.0, then it shows up like this: ext: logical_offset:physical_offset: length: expected: flags: 0:0.. 1:1329922.. 1329923: 2: 1:2.. 3:1329923.. 1329924: 2:1329924: 2:4.. 5:1329924.. 1329925: 2:1329925: An new to this, how does it says extents are encoded here? Thanks Shally > -- > Have a nice day, > Timofey.
Re: using fio to test btrfs compression
2017-09-20 11:59 GMT+03:00 shally verma: > On Wed, Sep 20, 2017 at 2:25 PM, Timofey Titovets > wrote: >> 2017-09-20 11:44 GMT+03:00 shally verma : >>> One more catch... I am initiating fio from non-btrfs filesystem i.e. >>> pwd is ext4 based fs where as mount point is btrfs. >>> Could that make difference? >>> Thanks Shally >> >> Only matter are where you store test file =) >> If you store test file on btrfs, pwd does nothing. >> > > Then steps listed in previous mail should work right? Am listing them > again here: > > " >> >> 1. mount -t btrfs -o compress-force=zlib /dev/sdb1 mnt >> >> 2. fio --directory=mnt/ --numjobs=1 --direct=0 --buffered=1 --bs=64k >> --rw=write --iodepth=128 --name=test --size=1G >> --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer >> --ioengine=libaio >> >> 1GN file written uncompressed. Here no compression invoked (though >> compress-force=zlib) >> >> 3. cp mnt/test ./ --> copy back fio generated test file from btrfs >> mount point to local drive >> >> 4. hex dump test file (all FFs) -- confirmed that data is compressible >> no random data. >> >> 5. cp test mnt/ --> now, copy same test again back to mount point >> (reverse of step 3) . Now, here I see during copying compression is >> invoked. >> >> I am using kernel 4.9 and compress-foce is said to be working for >> kernel > 2.13 from wiki ... so I wonder what's so special with cp >> command which is not happening during fio writes??? > > "- > > Thanks > Shally > > >> -- >> Have a nice day, >> Timofey. I did try reproduce your problem on 4.14, i'm hard to explain that happens and why file created by FIO will not use compression. 
Suggested workaround:

rm -v /mnt/test.0.0
dd if=/dev/zero of=/mnt/test.0.0 bs=1M count=1024
fio --directory=$HOME/test/ --numjobs=1 --direct=1 --buffered=0 \
    --bs=64k --rw=write --iodepth=128 --name=test --size=1G \
    --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer \
    --ioengine=libaio

You can check if the data is compressed by filefrag -v /mnt/test.0.0 (you
will see encoded extents).

--
Have a nice day,
Timofey.
RFC: crc32c vs xxhash32/64 (ARM)
Hi,

About 3 years ago Liu Bo tried to add xxhash support to btrfs [1].
For now (kernel 4.14), the kernel has a native library for xxhash.

The main reasons not to add that support were:
- CRC32C has hardware support and is amazingly fast
- CRC32C can fix a 1-bit error (yes, that is possible, but it's not used in
  btrfs and I find it hard to find an example of such correction in the
  kernel, so that's not a strong reason)

xxhash32/64 does not have HW support but should outperform CRC32C in SW mode.

I'm not sure about ARM boxes; as an example, some NAS have HW crc32c
([2] shows that CRC32 can have HW support on some types of ARMv8, and [3]
says that for ARMv8.1-A the CRC HW must be present).

Maybe it makes sense to add support for both xxhash32/64, do some type of
testing on module load like raid6 does, and use the fastest function for
newly written data?

Thanks!

P.S. Maybe someone has an ARMv8 box? Can you check which module is used by
btrfs:
~ dmesg | grep crc
[0.666469] Btrfs loaded, crc32c=crc32c-intel

Maybe you want/can test xxhash32/64 on ARM; I backported the kernel version
in [4]. (But [4] doesn't have HW CRC32 support; I find it hard to find an
example userspace implementation to borrow on github, and I don't have an
ARM box.)

Links:
1. http://linux-btrfs.vger.kernel.narkive.com/pO9lOshJ/patch-2-3-crypto-xxhash-add-tests
2. https://github.com/torvalds/linux/blob/5518b69b76680a4f2df96b1deca260059db0c2de/arch/arm/crypto/crc32-ce-glue.c
3. https://en.wikipedia.org/wiki/ARM_architecture#ARMv8.1-A
4. https://github.com/Nefelim4ag/CRC32C_AND_8Byte_parity

--
Have a nice day,
Timofey.
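The "benchmark at module load, pick the fastest" idea (like raid6 does) can be sketched in userspace. A rough illustration: it uses Python's zlib.crc32 purely as a stand-in software checksum — note that btrfs actually uses CRC32C (the Castagnoli polynomial), which is a different polynomial from zlib's CRC32 — and the third-party `xxhash` module is an optional assumption, skipped if absent:

```python
import time
import zlib

def throughput(func, buf, rounds=50):
    """Rough MB/s estimate for one checksum function over `buf`."""
    start = time.perf_counter()
    for _ in range(rounds):
        func(buf)
    elapsed = time.perf_counter() - start
    return (len(buf) * rounds / (1024 * 1024)) / max(elapsed, 1e-9)

# Candidate checksums; crc32 here is the zlib/IEEE polynomial, only a
# stand-in for "software checksum" (btrfs uses CRC32C, a different polynomial).
candidates = {"crc32 (zlib, software)": zlib.crc32}
try:
    import xxhash  # third-party module; purely optional for this sketch
    candidates["xxh64"] = lambda b: xxhash.xxh64(b).intdigest()
except ImportError:
    pass

buf = bytes(range(256)) * 4096  # 1 MiB of test data

# "Module load time" selection: benchmark each candidate, keep the fastest.
scores = {name: throughput(fn, buf) for name, fn in candidates.items()}
best = max(scores, key=scores.get)
for name, mbs in scores.items():
    print(f"{name}: {mbs:.0f} MB/s")
print("selected:", best)
```

The kernel equivalent would pick the winner once at module init and use it for all newly written data, which is roughly what the raid6 code does for its recovery algorithms.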
Re: using fio to test btrfs compression
On Wed, Sep 20, 2017 at 2:25 PM, Timofey Titovets wrote:
> 2017-09-20 11:44 GMT+03:00 shally verma:
>> One more catch... I am initiating fio from a non-btrfs filesystem, i.e.
>> pwd is an ext4 based fs whereas the mount point is btrfs.
>> Could that make a difference?
>>
>> Thanks
>> Shally
>
> The only thing that matters is where you store the test file =)
> If you store the test file on btrfs, pwd does nothing.

Then the steps listed in the previous mail should work, right? Am listing
them again here:

"
> 1. mount -t btrfs -o compress-force=zlib /dev/sdb1 mnt
>
> 2. fio --directory=mnt/ --numjobs=1 --direct=0 --buffered=1 --bs=64k
> --rw=write --iodepth=128 --name=test --size=1G
> --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer
> --ioengine=libaio
>
> 1GB file written uncompressed. Here no compression is invoked (though
> compress-force=zlib)
>
> 3. cp mnt/test ./ --> copy back the fio generated test file from the btrfs
> mount point to the local drive
>
> 4. hex dump the test file (all FFs) -- confirmed that the data is
> compressible, no random data.
>
> 5. cp test mnt/ --> now, copy the same test file back to the mount point
> (reverse of step 3). Now, here I see that during copying compression is
> invoked.
>
> I am using kernel 4.9, and compress-force is said to be working for
> kernel > 2.13 from the wiki ... so I wonder what's so special about the cp
> command which is not happening during fio writes???
"

Thanks
Shally

> --
> Have a nice day,
> Timofey.
Re: using fio to test btrfs compression
2017-09-20 11:44 GMT+03:00 shally verma:
> One more catch... I am initiating fio from a non-btrfs filesystem, i.e.
> pwd is an ext4 based fs whereas the mount point is btrfs.
> Could that make a difference?
>
> Thanks
> Shally

The only thing that matters is where you store the test file =)
If you store the test file on btrfs, pwd does nothing.

--
Have a nice day,
Timofey.
Re: using fio to test btrfs compression
On Wed, Sep 20, 2017 at 2:06 PM, shally verma wrote:
> On Mon, Sep 18, 2017 at 7:11 PM, Timofey Titovets wrote:
>> 2017-09-18 16:28 GMT+03:00 shally verma:
>>> On Mon, Sep 18, 2017 at 1:56 PM, Timofey Titovets wrote:
>>>> 2017-09-18 10:36 GMT+03:00 shally verma:
>>>>> Hi
>>>>>
>>>>> I wanted to test btrfs compression using the fio command, but somehow
>>>>> during fio writes I don't see the code taking the route of compressing
>>>>> blocks, whereas if I do a copy to a btrfs compression-enabled mount
>>>>> point then I can easily see the code falling through compression.c.
>>>>>
>>>>> Here's how I do my setup
>>>>>
>>>>> 1. mkfs.btrfs /dev/sdb1
>>>>> 2. mount -t btrfs -o compress=zlib,compress-force /dev/sdb1 /mnt
>>>>> 3. cp /mnt
>>>>> 4. dmesg shows print statements from compression.c and zlib.c
>>>>> confirming the compression routine was invoked during write
>>>>> 5. now, copying back from the btrfs mount point to the home directory
>>>>> also shows the decompress call invocation
>>>>>
>>>>> Now, try the same with fio commands:
>>>>>
>>>>> fio --directory=/mnt/ --numjobs=1 --direct=0 --buffered=1
>>>>> --ioengine=libaio --group_reporting --bs=64k --rw=write --iodepth=128
>>>>> --name=test --size=10G --runtime=180 --time_based
>>>>>
>>>>> But it seems to write uncompressed data.
>>>>>
>>>>> Any help here? What's missing?
>>>>>
>>>>> Thanks
>>>>> Shally
>>>>
>>>> 1. mount -t btrfs -o compress=zlib,compress-force -> compress-force=zlib
>>>> 2. Tune fio to generate compressible data
>>>
>>> How do I "tune" fio to generate data? I had assumed that once compression
>>> is enabled on btrfs, any system fwrite call will simply compress data
>>> into it. Isn't it so?
>>> Can you share a fio command that I can test?
>>> Thanks
>>> Shally
>>>
>>>> --
>>>> Have a nice day,
>>>> Timofey.
>>
>> It is useless to compress incompressible data.
>> Also, you enabled compress, not compress-force,
>> so after the first incompressible write btrfs just stops trying to
>> compress that file.
>>
>> From man fio:
>> buffer_compress_percentage=int
>>     If this is set, then fio will attempt to provide I/O buffer content
>>     (on WRITEs) that compresses to the specified level. Fio does this by
>>     providing a mix of random data and a fixed pattern. The fixed pattern
>>     is either zeros, or the pattern specified by buffer_pattern. If the
>>     pattern option is used, it might skew the compression ratio slightly.
>>     Note that this is per block size unit; for a file/disk wide compression
>>     level that matches this setting, you'll also want to set refill_buffers.
>>
>> buffer_compress_chunk=int
>>     See buffer_compress_percentage. This setting allows fio to manage how
>>     big the ranges of random data and zeroed data are. Without this set,
>>     fio will provide buffer_compress_percentage of blocksize random data,
>>     followed by the remaining zeroed. With this set to some chunk size
>>     smaller than the block size, fio can alternate random and zeroed data
>>     throughout the I/O buffer.
>>
>> Good luck :)
>
> Now I did the following:
>
> 1. mount -t btrfs -o compress-force=zlib /dev/sdb1 mnt
>
> 2. fio --directory=mnt/ --numjobs=1 --direct=0 --buffered=1 --bs=64k
> --rw=write --iodepth=128 --name=test --size=1G
> --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer
> --ioengine=libaio
>
> 1GB file written uncompressed. Here no compression is invoked (though
> compress-force=zlib)
>
> 3. cp mnt/test ./ --> copy back the fio generated test file from the btrfs
> mount point to the local drive
>
> 4. hex dump the test file (all FFs) -- confirmed that the data is
> compressible, no random data.
>
> 5. cp test mnt/ --> now, copy the same test file back to the mount point
> (reverse of step 3). Now, here I see that during copying compression is
> invoked.
>
> I am using kernel 4.9, and compress-force is said to be working for
> kernel > 2.13 from the wiki ... so I wonder what's so special about the cp
> command which is not happening during fio writes???

One more catch... I am initiating fio from a non-btrfs filesystem, i.e.
pwd is an ext4 based fs whereas the mount point is btrfs.
Could that make a difference?

> Thanks
> Shally
>
>> --
>> Have a nice day,
>> Timofey.
Re: using fio to test btrfs compression
On Mon, Sep 18, 2017 at 7:11 PM, Timofey Titovets wrote:
> 2017-09-18 16:28 GMT+03:00 shally verma:
>> On Mon, Sep 18, 2017 at 1:56 PM, Timofey Titovets wrote:
>>> 2017-09-18 10:36 GMT+03:00 shally verma:
>>>> Hi
>>>>
>>>> I wanted to test btrfs compression using the fio command, but somehow
>>>> during fio writes I don't see the code taking the route of compressing
>>>> blocks, whereas if I do a copy to a btrfs compression-enabled mount
>>>> point then I can easily see the code falling through compression.c.
>>>>
>>>> Here's how I do my setup
>>>>
>>>> 1. mkfs.btrfs /dev/sdb1
>>>> 2. mount -t btrfs -o compress=zlib,compress-force /dev/sdb1 /mnt
>>>> 3. cp /mnt
>>>> 4. dmesg shows print statements from compression.c and zlib.c confirming
>>>> the compression routine was invoked during write
>>>> 5. now, copying back from the btrfs mount point to the home directory
>>>> also shows the decompress call invocation
>>>>
>>>> Now, try the same with fio commands:
>>>>
>>>> fio --directory=/mnt/ --numjobs=1 --direct=0 --buffered=1
>>>> --ioengine=libaio --group_reporting --bs=64k --rw=write --iodepth=128
>>>> --name=test --size=10G --runtime=180 --time_based
>>>>
>>>> But it seems to write uncompressed data.
>>>>
>>>> Any help here? What's missing?
>>>>
>>>> Thanks
>>>> Shally
>>>
>>> 1. mount -t btrfs -o compress=zlib,compress-force -> compress-force=zlib
>>> 2. Tune fio to generate compressible data
>>
>> How do I "tune" fio to generate data? I had assumed that once compression
>> is enabled on btrfs, any system fwrite call will simply compress data
>> into it. Isn't it so?
>> Can you share a fio command that I can test?
>> Thanks
>> Shally
>>
>>> --
>>> Have a nice day,
>>> Timofey.
>
> It is useless to compress incompressible data.
> Also, you enabled compress, not compress-force,
> so after the first incompressible write btrfs just stops trying to compress
> that file.
>
> From man fio:
> buffer_compress_percentage=int
>     If this is set, then fio will attempt to provide I/O buffer content
>     (on WRITEs) that compresses to the specified level. Fio does this by
>     providing a mix of random data and a fixed pattern. The fixed pattern
>     is either zeros, or the pattern specified by buffer_pattern. If the
>     pattern option is used, it might skew the compression ratio slightly.
>     Note that this is per block size unit; for a file/disk wide compression
>     level that matches this setting, you'll also want to set refill_buffers.
>
> buffer_compress_chunk=int
>     See buffer_compress_percentage. This setting allows fio to manage how
>     big the ranges of random data and zeroed data are. Without this set,
>     fio will provide buffer_compress_percentage of blocksize random data,
>     followed by the remaining zeroed. With this set to some chunk size
>     smaller than the block size, fio can alternate random and zeroed data
>     throughout the I/O buffer.
>
> Good luck :)

Now I did the following:

1. mount -t btrfs -o compress-force=zlib /dev/sdb1 mnt

2. fio --directory=mnt/ --numjobs=1 --direct=0 --buffered=1 --bs=64k
--rw=write --iodepth=128 --name=test --size=1G
--buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer
--ioengine=libaio

1GB file written uncompressed. Here no compression is invoked (though
compress-force=zlib).

3. cp mnt/test ./ --> copy back the fio generated test file from the btrfs
mount point to the local drive

4. hex dump the test file (all FFs) -- confirmed that the data is
compressible, no random data.

5. cp test mnt/ --> now, copy the same test file back to the mount point
(reverse of step 3). Now, here I see that during copying compression is
invoked.

I am using kernel 4.9, and compress-force is said to be working for
kernel > 2.13 from the wiki ... so I wonder what's so special about the cp
command which is not happening during fio writes???

Thanks
Shally

> --
> Have a nice day,
> Timofey.
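The effect buffer_compress_percentage is meant to produce — and why btrfs (without compress-force) gives up on incompressible data — is easy to see with a plain zlib experiment, independent of fio or the btrfs code. A minimal sketch:

```python
import os
import zlib

BLOCK = 64 * 1024  # matches the fio --bs=64k used in this thread

pattern = b"\xff" * BLOCK        # like --buffer_pattern=0xFF: trivially compressible
random_data = os.urandom(BLOCK)  # fio's default buffer fill: effectively incompressible

def ratio(buf):
    """Compressed size as a fraction of the original size."""
    return len(zlib.compress(buf)) / len(buf)

print(f"0xFF pattern: {ratio(pattern):.4f}")      # far below 1.0 -> worth compressing
print(f"random data:  {ratio(random_data):.4f}")  # ~1.0 or above -> not worth it
```

With plain compress=zlib, a write that compresses like the random buffer here would make btrfs stop trying to compress the file, which is why filling the buffer with a fixed pattern (or using compress-force) matters for this kind of test.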
Re: defragmenting best practice?
I've had a very similar issue with the performance of my laptop dropping to
very low levels, eventually solved by uninstalling Snapper, deleting
snapshots, and then defragmenting the drive. This seems to be a common
concern; I also had it happen on my desktop.

Dmitry

---
Thank you,
Dmitry Kudriavtsev
https://dkudriavtsev.xyz
inexpensivecomputers.net
(ASCII-art Arch Linux logo signature: "Hey, did you hear about that cool new
OS? It's called Arch Linux. I use Arch Linux. Have you ever used Arch Linux?
You should use Arch Linux. Everyone uses Arch! Check out i3wm too!")

September 19 2017 11:38 PM, "Dave" wrote:
>> On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
>>> When I do a
>>> btrfs filesystem defragment -r /directory
>>> does it really defragment all files in this directory tree, even if it
>>> contains subvolumes?
>>> The man page does not mention subvolumes on this topic.
>>
>> No answer so far :-(
>>
>> But I found another problem in the man-page:
>>
>>   Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as
>>   with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4
>>   will break up the ref-links of COW data (for example files copied with
>>   cp --reflink, snapshots or de-duplicated data). This may cause
>>   considerable increase of space usage depending on the broken up
>>   ref-links.
>>
>> I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
>> snapshots.
>> Therefore, I should better avoid calling "btrfs filesystem defragment -r"?
>>
>> What is the defragmenting best practice?
>> Avoid it completely?
>
> My question is the same as the OP's in this thread, so I came here to
> read the answers before asking. However, it turns out that I still
> need to ask something. Should I ask here or start a new thread? (I'll
> assume here, since the topic is the same.)
>
> Based on the answers here, it sounds like I should not run defrag at
> all. However, I have a performance problem I need to solve, so if I
> don't defrag, I need to do something else.
>
> Here's my scenario. Some months ago I built an over-the-top powerful
> desktop computer / workstation and I was looking forward to really
> fantastic performance improvements over my 6 year old Ubuntu machine.
> I installed Arch Linux on BTRFS on the new computer (on an SSD). To my
> shock, it was no faster than my old machine. I focused a lot on
> Firefox performance because I use Firefox a lot and that was one of
> the applications in which I was most looking forward to better
> performance.
>
> I tried everything I could think of and everything recommended to me
> in various forums (except switching to Windows) and the performance
> remained very disappointing.
>
> Then today I read the following:
>
> Gotchas - btrfs Wiki
> https://btrfs.wiki.kernel.org/index.php/Gotchas
>
>   Fragmentation: Files with a lot of random writes can become
>   heavily fragmented (1+ extents) causing excessive multi-second
>   spikes of CPU load on systems with an SSD or large amount of RAM. On
>   desktops this primarily affects application databases (including
>   Firefox). Workarounds include manually defragmenting your home
>   directory using btrfs fi defragment. Auto-defragment (mount option
>   autodefrag) should solve this problem.
>
> Upon reading that I am wondering if fragmentation in the Firefox
> profile is part of my issue. That's one thing I never tested
> previously. (BTW, this system has 256 GB of RAM and 20 cores.)
>
> Furthermore, on the same BTRFS Wiki page, it mentions the performance
> penalties of many snapshots. I am keeping 30 to 50 snapshots of the
> volume that contains the Firefox profile.
>
> Would these two things be enough to turn top-of-the-line hardware into
> a mediocre-performing desktop system? (The system performs fine on
> benchmarks -- it's real life usage, particularly with Firefox, where it
> is disappointing.)
>
> After reading the info here, I am wondering if I should make a new
> subvolume just for my Firefox profile(s) and not use COW and/or not
> keep snapshots on it and mount it with the autodefrag option.
>
> As part of this strategy, I could send snapshots to another disk using
> btrfs send-receive. That way I would have the benefits of snapshots
> (which are important to me), but by not keeping any snapshots on the
> live subvolume I could avoid the performance problems.
>
> What would you guys do in this situation?
Re: defragmenting best practice?
> On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
>> When I do a
>> btrfs filesystem defragment -r /directory
>> does it really defragment all files in this directory tree, even if it
>> contains subvolumes?
>> The man page does not mention subvolumes on this topic.
>
> No answer so far :-(
>
> But I found another problem in the man-page:
>
>   Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as
>   with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4
>   will break up the ref-links of COW data (for example files copied with
>   cp --reflink, snapshots or de-duplicated data). This may cause
>   considerable increase of space usage depending on the broken up
>   ref-links.
>
> I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
> snapshots.
> Therefore, I should better avoid calling "btrfs filesystem defragment -r"?
>
> What is the defragmenting best practice?
> Avoid it completely?

My question is the same as the OP's in this thread, so I came here to
read the answers before asking. However, it turns out that I still
need to ask something. Should I ask here or start a new thread? (I'll
assume here, since the topic is the same.)

Based on the answers here, it sounds like I should not run defrag at
all. However, I have a performance problem I need to solve, so if I
don't defrag, I need to do something else.

Here's my scenario. Some months ago I built an over-the-top powerful
desktop computer / workstation and I was looking forward to really
fantastic performance improvements over my 6 year old Ubuntu machine.
I installed Arch Linux on BTRFS on the new computer (on an SSD). To my
shock, it was no faster than my old machine. I focused a lot on
Firefox performance because I use Firefox a lot and that was one of
the applications in which I was most looking forward to better
performance.

I tried everything I could think of and everything recommended to me
in various forums (except switching to Windows) and the performance
remained very disappointing.

Then today I read the following:

Gotchas - btrfs Wiki
https://btrfs.wiki.kernel.org/index.php/Gotchas

  Fragmentation: Files with a lot of random writes can become
  heavily fragmented (1+ extents) causing excessive multi-second
  spikes of CPU load on systems with an SSD or large amount of RAM. On
  desktops this primarily affects application databases (including
  Firefox). Workarounds include manually defragmenting your home
  directory using btrfs fi defragment. Auto-defragment (mount option
  autodefrag) should solve this problem.

Upon reading that I am wondering if fragmentation in the Firefox
profile is part of my issue. That's one thing I never tested
previously. (BTW, this system has 256 GB of RAM and 20 cores.)

Furthermore, on the same BTRFS Wiki page, it mentions the performance
penalties of many snapshots. I am keeping 30 to 50 snapshots of the
volume that contains the Firefox profile.

Would these two things be enough to turn top-of-the-line hardware into
a mediocre-performing desktop system? (The system performs fine on
benchmarks -- it's real life usage, particularly with Firefox, where it
is disappointing.)

After reading the info here, I am wondering if I should make a new
subvolume just for my Firefox profile(s) and not use COW and/or not
keep snapshots on it and mount it with the autodefrag option.

As part of this strategy, I could send snapshots to another disk using
btrfs send-receive. That way I would have the benefits of snapshots
(which are important to me), but by not keeping any snapshots on the
live subvolume I could avoid the performance problems.

What would you guys do in this situation?
[PATCH v2] btrfs: Make check for unaligned device size more gentle
Commit 7dfb8be11b5d ("btrfs: Round down values which are written for
total_bytes_size") is fixing the unaligned device size caused by
adding/shrinking device.

It added a new WARN_ON() when device size is unaligned. This is fine for new
device added to btrfs using v4.13 kernel, but not existing device whose
total_bytes is already unaligned. And the WARN_ON() will get triggered every
time a block group get created/removed on the unaligned device.

This patch will only enable the WARN_ON() for CONFIG_BTRFS_DEBUG so developer
won't miss any sanity check, but end user will not be troubled by tons of
warning. And more useful fix will be output for end users.

Reported-by: Rich Rauenzahn
Fixes: 7dfb8be11b5d ("btrfs: Round down values which are written for total_bytes_size")
Signed-off-by: Qu Wenruo
---
changelog:
v2:
  Use CONFIG_BTRFS_DEBUG to encapsulate the WARN_ON(), so end user won't see
  tons of kernel backtrace and developers can still catch the problem.
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/volumes.c | 16 ++++++++++++----
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5a8933da39a7..19b1a1789c52 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1562,7 +1562,9 @@ static inline void btrfs_set_device_total_bytes(struct extent_buffer *eb,
 {
 	BUILD_BUG_ON(sizeof(u64) !=
 		     sizeof(((struct btrfs_dev_item *)0))->total_bytes);
+#ifdef CONFIG_BTRFS_DEBUG
 	WARN_ON(!IS_ALIGNED(val, eb->fs_info->sectorsize));
+#endif
 	btrfs_set_64(eb, s, offsetof(struct btrfs_dev_item, total_bytes), val);
 }
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0e8f16c305df..afae25df6a8c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6472,15 +6472,23 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
 	return 0;
 }
 
-static void fill_device_from_item(struct extent_buffer *leaf,
-				  struct btrfs_dev_item *dev_item,
-				  struct btrfs_device *device)
+static void fill_device_from_item(struct btrfs_fs_info *fs_info,
+				  struct extent_buffer *leaf,
+				  struct btrfs_dev_item *dev_item,
+				  struct btrfs_device *device)
 {
 	unsigned long ptr;
 
 	device->devid = btrfs_device_id(leaf, dev_item);
 	device->disk_total_bytes = btrfs_device_total_bytes(leaf, dev_item);
 	device->total_bytes = device->disk_total_bytes;
+	if (!IS_ALIGNED(device->total_bytes, fs_info->sectorsize)) {
+		btrfs_warn(fs_info,
+			"devid %llu has unaligned total bytes %llu",
+			device->devid, device->disk_total_bytes);
+		btrfs_warn(fs_info,
+			"please shrink the device a little and resize back to fix it");
+	}
 	device->commit_total_bytes = device->disk_total_bytes;
 	device->bytes_used = btrfs_device_bytes_used(leaf, dev_item);
 	device->commit_bytes_used = device->bytes_used;
@@ -6625,7 +6633,7 @@ static int read_one_dev(struct btrfs_fs_info *fs_info,
 		return -EINVAL;
 	}
 
-	fill_device_from_item(leaf, dev_item, device);
+	fill_device_from_item(fs_info, leaf, dev_item, device);
 	device->in_fs_metadata = 1;
 	if (device->writeable && !device->is_tgtdev_for_dev_replace) {
 		device->fs_devices->total_rw_bytes += device->total_bytes;
-- 
2.14.1
Re: WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]
On 2017年09月20日 14:11, nborisov wrote: On 2017-09-20 07:39, Qu Wenruo wrote: On 2017年09月20日 13:10, Qu Wenruo wrote: On 2017年09月20日 12:59, Qu Wenruo wrote: On 2017年09月20日 12:49, Rich Rauenzahn wrote: On 9/19/2017 5:31 PM, Qu Wenruo wrote: On 2017年09月19日 23:56, Rich Rauenzahn wrote: [ 4.747377] WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs] Is that line the following WARN_ON()? --- static inline void btrfs_set_device_total_bytes(struct extent_buffer *eb, struct btrfs_dev_item *s, u64 val) { BUILD_BUG_ON(sizeof(u64) != sizeof(((struct btrfs_dev_item *)0))->total_bytes); WARN_ON(!IS_ALIGNED(val, eb->fs_info->sectorsize)); <<< btrfs_set_64(eb, s, offsetof(struct btrfs_dev_item, total_bytes), val); } --- If so, that means your devices size is not aligned to 4K. Is your block device still using old 512 block size? AFAIK nowadays most HDDs are using 4K blocksize and it's recommended to use it. It's not a big problem and one can easily remove the WARN_ON(). But I think we'd better fix the caller to do round_down() before calling this function. That's interesting! I believe I made an effort to align them when I set it up years ago, but never knew how to verify. Well, best verifying if that's the line causing the warning, since I don't have the source of RedHat kernel. I have three mirrored filesystems: [snip] Number Start (sector) End (sector) Size Code Name 1 40 3907029134 1.8 TiB 8300 BTRFS MEDIA GPT fdisk (gdisk) version 0.8.6 At least this size is not aligned to 4K. Partition table scan: [snip] .and one is aligned differently! Could it be /dev/sdd that's the issue? But it's aligned at 4096 -- so I'm not sure that's the issue after all. Its start sector is aligned, but end point is not, so the size is not aligned either. BTW, is /dev/sdd added to btrfs using "btrfs device add"? In my test, if making btrfs on a unaligned file, it will round down to its sectorsize boundary. 
Confirmed that "btrfs device add" won't round down the size. Check the btrfs-debug-tree output: -- item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98 devid 1 total_bytes 10737418240 bytes_used 2172649472 io_align 4096 io_width 4096 sector_size 4096 type 0 generation 0 start_offset 0 dev_group 0 seek_speed 0 bandwidth 0 uuid 243a1117-ca31-4d87-8656-81c5630aafb2 fsid 6452cde7-14d5-4541-aa07-b265a400bad0 item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98 devid 2 total_bytes 1073742336 bytes_used 0 io_align 4096 io_width 4096 sector_size 4096 type 0 generation 0 start_offset 0 dev_group 0 seek_speed 0 bandwidth 0 uuid 6bb07260-d230-4e22-88b1-1eabb46622ed fsid 6452cde7-14d5-4541-aa07-b265a400bad0 -- Sorry, the output is from v4.12.x, so no kernel warning nor the patch rounding down the value. Where first device is completely aligned, the 2nd device which is just 1G + 512, definitely not aligned. So if you're using single device purely created by mkfs.btrfs, you're OK. But if any new device added, you're not OK and causing the false alert. Any way, it should not be hard to fix. Just remove the WARN_ON() and add extra round_down when adding device. In v4.13 kernel, the newly added devices are in fact rounded down. But existing device doesn't get the round down. We got a report internally at Suse of this problem and it prevented a filesystem from being mounted due to the following check failing: http://elixir.free-electrons.com/linux/latest/source/fs/btrfs/volumes.c#L6893. Hence I added the rounding down fixes. And this warning was put specifically to catch future offenders and see if I had missed a place to patch it. So removing the warning is the wrong solution to the problem. Then at least only enable it for BTRFS_DEBUG. For end user it's just confusing. I have submitted a patch to do the check at mounting time, and warning user to do shrink to fix it. 
(Although it still removed the WARN_ON.)

I think such a warning about something that breaks backward compatibility
should be as gentle as possible for end users.

Thanks,
Qu

Generally, if he resized his disk, everything would be back to normal.

> So it's recommended to resize (shrink) your fs by a very small size to fix
> it, if you don't want to wait for the kernel fix.
>
> Thanks,
> Qu

Thanks for the report,
Qu

So I'm wondering if it's caused by an added new btrfs device.

Thanks,
Qu
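For reference, the alignment arithmetic this thread turns on is simple to check in userspace. A sketch whose `is_aligned`/`round_down` helpers mirror the kernel's IS_ALIGNED()/round_down() macros for power-of-two alignments; the sizes are the devid 1 and devid 2 values (1 GiB + 512 bytes) from the btrfs-debug-tree output above:

```python
SECTORSIZE = 4096  # fs_info->sectorsize in the thread's examples

def is_aligned(val, align):
    """Mirror of the kernel's IS_ALIGNED(val, align) for power-of-two align."""
    return (val & (align - 1)) == 0

def round_down(val, align):
    """Mirror of the kernel's round_down(): clear the low bits."""
    return val & ~(align - 1)

# devid 1 from the btrfs-debug-tree output: fully aligned
print(is_aligned(10737418240, SECTORSIZE))  # True

# devid 2: 1 GiB + 512 bytes, added with "btrfs device add" -> unaligned,
# which is exactly what trips the WARN_ON discussed here
print(is_aligned(1073742336, SECTORSIZE))   # False
print(round_down(1073742336, SECTORSIZE))   # 1073741824 (back to 1 GiB)
```

This is why a tiny shrink-and-grow resize fixes the warning: the resize path writes back a total_bytes value that has been rounded down to the sector size.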
[PATCH] btrfs: Remove WARN_ON for unaligned device created before v4.13 and adds more user friendly output
Commit 7dfb8be11b5d ("btrfs: Round down values which are written for
total_bytes_size") is fixing the unaligned device size caused by
adding/shrinking device.

It added a new WARN_ON() when device size is unaligned. This is fine for new
device added to btrfs using v4.13 kernel, but not existing device whose
total_bytes is already unaligned. And the WARN_ON() will get triggered every
time a block group get created/removed on the unaligned device.

This patch will remove the WARN_ON(), and warn the user more gently about
what's happening and how to fix it.

Reported-by: Rich Rauenzahn
Fixes: 7dfb8be11b5d ("btrfs: Round down values which are written for total_bytes_size")
Signed-off-by: Qu Wenruo
---
 fs/btrfs/ctree.h   |  1 -
 fs/btrfs/volumes.c | 16 ++++++++++++----
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5a8933da39a7..4de9269e435a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1562,7 +1562,6 @@ static inline void btrfs_set_device_total_bytes(struct extent_buffer *eb,
 {
 	BUILD_BUG_ON(sizeof(u64) !=
 		     sizeof(((struct btrfs_dev_item *)0))->total_bytes);
-	WARN_ON(!IS_ALIGNED(val, eb->fs_info->sectorsize));
 	btrfs_set_64(eb, s, offsetof(struct btrfs_dev_item, total_bytes), val);
 }
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0e8f16c305df..afae25df6a8c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6472,15 +6472,23 @@ static int read_one_chunk(struct btrfs_fs_info *fs_info, struct btrfs_key *key,
 	return 0;
 }
 
-static void fill_device_from_item(struct extent_buffer *leaf,
-				  struct btrfs_dev_item *dev_item,
-				  struct btrfs_device *device)
+static void fill_device_from_item(struct btrfs_fs_info *fs_info,
+				  struct extent_buffer *leaf,
+				  struct btrfs_dev_item *dev_item,
+				  struct btrfs_device *device)
 {
 	unsigned long ptr;
 
 	device->devid = btrfs_device_id(leaf, dev_item);
 	device->disk_total_bytes = btrfs_device_total_bytes(leaf, dev_item);
 	device->total_bytes = device->disk_total_bytes;
+	if (!IS_ALIGNED(device->total_bytes, fs_info->sectorsize)) {
+		btrfs_warn(fs_info,
+			"devid %llu has unaligned total bytes %llu",
+			device->devid, device->disk_total_bytes);
+		btrfs_warn(fs_info,
+			"please shrink the device a little and resize back to fix it");
+	}
 	device->commit_total_bytes = device->disk_total_bytes;
 	device->bytes_used = btrfs_device_bytes_used(leaf, dev_item);
 	device->commit_bytes_used = device->bytes_used;
@@ -6625,7 +6633,7 @@ static int read_one_dev(struct btrfs_fs_info *fs_info,
 		return -EINVAL;
 	}
 
-	fill_device_from_item(leaf, dev_item, device);
+	fill_device_from_item(fs_info, leaf, dev_item, device);
 	device->in_fs_metadata = 1;
 	if (device->writeable && !device->is_tgtdev_for_dev_replace) {
 		device->fs_devices->total_rw_bytes += device->total_bytes;
-- 
2.14.1
[PATCH v2] fs:btrfs: return -ENOMEM on allocation failure.
v2: return -ENOMEM from btrfsic_dev_state_alloc() too.

Signed-off-by: Allen Pais
---
 fs/btrfs/check-integrity.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 7d5a9b5..9db1e76 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -2913,7 +2913,7 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info,
 	state = kvzalloc(sizeof(*state), GFP_KERNEL);
 	if (!state) {
 		pr_info("btrfs check-integrity: allocation failed!\n");
-		return -1;
+		return -ENOMEM;
 	}

 	if (!btrfsic_is_initialized) {
@@ -2945,7 +2945,7 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info,
 		if (NULL == ds) {
 			pr_info("btrfs check-integrity: kmalloc() failed!\n");
 			mutex_unlock(&btrfsic_mutex);
-			return -1;
+			return -ENOMEM;
 		}
 		ds->bdev = device->bdev;
 		ds->state = state;
--
2.7.4
Re: Problems with debian install
On 09/20/2017 02:39 AM, Qu Wenruo wrote:
> On 2017-09-20 05:35, Pierre Couderc wrote:
>> I am trying to install stretch on a computer with 2 btrfs disks and EFI.
>> Is there a howto to do that? Did someone succeed?
>>
>> I get problems as soon as the partitioning step in the debian installer.
>> What partitions do I need? I understand that grub needs them even if
>> btrfs does not require them. But I am lost, I think due to EFI...
>
> For EFI, you must have one EFI system partition (ESP), which must be
> formatted as FAT. So at the very least you can't use btrfs for the ESP.
>
> Apart from that, anything that can be read and understood by the kernel
> can hold your /boot (mainly for grub) or /.
>
> It is even simpler with systemd-boot (previously called gummiboot): just
> put the kernel (with EFI stub configured) and the initrd into the ESP,
> mount the ESP as /boot, and everything is done.
>
> For reference, my current fs layout is just like this:
>
> nvme0n1         259:0  0 238.5G 0 disk
> ├─nvme0n1p1     259:1  0   512M 0 part /boot
> └─nvme0n1p2     259:2  0   238G 0 part
>   ├─system-root 254:0  0    32G 0 lvm  /
>   ├─system-swap 254:1  0     4G 0 lvm  [SWAP]
>   └─system-home 254:2  0   150G 0 lvm  /home
>
> As you can see, I'm using systemd-boot, reusing the ESP as /boot. Beyond
> that, as long as a filesystem or stacked block device is supported by the
> kernel, you can use it as you wish.
>
>> I see nothing on that in google.
>
> Well, Archlinux has quite a nice wiki page for this:
> https://wiki.archlinux.org/index.php/Unified_Extensible_Firmware_Interface

Thank you very much. I have to study that carefully, as I am lost in your
terminology, being used to debian...

PC
Re: WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]
On 2017-09-20 07:39, Qu Wenruo wrote:
> On 2017-09-20 13:10, Qu Wenruo wrote:
>> On 2017-09-20 12:59, Qu Wenruo wrote:
>>> On 2017-09-20 12:49, Rich Rauenzahn wrote:
>>>> On 9/19/2017 5:31 PM, Qu Wenruo wrote:
>>>>> On 2017-09-19 23:56, Rich Rauenzahn wrote:
>>>>>> [    4.747377] WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559
>>>>>> btrfs_update_device+0x1c5/0x1d0 [btrfs]
>>>>>
>>>>> Is that line the following WARN_ON()?
>>>>> ---
>>>>> static inline void btrfs_set_device_total_bytes(struct extent_buffer *eb,
>>>>>                                                 struct btrfs_dev_item *s,
>>>>>                                                 u64 val)
>>>>> {
>>>>>         BUILD_BUG_ON(sizeof(u64) !=
>>>>>                      sizeof(((struct btrfs_dev_item *)0))->total_bytes);
>>>>>         WARN_ON(!IS_ALIGNED(val, eb->fs_info->sectorsize));   <<<
>>>>>         btrfs_set_64(eb, s, offsetof(struct btrfs_dev_item, total_bytes), val);
>>>>> }
>>>>> ---
>>>>> If so, that means your device size is not aligned to 4K.
>>>>>
>>>>> Is your block device still using the old 512 block size? AFAIK
>>>>> nowadays most HDDs are using 4K blocksize and it's recommended to
>>>>> use it.
>>>>>
>>>>> It's not a big problem and one can easily remove the WARN_ON(). But
>>>>> I think we'd better fix the caller to do round_down() before calling
>>>>> this function.
>>>>
>>>> That's interesting! I believe I made an effort to align them when I
>>>> set it up years ago, but never knew how to verify.
>>>
>>> Well, best to verify whether that's the line causing the warning,
>>> since I don't have the source of the RedHat kernel.
>>>
>>>> I have three mirrored filesystems:
>>>> [snip]
>>>> Number  Start (sector)    End (sector)  Size       Code  Name
>>>>    1                40      3907029134  1.8 TiB    8300  BTRFS MEDIA
>>>>
>>>> GPT fdisk (gdisk) version 0.8.6
>>>
>>> At least this size is not aligned to 4K.
>>>
>>>> Partition table scan:
>>>> [snip]
>>>>
>>>> ...and one is aligned differently! Could it be /dev/sdd that's the
>>>> issue? But it's aligned at 4096 -- so I'm not sure that's the issue
>>>> after all.
>>>
>>> Its start sector is aligned, but its end point is not, so the size is
>>> not aligned either.
>>>
>>> BTW, is /dev/sdd added to btrfs using "btrfs device add"? In my test,
>>> when making btrfs on an unaligned file, it rounds down to the
>>> sectorsize boundary.
>>
>> Confirmed that "btrfs device add" won't round down the size.
>> Check the btrfs-debug-tree output:
>> --
>> item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
>>         devid 1 total_bytes 10737418240 bytes_used 2172649472
>>         io_align 4096 io_width 4096 sector_size 4096 type 0
>>         generation 0 start_offset 0 dev_group 0
>>         seek_speed 0 bandwidth 0
>>         uuid 243a1117-ca31-4d87-8656-81c5630aafb2
>>         fsid 6452cde7-14d5-4541-aa07-b265a400bad0
>> item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98
>>         devid 2 total_bytes 1073742336 bytes_used 0
>>         io_align 4096 io_width 4096 sector_size 4096 type 0
>>         generation 0 start_offset 0 dev_group 0
>>         seek_speed 0 bandwidth 0
>>         uuid 6bb07260-d230-4e22-88b1-1eabb46622ed
>>         fsid 6452cde7-14d5-4541-aa07-b265a400bad0
>> --
>> Sorry, the output is from v4.12.x, so there is no kernel warning nor
>> the patch rounding down the value.
>>
>> While the first device is completely aligned, the 2nd device, which is
>> just 1G + 512, is definitely not aligned.
>>
>> So if you're using a single device purely created by mkfs.btrfs, you're
>> OK. But if any new device was added, you're not OK, and this causes the
>> false alert.
>>
>> Anyway, it should not be hard to fix. Just remove the WARN_ON() and add
>> an extra round_down() when adding a device.

In the v4.13 kernel, newly added devices are in fact rounded down, but an
existing device doesn't get the round down. We got a report internally at
SUSE of this problem, and it prevented a filesystem from being mounted due
to the following check failing:
http://elixir.free-electrons.com/linux/latest/source/fs/btrfs/volumes.c#L6893
Hence I added the rounding-down fixes. And this warning was put there
specifically to catch future offenders and see if I had missed a place to
patch. So removing the warning is the wrong solution to the problem.
Generally, if balancing kicked in or he had to resize his disk, everything
would be back to normal.

>> So it's recommended to resize (shrink) your fs by a very small size to
>> fix it, if you don't want to wait for the kernel fix.
>>
>> Thanks,
>> Qu
>
> Thanks for the report,
> Qu
>
> So I'm wondering if it's caused by an added new btrfs device.
Thanks,
Qu