date:20170920

The previous patch series was based on v4.12.14, and this introductory
text was missing.

This patch series fixes some weaknesses in the btrfs LOGICAL_INO ioctl.

Background:

Suppose we have a file with one extent:

root@tester:~# zcat /usr/share/doc/cpio/changelog.gz > /test/a
root@tester:~# sync

Split the extent by overwriting it in the middle:

root@tester:~# cat /dev/urandom | dd bs=4k seek=2 skip=2 count=1 
conv=notrunc of=/test/a

We should now have 3 extent refs to 2 extents, with one block unreachable.
The extent tree looks like:

root@tester:~# btrfs-debug-tree /dev/vdc -t 2
[...]
item 9 key (1103101952 EXTENT_ITEM 73728) itemoff 15942 itemsize 53
extent refs 2 gen 29 flags DATA
extent data backref root 5 objectid 261 offset 0 count 2
[...]
item 11 key (1103175680 EXTENT_ITEM 4096) itemoff 15865 itemsize 53
extent refs 1 gen 30 flags DATA
extent data backref root 5 objectid 261 offset 8192 count 1
[...]

and the ref tree looks like:

root@tester:~# btrfs-debug-tree /dev/vdc -t 5
[...]
item 6 key (261 EXTENT_DATA 0) itemoff 15825 itemsize 53
extent data disk byte 1103101952 nr 73728
extent data offset 0 nr 8192 ram 73728
extent compression(none)
item 7 key (261 EXTENT_DATA 8192) itemoff 15772 itemsize 53
extent data disk byte 1103175680 nr 4096
extent data offset 0 nr 4096 ram 4096
extent compression(none)
item 8 key (261 EXTENT_DATA 12288) itemoff 15719 itemsize 53
extent data disk byte 1103101952 nr 73728
extent data offset 12288 nr 61440 ram 73728
extent compression(none)
[...]

There are two references to the same extent with different, non-overlapping
byte offsets:

[--72K extent at 1103101952--]
[--8K|--4K unreachable|--60K-]
^ ^
| |
[--8K ref offset 0--][--4K ref offset 0--][--60K ref offset 12K--]
 | 
 v
 [-4K extent-] at 1103175680

We now want to find all of the references to extent bytenr 1103101952.

Without the patch (and without running btrfs-debug-tree), we have to
do it with 18 LOGICAL_INO calls:

root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO
inode 261 offset 0 root 5

root@tester:~# for x in $(seq 0 17); do btrfs ins log $((1103101952 + x * 
4096)) -P /test/; done 2>&1 | grep inode
inode 261 offset 0 root 5
inode 261 offset 4096 root 5   <- same extent ref as offset 0
   (offset 8192 returns empty set, not 
reachable)
inode 261 offset 12288 root 5
inode 261 offset 16384 root 5  \
inode 261 offset 20480 root 5  |
inode 261 offset 24576 root 5  |
inode 261 offset 28672 root 5  |
inode 261 offset 32768 root 5  |
inode 261 offset 36864 root 5  \
inode 261 offset 40960 root 5   > all the same extent ref as offset 12288.
inode 261 offset 45056 root 5  /  More processing required in userspace
inode 261 offset 49152 root 5  |  to figure out these are all duplicates.
inode 261 offset 53248 root 5  |
inode 261 offset 57344 root 5  |
inode 261 offset 61440 root 5  |
inode 261 offset 65536 root 5  |
inode 261 offset 69632 root 5  /

In the worst case the extents are 128MB long, and we have to do 32768
iterations of the loop to find one 4K extent ref.

With the patch, we just use one call to map all refs to the extent at once:

root@tester:~# btrfs ins log 1103101952 -P /test/
Using LOGICAL_INO_V2
inode 261 offset 0 root 5
inode 261 offset 12288 root 5

The TREE_SEARCH ioctl allows userspace to retrieve the offset and
extent bytenr fields easily once the root, inode and offset are known.
This is sufficient information to build a complete map of the extent
and all of its references.  Userspace can use this information to make
better choices to dedup or defrag.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 3/3] btrfs: increase output size for LOGICAL_INO_V2 ioctl

Build-server workloads have hundreds of references per file after dedup.
Multiply by a few snapshots and we quickly exhaust the limit of 2730
references per extent that can fit into a 64K buffer.

Raise the limit to 16M to be consistent with other btrfs ioctls
(e.g. TREE_SEARCH_V2, FILE_EXTENT_SAME).

To minimize surprising userspace behavior, apply this change only to
the LOGICAL_INO_V2 ioctl.

Signed-off-by: Zygo Blaxell 
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2bc3a9588d1d..4be9b1791f58 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4554,6 +4554,7 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
 
if (version == 1) {
ignore_offset = false;
+   size = min_t(u32, loi->size, SZ_64K);
} else {
/* Only accept flags we have defined so far */
if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) {
@@ -4561,6 +4562,7 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
goto out_loi;
}
ignore_offset = loi->flags & 
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+   size = min_t(u32, loi->size, SZ_16M);
}
 
path = btrfs_alloc_path();
@@ -4569,7 +4571,6 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
goto out;
}
 
-   size = min_t(u32, loi->size, SZ_64K);
inodes = init_data_container(size);
if (IS_ERR(inodes)) {
ret = PTR_ERR(inodes);
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/3] btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2

Now that check_extent_in_eb()'s extent offset filter can be turned off,
we need a way to do it from userspace.

Add a 'flags' field to the btrfs_logical_ino_args structure to disable extent
offset filtering, taking the place of one of the reserved[] fields.

Previous versions of LOGICAL_INO neglected to check whether any of the
reserved fields have non-zero values.  Assigning meaning to those fields
now may change the behavior of existing programs that left these fields
uninitialized.

To avoid any surprises, define a new ioctl LOGICAL_INO_V2 which uses
the same argument layout as LOGICAL_INO, but uses one of the reserved
fields for flags.  The V2 ioctl explicitly checks that unsupported flag
bits are zero so that userspace can probe for future feature bits as
they are defined.  If the other reserved fields are used in the future,
one of the remaining flag bits could specify that the other reserved
fields are valid, so we don't need to check those for now.

Since the memory layouts and behavior of the two ioctls' arguments
are almost identical, there is no need for a separate function for
logical_to_ino_v2 (contrast with tree_search_v2 vs tree_search).
A version parameter and an 'if' statement will suffice.

Now that we have a flags field in logical_ino_args, add a flag
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want,
and pass it down the stack to iterate_inodes_from_logical.

Signed-off-by: Zygo Blaxell 
---
 fs/btrfs/ioctl.c   | 21 ++---
 include/uapi/linux/btrfs.h |  8 +++-
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b7de32568082..2bc3a9588d1d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4536,13 +4536,14 @@ static int build_ino_list(u64 inum, u64 offset, u64 
root, void *ctx)
 }
 
 static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
-   void __user *arg)
+   void __user *arg, int version)
 {
int ret = 0;
int size;
struct btrfs_ioctl_logical_ino_args *loi;
struct btrfs_data_container *inodes = NULL;
struct btrfs_path *path = NULL;
+   bool ignore_offset;
 
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
@@ -4551,6 +4552,17 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
if (IS_ERR(loi))
return PTR_ERR(loi);
 
+   if (version == 1) {
+   ignore_offset = false;
+   } else {
+   /* Only accept flags we have defined so far */
+   if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) {
+   ret = -EINVAL;
+   goto out_loi;
+   }
+   ignore_offset = loi->flags & 
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+   }
+
path = btrfs_alloc_path();
if (!path) {
ret = -ENOMEM;
@@ -4566,7 +4578,7 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
}
 
ret = iterate_inodes_from_logical(loi->logical, fs_info, path,
- build_ino_list, inodes, false);
+ build_ino_list, inodes, 
ignore_offset);
if (ret == -EINVAL)
ret = -ENOENT;
if (ret < 0)
@@ -4580,6 +4592,7 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
 out:
btrfs_free_path(path);
kvfree(inodes);
+out_loi:
kfree(loi);
 
return ret;
@@ -5550,7 +5563,9 @@ long btrfs_ioctl(struct file *file, unsigned int
case BTRFS_IOC_INO_PATHS:
return btrfs_ioctl_ino_to_path(root, argp);
case BTRFS_IOC_LOGICAL_INO:
-   return btrfs_ioctl_logical_to_ino(fs_info, argp);
+   return btrfs_ioctl_logical_to_ino(fs_info, argp, 1);
+   case BTRFS_IOC_LOGICAL_INO_V2:
+   return btrfs_ioctl_logical_to_ino(fs_info, argp, 2);
case BTRFS_IOC_SPACE_INFO:
return btrfs_ioctl_space_info(fs_info, argp);
case BTRFS_IOC_SYNC: {
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 378230c163d5..0b3de597e04f 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -608,10 +608,14 @@ struct btrfs_ioctl_ino_path_args {
 struct btrfs_ioctl_logical_ino_args {
__u64   logical;/* in */
__u64   size;   /* in */
-   __u64   reserved[4];
+   __u64   flags;  /* in, v2 only */
+   __u64   reserved[3];
/* struct btrfs_data_container  *inodes;out   */
__u64   inodes;
 };
+/* Return every ref to the extent, not just those containing logical block.
+ * Requires logical ==

[PATCH 1/3] btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents

The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and
offset (encoded as a single logical address) to a list of extent refs.
LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping
(extent ref -> extent bytenr and offset, or logical address).  These are
useful capabilities for programs that manipulate extents and extent
references from userspace (e.g. dedup and defrag utilities).

When the extents are uncompressed (and not encrypted and not other),
check_extent_in_eb performs filtering of the extent refs to remove any
extent refs which do not contain the same extent offset as the 'logical'
parameter's extent offset.  This prevents LOGICAL_INO from returning
references to more than a single block.

To find the set of extent references to an uncompressed extent from [a,
b), userspace has to run a loop like this pseudocode:

for (i = a; i < b; ++i)
extent_ref_set += LOGICAL_INO(i);

At each iteration of the loop (up to 32768 iterations for a 128M extent),
data we are interested in is collected in the kernel, then deleted by
the filter in check_extent_in_eb.

When the extents are compressed (or encrypted or other), the 'logical'
parameter must be an extent bytenr (the 'a' parameter in the loop).
No filtering by extent offset is done (or possible?) so the result is
the complete set of extent refs for the entire extent.  This removes
the need for the loop, since we get all the extent refs in one call.

Add an 'ignore_offset' argument to iterate_inodes_from_logical,
[...several levels of function call graph...], and check_extent_in_eb, so
that we can disable the extent offset filtering for uncompressed extents.
This flag can be set by an improved version of the LOGICAL_INO ioctl to
get either behavior as desired.

There is no functional change in this patch.  The new flag is always
false.

Signed-off-by: Zygo Blaxell 
---
 fs/btrfs/backref.c| 63 ++-
 fs/btrfs/backref.h|  8 +++---
 fs/btrfs/inode.c  |  2 +-
 fs/btrfs/ioctl.c  |  2 +-
 fs/btrfs/qgroup.c |  8 +++---
 fs/btrfs/scrub.c  |  6 ++---
 fs/btrfs/send.c   |  2 +-
 fs/btrfs/tests/qgroup-tests.c | 20 +++---
 8 files changed, 63 insertions(+), 48 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index b517ef1477ea..a2609786cd86 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -40,12 +40,14 @@ static int check_extent_in_eb(const struct btrfs_key *key,
  const struct extent_buffer *eb,
  const struct btrfs_file_extent_item *fi,
  u64 extent_item_pos,
- struct extent_inode_elem **eie)
+ struct extent_inode_elem **eie,
+ bool ignore_offset)
 {
u64 offset = 0;
struct extent_inode_elem *e;
 
-   if (!btrfs_file_extent_compression(eb, fi) &&
+   if (!ignore_offset &&
+   !btrfs_file_extent_compression(eb, fi) &&
!btrfs_file_extent_encryption(eb, fi) &&
!btrfs_file_extent_other_encoding(eb, fi)) {
u64 data_offset;
@@ -84,7 +86,8 @@ static void free_inode_elem_list(struct extent_inode_elem 
*eie)
 
 static int find_extent_in_eb(const struct extent_buffer *eb,
 u64 wanted_disk_byte, u64 extent_item_pos,
-struct extent_inode_elem **eie)
+struct extent_inode_elem **eie,
+bool ignore_offset)
 {
u64 disk_byte;
struct btrfs_key key;
@@ -113,7 +116,7 @@ static int find_extent_in_eb(const struct extent_buffer *eb,
if (disk_byte != wanted_disk_byte)
continue;
 
-   ret = check_extent_in_eb(, eb, fi, extent_item_pos, eie);
+   ret = check_extent_in_eb(, eb, fi, extent_item_pos, eie, 
ignore_offset);
if (ret < 0)
return ret;
}
@@ -419,7 +422,7 @@ static int add_indirect_ref(const struct btrfs_fs_info 
*fs_info,
 static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
   struct ulist *parents, struct prelim_ref *ref,
   int level, u64 time_seq, const u64 *extent_item_pos,
-  u64 total_refs)
+  u64 total_refs, bool ignore_offset)
 {
int ret = 0;
int slot;
@@ -472,7 +475,7 @@ static int add_all_parents(struct btrfs_root *root, struct 
btrfs_path *path,
if (extent_item_pos) {
ret = check_extent_in_eb(, eb, fi,
*extent_item_pos,
-   );
+   ,

Re: btrfs-progs: suggestion of removing --commit-after option of subvol delete




On 2017年09月21日 10:49, Misono, Tomohiro wrote:

On 2017/09/20 23:21, Qu Wenruo wrote:



On 2017年09月20日 22:03, David Sterba wrote:

On Wed, Sep 20, 2017 at 08:22:54AM +0800, Qu Wenruo wrote:

The costly part will be tracking the filesystems of subvolumes.
We must do it for each subvolume and batch them to address the final
transaction commit for each fs.

I didn't see any easy ioctl to get the UUID from fd, meaning (if didn't
miss anything) we need to iterate the path until reaching the mount
boundary and then refer to mountinfo to find the fs.


The FS_INFO ioctl will tell you the fsid from a fd.  Iterating the paths
will not work reliably, due to some potentially very creative use of
mounts that could build up the path components.


The biggest concern is about how to get fsid reliably.
And since FS_INFO ioctl solves this, I'm completely fine.
(I just search "UUID" for ioctls and get no hit, so I assume it may be a
problem)




Not to mention that the possible softlink in the path may make things
more complex.

Yes, this may be fixed with tons of code, but I don't think the
complexity worthy for a minor feature.


So what if we fix it like that:

- iterate over subvolumes
- check if the fsid is same as for the previous subvolume
- if it is, continue
- if not, do the sync


I prefer to maintain a subvolume <-> fs mapping, and only address
transaction commit after all subvolumes are iterated.



IOW, sync when we leave subvolumes of one filesystem. In the degenerate
case, we can have subvolumes interleaved that we'd sync after each,
effectively --commit-each.


Since we have a subvolume <-> fs mapping, we can commit each fs after
iterating all subvolumes.

AFAIK "--commit-after" user would like to introduce the minimal commits,
so each fs only get committed once seems better to me.



The simple tracking should avoid keeping all the filedescriptors open
and I think this won't need tons of code to implement.


Indeed, since there is reliable fs UUID ioctl.

And above subvolume <-> fs mapping can also be used to resolve missing
fd problems.
Just adding a fds <-> fs mapping, and try to call commit ioctl on first
good fd.



The typical could be 'btrfs subvol delete -c path/*', ie. deleting from
just one filesystem. If we happen to cross more filesystems, each of
them will get the sync, but possibly more than one. I don't think this
is too serious.


Well, after we implemented above user case, some facility can be reused
for this case.

Thanks,
Qu


OK, I understood the points and will try to make a patch.
btw, by saying "some facility can be reused", do you mean we can reuse
fds <-> fs mapping function to other than delete?


Yes, I think there may be some other use cases can benefit from such 
mapping.


Thanks,
Qu



Thanks,
Tomohiro

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: btrfs-progs: suggestion of removing --commit-after option of subvol delete

2017-09-20 Thread Misono, Tomohiro

On 2017/09/20 23:21, Qu Wenruo wrote:
> 
> 
> On 2017年09月20日 22:03, David Sterba wrote:
>> On Wed, Sep 20, 2017 at 08:22:54AM +0800, Qu Wenruo wrote:
>>> The costly part will be tracking the filesystems of subvolumes.
>>> We must do it for each subvolume and batch them to address the final
>>> transaction commit for each fs.
>>>
>>> I didn't see any easy ioctl to get the UUID from fd, meaning (if didn't
>>> miss anything) we need to iterate the path until reaching the mount
>>> boundary and then refer to mountinfo to find the fs.
>>
>> The FS_INFO ioctl will tell you the fsid from a fd.  Iterating the paths
>> will not work reliably, due to some potentially very creative use of
>> mounts that could build up the path components.
> 
> The biggest concern is about how to get fsid reliably.
> And since FS_INFO ioctl solves this, I'm completely fine.
> (I just search "UUID" for ioctls and get no hit, so I assume it may be a 
> problem)
> 
>>
>>> Not to mention that the possible softlink in the path may make things
>>> more complex.
>>>
>>> Yes, this may be fixed with tons of code, but I don't think the
>>> complexity worthy for a minor feature.
>>
>> So what if we fix it like that:
>>
>> - iterate over subvolumes
>> - check if the fsid is same as for the previous subvolume
>>- if it is, continue
>>- if not, do the sync
> 
> I prefer to maintain a subvolume <-> fs mapping, and only address 
> transaction commit after all subvolumes are iterated.
> 
>>
>> IOW, sync when we leave subvolumes of one filesystem. In the degenerate
>> case, we can have subvolumes interleaved that we'd sync after each,
>> effectively --commit-each.
> 
> Since we have a subvolume <-> fs mapping, we can commit each fs after 
> iterating all subvolumes.
> 
> AFAIK "--commit-after" user would like to introduce the minimal commits, 
> so each fs only get committed once seems better to me.
> 
>>
>> The simple tracking should avoid keeping all the filedescriptors open
>> and I think this won't need tons of code to implement.
> 
> Indeed, since there is reliable fs UUID ioctl.
> 
> And above subvolume <-> fs mapping can also be used to resolve missing 
> fd problems.
> Just adding a fds <-> fs mapping, and try to call commit ioctl on first 
> good fd.
> 
>>
>> The typical could be 'btrfs subvol delete -c path/*', ie. deleting from
>> just one filesystem. If we happen to cross more filesystems, each of
>> them will get the sync, but possibly more than one. I don't think this
>> is too serious.
> 
> Well, after we implemented above user case, some facility can be reused 
> for this case.
> 
> Thanks,
> Qu

OK, I understood the points and will try to make a patch.
btw, by saying "some facility can be reused", do you mean we can reuse
fds <-> fs mapping function to other than delete?

Thanks,
Tomohiro

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] fstests: btrfs/150 regression test for reading compressed data

We had a bug in btrfs compression code which could end up with a
kernel panic.

This is adding a regression test for the bug and I've also sent a
kernel patch to fix the bug.

The patch is "Btrfs: fix kernel oops while reading compressed data".

Signed-off-by: Liu Bo 
---
 tests/btrfs/150 | 102 
 tests/btrfs/150.out |   3 ++
 tests/btrfs/group   |   1 +
 3 files changed, 106 insertions(+)
 create mode 100755 tests/btrfs/150
 create mode 100644 tests/btrfs/150.out

diff --git a/tests/btrfs/150 b/tests/btrfs/150
new file mode 100755
index 000..834be51
--- /dev/null
+++ b/tests/btrfs/150
@@ -0,0 +1,102 @@
+#! /bin/bash
+# FS QA Test btrfs/150
+#
+# This is a regression test which ends up with a kernel oops in btrfs.
+# It occurs when btrfs's read repair happens while reading a compressed
+# extent.
+# The patch for this is 
+# x
+#
+#---
+# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_fail_make_request
+_require_scratch_dev_pool 2 
+
+SYSFS_BDEV=`_sysfs_dev $SCRATCH_DEV`
+enable_io_failure()
+{
+echo 100 > $DEBUGFS_MNT/fail_make_request/probability
+echo 1000 > $DEBUGFS_MNT/fail_make_request/times
+echo 0 > $DEBUGFS_MNT/fail_make_request/verbose
+echo 1 > $SYSFS_BDEV/make-it-fail
+}
+
+disable_io_failure()
+{
+echo 0 > $SYSFS_BDEV/make-it-fail
+echo 0 > $DEBUGFS_MNT/fail_make_request/probability
+echo 0 > $DEBUGFS_MNT/fail_make_request/times
+}
+
+_scratch_pool_mkfs "-d raid1 -b 1G" >> $seqres.full 2>&1
+
+# It doesn't matter which compression algorithm we use.
+_scratch_mount -ocompress
+
+# Create a file with all data being compressed
+$XFS_IO_PROG -f -c "pwrite -W 0 8K" $SCRATCH_MNT/foobar | _filter_xfs_io
+
+# Raid1 consists of two copies and btrfs decides which copy to read by reader's
+# %pid.  Now we inject errors to copy #1 and copy #0 is good.  We want to read
+# the bad copy to trigger read-repair.
+while true; do
+   disable_io_failure
+   # invalidate the page cache
+   $XFS_IO_PROG -f -c "fadvise -d 0 128K" $SCRATCH_MNT/foobar | 
_filter_xfs_io
+
+   enable_io_failure
+   od -x $SCRATCH_MNT/foobar > /dev/null &
+   pid=$!
+   wait
+   [ $((pid % 2)) == 1 ] && break
+done
+
+disable_io_failure
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/150.out b/tests/btrfs/150.out
new file mode 100644
index 000..c492c24
--- /dev/null
+++ b/tests/btrfs/150.out
@@ -0,0 +1,3 @@
+QA output created by 150
+wrote 8192/8192 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 70c3f05..b70a122 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -152,3 +152,4 @@
 147 auto quick send
 148 auto quick rw
 149 auto quick send compress
+150 auto quick
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 2/2] Btrfs: skip checksum when reading compressed data if some IO have failed

Currently even if the underlying disk reports failure on IO,
compressed read endio still gets to verify checksum and reports it as
a checksum error.

In fact, if some IO have failed during reading a compressed data
extent , there's no way the checksum could match, therefore, we can
skip that in order to return error quickly to the upper layer.

Please note that we need to do this after recording the failed mirror
index so that read-repair in the upper layer's endio can work
properly.

Signed-off-by: Liu Bo 
---
 fs/btrfs/compression.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 9a4ccdf..d8d4678 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -108,7 +108,7 @@ static void end_compressed_bio_read(struct bio *bio)
struct page *page;
unsigned long index;
unsigned int mirror = btrfs_io_bio(bio)->mirror_num;
-   int ret;
+   int ret = 0;
 
if (bio->bi_status)
cb->errors = 1;
@@ -127,6 +127,13 @@ static void end_compressed_bio_read(struct bio *bio)
btrfs_io_bio(cb->orig_bio)->mirror_num = mirror;
cb->mirror_num = mirror;
 
+   /*
+* Some IO in this cb have failed, just skip checksum as there
+* is no way it could be correct.
+*/
+   if (cb->errors == 1)
+   goto csum_failed;
+
inode = cb->inode;
ret = check_compressed_csum(BTRFS_I(inode), cb,
(u64)bio->bi_iter.bi_sector << 9);
-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/2] Btrfs: fix kernel oops while reading compressed data

The kernel oops happens at

kernel BUG at fs/btrfs/extent_io.c:2104!
...
RIP: clean_io_failure+0x263/0x2a0 [btrfs]

It's showing that read-repair code is using an improper mirror index.
This is due to the fact that compression read's endio hasn't recorded
the failed mirror index in %cb->orig_bio.

With this, btrfs's read-repair can work properly on reading compressed
data.

Signed-off-by: Liu Bo 
Reported-by: Paul Jones 
---
 fs/btrfs/compression.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index d2ef9ac..9a4ccdf 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -107,6 +107,7 @@ static void end_compressed_bio_read(struct bio *bio)
struct inode *inode;
struct page *page;
unsigned long index;
+   unsigned int mirror = btrfs_io_bio(bio)->mirror_num;
int ret;
 
if (bio->bi_status)
@@ -118,6 +119,14 @@ static void end_compressed_bio_read(struct bio *bio)
if (!refcount_dec_and_test(>pending_bios))
goto out;
 
+   /*
+* Record the correct mirror_num in cb->orig_bio so that
+* read-repair can work properly.
+*/
+   ASSERT(btrfs_io_bio(cb->orig_bio));
+   btrfs_io_bio(cb->orig_bio)->mirror_num = mirror;
+   cb->mirror_num = mirror;
+
inode = cb->inode;
ret = check_compressed_csum(BTRFS_I(inode), cb,
(u64)bio->bi_iter.bi_sector << 9);
-- 
2.9.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/3] btrfs: add a flag to iterate_inodes_from_logical to find all extent refs for uncompressed extents

The LOGICAL_INO ioctl provides a backward mapping from extent bytenr and
offset (encoded as a single logical address) to a list of extent refs.
LOGICAL_INO complements TREE_SEARCH, which provides the forward mapping
(extent ref -> extent bytenr and offset, or logical address).  These are
useful capabilities for programs that manipulate extents and extent
references from userspace (e.g. dedup and defrag utilities).

When the extents are uncompressed (and not encrypted and not other),
check_extent_in_eb performs filtering of the extent refs to remove any
extent refs which do not contain the same extent offset as the 'logical'
parameter's extent offset.  This prevents LOGICAL_INO from returning
references to more than a single block.

To find the set of extent references to an uncompressed extent from [a,
b), userspace has to run a loop like this pseudocode:

for (i = a; i < b; ++i)
extent_ref_set += LOGICAL_INO(i);

At each iteration of the loop (up to 32768 iterations for a 128M extent),
data we are interested in is collected in the kernel, then deleted by
the filter in check_extent_in_eb.

When the extents are compressed (or encrypted or other), the 'logical'
parameter must be an extent bytenr (the 'a' parameter in the loop).
No filtering by extent offset is done (or possible?) so the result is
the complete set of extent refs for the entire extent.  This removes
the need for the loop, since we get all the extent refs in one call.

Add an 'ignore_offset' argument to iterate_inodes_from_logical,
[...several levels of function call graph...], and check_extent_in_eb, so
that we can disable the extent offset filtering for uncompressed extents.
This flag can be set by an improved version of the LOGICAL_INO ioctl to
get either behavior as desired.

There is no functional change in this patch.  The new flag is always
false.

Signed-off-by: Zygo Blaxell 
---
 fs/btrfs/backref.c | 62 --
 fs/btrfs/backref.h |  8 ---
 fs/btrfs/inode.c   |  2 +-
 fs/btrfs/ioctl.c   |  2 +-
 fs/btrfs/qgroup.c  |  8 +++
 fs/btrfs/scrub.c   |  6 +++---
 fs/btrfs/send.c|  2 +-
 7 files changed, 52 insertions(+), 38 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 1d71a5a4b1b9..3bffd36c6897 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -302,12 +302,14 @@ static int ref_tree_add(struct ref_root *ref_tree, u64 
root_id, u64 object_id,
 static int check_extent_in_eb(struct btrfs_key *key, struct extent_buffer *eb,
struct btrfs_file_extent_item *fi,
u64 extent_item_pos,
-   struct extent_inode_elem **eie)
+   struct extent_inode_elem **eie,
+   bool ignore_offset)
 {
u64 offset = 0;
struct extent_inode_elem *e;
 
-   if (!btrfs_file_extent_compression(eb, fi) &&
+   if (!ignore_offset &&
+   !btrfs_file_extent_compression(eb, fi) &&
!btrfs_file_extent_encryption(eb, fi) &&
!btrfs_file_extent_other_encoding(eb, fi)) {
u64 data_offset;
@@ -346,7 +348,8 @@ static void free_inode_elem_list(struct extent_inode_elem 
*eie)
 
 static int find_extent_in_eb(struct extent_buffer *eb, u64 wanted_disk_byte,
u64 extent_item_pos,
-   struct extent_inode_elem **eie)
+   struct extent_inode_elem **eie,
+   bool ignore_offset)
 {
u64 disk_byte;
struct btrfs_key key;
@@ -375,7 +378,7 @@ static int find_extent_in_eb(struct extent_buffer *eb, u64 
wanted_disk_byte,
if (disk_byte != wanted_disk_byte)
continue;
 
-   ret = check_extent_in_eb(, eb, fi, extent_item_pos, eie);
+   ret = check_extent_in_eb(, eb, fi, extent_item_pos, eie, 
ignore_offset);
if (ret < 0)
return ret;
}
@@ -511,7 +514,7 @@ static int __add_prelim_ref(struct list_head *head, u64 
root_id,
 static int add_all_parents(struct btrfs_root *root, struct btrfs_path *path,
   struct ulist *parents, struct __prelim_ref *ref,
   int level, u64 time_seq, const u64 *extent_item_pos,
-  u64 total_refs)
+  u64 total_refs, bool ignore_offset)
 {
int ret = 0;
int slot;
@@ -564,7 +567,7 @@ static int add_all_parents(struct btrfs_root *root, struct 
btrfs_path *path,
if (extent_item_pos) {
ret = check_extent_in_eb(, eb, fi,
*extent_item_pos,
-   );
+   , ignore_offset);

[PATCH 2/3] btrfs: add a flags argument to LOGICAL_INO and call it LOGICAL_INO_V2

Now that check_extent_in_eb()'s extent offset filter can be turned off,
we need a way to do it from userspace.

Add a 'flags' field to the btrfs_logical_ino_args structure to disable extent
offset filtering, taking the place of one of the reserved[] fields.

Previous versions of LOGICAL_INO neglected to check whether any of the
reserved fields have non-zero values.  Assigning meaning to those fields
now may change the behavior of existing programs that left these fields
uninitialized.

To avoid any surprises, define a new ioctl LOGICAL_INO_V2 which uses
the same argument layout as LOGICAL_INO, but uses one of the reserved
fields for flags.  The V2 ioctl explicitly checks that unsupported flag
bits are zero so that userspace can probe for future feature bits as
they are defined.  If the other reserved fields are used in the future,
one of the remaining flag bits could specify that the other reserved
fields are valid, so we don't need to check those for now.

Since the memory layouts and behavior of the two ioctls' arguments
are almost identical, there is no need for a separate function for
logical_to_ino_v2 (contrast with tree_search_v2 vs tree_search).
A version parameter and an 'if' statement will suffice.

Now that we have a flags field in logical_ino_args, add a flag
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET to get the behavior we want,
and pass it down the stack to iterate_inodes_from_logical.

Signed-off-by: Zygo Blaxell 
---
 fs/btrfs/ioctl.c   | 21 ++---
 include/uapi/linux/btrfs.h |  8 +++-
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index c6787660d91f..def0ab85134a 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4542,13 +4542,14 @@ static int build_ino_list(u64 inum, u64 offset, u64 
root, void *ctx)
 }
 
 static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info,
-   void __user *arg)
+   void __user *arg, int version)
 {
int ret = 0;
int size;
struct btrfs_ioctl_logical_ino_args *loi;
struct btrfs_data_container *inodes = NULL;
struct btrfs_path *path = NULL;
+   bool ignore_offset;
 
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
@@ -4557,6 +4558,17 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
if (IS_ERR(loi))
return PTR_ERR(loi);
 
+   if (version == 1) {
+   ignore_offset = false;
+   } else {
+   /* Only accept flags we have defined so far */
+   if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) {
+   ret = -EINVAL;
+   goto out_loi;
+   }
+   ignore_offset = loi->flags & 
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+   }
+
path = btrfs_alloc_path();
if (!path) {
ret = -ENOMEM;
@@ -4572,7 +4584,7 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
}
 
ret = iterate_inodes_from_logical(loi->logical, fs_info, path,
- build_ino_list, inodes, false);
+ build_ino_list, inodes, 
ignore_offset);
if (ret == -EINVAL)
ret = -ENOENT;
if (ret < 0)
@@ -4586,6 +4598,7 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
 out:
btrfs_free_path(path);
vfree(inodes);
+out_loi:
kfree(loi);
 
return ret;
@@ -5559,7 +5572,9 @@ long btrfs_ioctl(struct file *file, unsigned int
case BTRFS_IOC_INO_PATHS:
return btrfs_ioctl_ino_to_path(root, argp);
case BTRFS_IOC_LOGICAL_INO:
-   return btrfs_ioctl_logical_to_ino(fs_info, argp);
+   return btrfs_ioctl_logical_to_ino(fs_info, argp, 1);
+   case BTRFS_IOC_LOGICAL_INO_V2:
+   return btrfs_ioctl_logical_to_ino(fs_info, argp, 2);
case BTRFS_IOC_SPACE_INFO:
return btrfs_ioctl_space_info(fs_info, argp);
case BTRFS_IOC_SYNC: {
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index a456e5309238..a23555026994 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -591,10 +591,14 @@ struct btrfs_ioctl_ino_path_args {
 struct btrfs_ioctl_logical_ino_args {
__u64   logical;/* in */
__u64   size;   /* in */
-   __u64   reserved[4];
+   __u64   flags;  /* in, v2 only */
+   __u64   reserved[3];
/* struct btrfs_data_container  *inodes;out   */
__u64   inodes;
 };
+/* Return every ref to the extent, not just those containing logical block.
+ * Requires logical ==

[PATCH 3/3] btrfs: increase output size for LOGICAL_INO_V2 ioctl

Build-server workloads have hundreds of references per file after dedup.
Multiply by a few snapshots and we quickly exhaust the limit of 2730
references per extent that can fit into a 64K buffer.

Raise the limit to 16M to be consistent with other btrfs ioctls
(e.g. TREE_SEARCH_V2, FILE_EXTENT_SAME).

To minimize surprising userspace behavior, apply this change only to
the LOGICAL_INO_V2 ioctl.

Signed-off-by: Zygo Blaxell 
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index def0ab85134a..e13fea25ecb8 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4560,6 +4560,7 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
 
if (version == 1) {
ignore_offset = false;
+   size = min_t(u32, loi->size, SZ_64K);
} else {
/* Only accept flags we have defined so far */
if (loi->flags & ~(BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET)) {
@@ -4567,6 +4568,7 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
goto out_loi;
}
ignore_offset = loi->flags & 
BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET;
+   size = min_t(u32, loi->size, SZ_16M);
}
 
path = btrfs_alloc_path();
@@ -4575,7 +4577,6 @@ static long btrfs_ioctl_logical_to_ino(struct 
btrfs_fs_info *fs_info,
goto out;
}
 
-   size = min_t(u32, loi->size, SZ_64K);
inodes = init_data_container(size);
if (IS_ERR(inodes)) {
ret = PTR_ERR(inodes);
-- 
2.11.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: ERROR: parent determination failed (btrfs send-receive)

2017-09-20 Thread Dave

On Tue, Sep 19, 2017 at 3:37 PM, Andrei Borzenkov  wrote:
> 18.09.2017 09:10, Dave пишет:
>> I use snap-sync to create and send snapshots.
>>
>> GitHub - wesbarnett/snap-sync: Use snapper snapshots to backup to external 
>> drive
>> https://github.com/wesbarnett/snap-sync
>>
>
> Are you trying to backup top-level subvolume?

No. I never mount the top-level subvolume.

>  I just reproduced this > behavior with this tool. The problem is, snapshots 
> of top-level
> subvolume do not have parent UUID (I am not even sure if UUID exists at
> all TBH). If you mount any other subvolume, it will work.

Well, that's not exactly my issue but it still helps me gain
additional understanding.  It may help me reproduce my original
problem too. Thank you.

>
>
> As I told you in the first reply, showing output of "btrfs su li -qu
> /path/to/src" would explain your problem much earlier.
>
> Actually if snap-sync used "btrfs send -p" instead of "btrfs send -c" it
> would work as well, as then no parent search would be needed (and as I
> mentioned in another mail both commands are functionally equivalent).
> But this becomes really off-topic on this list. As already suggested,
> open issue for snap-sync.

Thank you for this also. I will simply edit my current copy of the
snap-sync script to use "-p" now. That's a simple change I am
implement immediately.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: RFC: Btrfs does not store checksum type? Only size? O_o




On 2017年09月21日 04:31, Timofey Titovets wrote:

I was reading the kernel code and docs and didn't find anything about
some info like "cheksum type" as btrfs have for compression code.
(I only found some silent "hack" in btrfs-progs, that print "checksum
type", but get "checksum size")


Checksum type is global, so one btrfs fs uses the same checksum type for 
all its meta and data.


So checksum type is in superblock. Check btrfs_super_block->csum_type.



If i understood all correctly btrfs only store checksum size, i.e. 4
for current CRC32C.


Yes.



May be that can be useful to rewrite some part of code and reuse
checksum size, as checksum type.

For backward compatibility current CRC32C size can be used for identify CRC32C.


Since we have superblock csum type, why introducing new one?
(Unless you're going to allow multi csum algorithm to co-exist in one 
btrfs fs)


Thanks,
Qu

As that will not add any new checksum, that allow make that with zero "blood".
And later if we add new checksum type, that possible to add compat
flag for that.

What you think?

Thanks.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]




On 2017年09月21日 02:10, Rich Rauenzahn wrote:


On 9/20/2017 9:58 AM, Rich Rauenzahn wrote:


What's the most direct way to do that?  (Was about to risk breaking 
the mirror and repartitioning!  I'd rather not!)


Hmm -- maybe this worked:

$ sudo btrfs filesystem resize -1m /.MEDIA/
Resize '/.MEDIA/' of '-1m'
No, doesn't seem to have worked.  I used btrfs fi show --raw /.MEDIA to 
see the size in bytes, and then ...


sudo btrfs filesystem resize N:SIZE /.MEDIA/



Yes, this is what I did to my fs.
While in fact you can do it better by using:

btrfs fi resize N:max 

Or

btrfs fi resize N:-1 

The first one will resize the devid N to its max size.
The 2nd one will resize the devid N to current size - 1 (byte).

Both will go through shrink routine, which will do the round_down.

Thanks,
Qu

Where N was the disk number in the show output, and SIZE was the number 
of bytes to set it to that was a multiple of 4k.


Rich

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Does btrfs use crc32 for error correction?

2017-09-19 19:51 GMT+03:00 Eric Sandeen :
> On 9/19/17 10:35 AM, Timofey Titovets wrote:
>> Stupid question:
>> Does btrfs use crc32 for error correction?
>> If no, why?
>>
>> (AFAIK if using CRC that possible to fix 1 bit flip)
>>
>> P.S. I try check that (i create image, create text file, flip bit, try
>> read and btrfs show IO-error)
>>
>> Thanks!
>
> I wasn't aware that crc32 could (in general) be used for single bit 
> correction; I've read up on that, and it seems pretty cool.
>
> However, I don't think that the generator polynomial used in crc32c /can/ be 
> used for error correction.  I just skimmed some reading, but that seems to be 
> the case if I understand it correctly.
>
> -Eric

That possible, but if i understood doc about CRC16 bit correction
correctly, that need some tricks (As example preprocess all possible
bit flips and have a map of that) for fast find bit error.
Also in theory for CRC32C that possible with some state machine, to
move algo backward and forward over data.

I also check [1], so CRC32C (for our case) have hamming distance about 4.
So CRC32 in theory can fix up to HD-1 ~= 3 bit (If i understand all
things correctly);

So that possible to just brute force over data and flip 1-3 bit, and
try check if CRC fixed %)
In theory that safe for 1-2 bit error (Because again HD for input
data) (Proof of concept[2]).

Thanks.

1. https://users.ece.cmu.edu/~koopman/networks/dsn02/dsn02_koopman.pdf
2. 
https://github.com/Nefelim4ag/CRC32C_AND_8Byte_parity/commit/cb0fe76eac2ff036468bbc5ea75ebc06cb1e9928

-- 
Have a nice day,
Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD caching an existing btrfs raid1

2017-09-20 Thread Kai Krakow

Am Wed, 20 Sep 2017 17:51:15 +0200
schrieb Psalle :

> On 19/09/17 17:47, Austin S. Hemmelgarn wrote:
> (...)
> >
> > A better option if you can afford to remove a single device from
> > that array temporarily is to use bcache.  Bcache has one specific
> > advantage in this case, multiple backend devices can share the same
> > cache device. This means you don't have to carve out dedicated
> > cache space for each disk on the SSD and leave some unused space so
> > that you can add new devices if needed.  The downside is that you
> > can't convert each device in-place, but because you're using BTRFS,
> > you can still convert the volume as a whole in-place.  The
> > procedure for doing so looks like this:
> >
> > 1. Format the SSD as a bcache cache.
> > 2. Use `btrfs device delete` to remove a single hard drive from the 
> > array.
> > 3. Set up the drive you just removed as a bcache backing device
> > bound to the cache you created in step 1.
> > 4. Add the new bcache device to the array.
> > 5. Repeat from step 2 until the whole array is converted.
> >
> > A similar procedure can actually be used to do almost any
> > underlying storage conversion (for example, switching to whole disk
> > encryption, or adding LVM underneath BTRFS) provided all your data
> > can fit on one less disk than you have.  
> 
> Thanks Austin, that's just great. For some reason I had discarded
> bcache thinking that it would force me to rebuild from scratch, but
> this kind of incremental migration is exactly why I hoped was
> possible. I have plenty of space to replace the devices one by one.
> 
> I will report back my experience in a few days, I hope.

I've done it exactly that way in the past and it worked flawlessly (but
it took 24+ hours). But it was easy for me because I was also adding a
third disk to the pool, so existing stuff could easily move.

I suggest to initialize bcache to writearound mode while converting, so
your maybe terabytes of disk don't go through the SSD.

If you later decide to remove bcache or not sure about future bcache
usage, you can wrap any partition into a bcache container - just don't
connect it to a cache and it will work like a normal partition.


-- 
Regards,
Kai

Replies to list-only preferred.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: RFC: Btrfs does not store checksum type? Only size? O_o

2017-09-20 Thread Hans van Kranenburg

On 09/20/2017 10:31 PM, Timofey Titovets wrote:
> I was reading the kernel code and docs and didn't find anything about
> some info like "cheksum type" as btrfs have for compression code.
> (I only found some silent "hack" in btrfs-progs, that print "checksum
> type", but get "checksum size")
> 
> If i understood all correctly btrfs only store checksum size, i.e. 4
> for current CRC32C.

The checksum type is set in the superblock:

csum_type   0 (crc32c)
csum_size   4
[...]

So in the current design, you probably could have different checksum
types, but you will have to choose one of them when creating a filesystem.

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RFC: Btrfs does not store checksum type? Only size? O_o

I was reading the kernel code and docs and didn't find anything about
some info like "cheksum type" as btrfs have for compression code.
(I only found some silent "hack" in btrfs-progs, that print "checksum
type", but get "checksum size")

If i understood all correctly btrfs only store checksum size, i.e. 4
for current CRC32C.

May be that can be useful to rewrite some part of code and reuse
checksum size, as checksum type.

For backward compatibility current CRC32C size can be used for identify CRC32C.
As that will not add any new checksum, that allow make that with zero "blood".
And later if we add new checksum type, that possible to add compat
flag for that.

What you think?

Thanks.
-- 
Have a nice day,
Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: kernel BUG at fs/btrfs/extent_io.c:1989

On Wed, Sep 20, 2017 at 02:53:57PM +0200, David Sterba wrote:
> On Tue, Sep 19, 2017 at 10:12:39AM -0600, Liu Bo wrote:
> > On Tue, Sep 19, 2017 at 05:07:25PM +0200, David Sterba wrote:
> > > On Tue, Sep 19, 2017 at 11:32:46AM +, Paul Jones wrote:
> > > > > This 'mirror 0' looks fishy, (as mirror comes from 
> > > > > btrfs_io_bio->mirror_num,
> > > > > which should be at least 1 if raid1 setup is in use.)
> > > > > 
> > > > > Not sure if 4.13.2-gentoo made any changes on btrfs, but can you 
> > > > > please
> > > > > verify with the upstream kernel, say, v4.13?
> > > > 
> > > > It's basically a vanilla kernel with a handful of unrelated patches.
> > > > The filesystem fell apart overnight, there were a few thousand
> > > > checksum errors and eventually it went read-only. I tried to remount
> > > > it, but got open_ctree failed. Btrfs check segfaulted, lowmem mode
> > > > completed with so many errors I gave up and will restore from the
> > > > backup.
> > > > 
> > > > I think I know the problem now - the lvm cache was in writeback mode
> > > > (by accident) so during a defrag there would be gigabytes of unwritten
> > > > data in memory only, which was all lost when the system crashed
> > > > (motherboard failure). No wonder the filesystem didn't quite survive.
> > > 
> > > Yeah, the caching layer was my first suspicion, and lack of propagating
> > > of the barriers. Good that you were able to confirm that as the root 
> > > cause.
> > > 
> > > > I must say though, I'm seriously impressed at the data integrity of
> > > > BTRFS - there were near 10,000 checksum errors, 4 which were
> > > > uncorrectable, and from what I could tell nearly all of the data was
> > > > still intact according to rsync checksums.
> > > 
> > > Yay!
> > 
> > But still don't get why mirror_num is 0, do you have an idea on how
> > does writeback cache make that?
> 
> My first idea was that the cached blocks were zeroed, so we'd see the ino
> and mirror as 0. But this is not correct as the blocks would not pass
> the checksum tests, so the blocks must be from some previous generation.
> Ie. the transid verify failure. And all the error reports appear after
> that so I'm slightly suspicious about the way it's actually reported.
> 
> btrfs_print_data_csum_error takes mirror from either io_bio or
> compressed_bio structures, so there might be a case when the structures
> are initialized. If the transid check is ok, then the structures are
> updated. If the check fails we'd see the initial mirror number. All of
> that is just a hypothesis, I haven't checked with the code.
>

Thanks a lot for the input, you're right, mirror_num 0 should come
from compressed read where it doesn't record the bbio->mirror_num but
the mirror passing from the upper layer, and it's not metadata as we
don't yet compress metadata, so this all makes sense.

I think it also disables the ability of read-repair from raid1 for
compressed data, and that's what caused the bug where it hits
BUG_ON(mirror_num == 0) in cleanup_io_failure().

The good news is that I can reproduce it, will send a patch and a
testcase.

> I don't have a theoretical explanation for the ino 0. The inode pointer
> that goes to btrfs_print_data_csum_error should be from a properly
> initialized inode and we print the number using btrfs_ino. That will use
> the vfs i_ino value and we should never get 0 out of that.

ino 0 comes from metadata read-repair, some cleanup may be needed to
make it less confusing.

thanks,
-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: difference between -c and -p for send-receive?

2017-09-20 Thread Andrei Borzenkov

20.09.2017 22:05, Antoine Belvire пишет:
> Hello,
> 
>> All snapshots listed in -c options and snapshot that we want to
>> transfer must have the same parent uuid, unless -p is explicitly
>> provided.
> 
> It's rather the same mount point than the same parent uuid, like cp
> --reflink, isn't it?

Sorry, I do not understand this sentence. Could you rephrase?

> 
> ~# btrfs subvolume create /test2/
> Create subvolume '//test2'
> ~# btrfs subvolume create /test2/foo
> Create subvolume '/test2/foo'
> ~# cd /test2   
> ~# btrfs subvolume snapshot -r . .1
> Create a readonly snapshot of '.' in './.1'
> ~#
> ~# # a: 40 MiB in /test2/
> ~# dd if=/dev/urandom of=a bs=4k count=10k
> 10240+0 records in
> 10240+0 records out
> 41943040 bytes (42 MB, 40 MiB) copied, 0.198961 s, 211 MB/s
> ~#
> ~# # b: 80 MiB in /test2/foo
> ~# dd if=/dev/urandom of=foo/b bs=4k count=20k
> 0480+0 records in
> 20480+0 records out
> 83886080 bytes (84 MB, 80 MiB) copied, 0.393823 s, 213 MB/s
> ~#
> ~# # copy-clone /test2/foo/b to /test2/b
> ~# cp --reflink foo/b .
> ~#
> ~# btrfs subvolume -s . .2
> Create a readonly snapshot of '.' in './.2'
> ~#
> ~# # Sending .2 with only .1 as parent (.1 already sent)
> ~# btrfs send -p .1 .2 | wc -c
> At subvol .2
> 125909258 # 120 Mio = 'a' + 'b'
> ~#
> ~# # Sending .2 with .1 and foo as clone sources (.1 and foo already
> ~# # sent), .1 is automatically picked as parent
> ~# btrfs property set foo ro true
> ~# btrfs send -c .1 -c foo .2 | wc -c
> At subvol .2
> 41970349 # 40 Mio, only 'a'
> ~#
> 
> UUIDs on the sending side:
> 
> ~# btrfs subvolume list -uq / | grep test2
> ID 6141 gen 454658 top level 6049 parent_uuid - uuid
> bbf936dd-ca84-f749-9b9b-09f7081879a2 path test2
> ID 6142 gen 454658 top level 6141 parent_uuid - uuid
> 54a7cdea-6198-424a-9349-8116172d0c17 path test2/foo

Yes, sorry, I misread the code. We need at least one snapshot from -c
option that has the same parent uuid (i.e. is snapshot of the same
subvolume) as the snapshot we want to transfer, not all of them. In your
case

btrfs send -c .1 -c foo .2

it will select .1 as base snapshot and additionally try to clone from
foo if possible. In wiki example the only two snapshots are from
completely different subvolumes which will fail. In discussion that lead
to this mail snapshots probably did not have any parent uuid at all.

> ID 6143 gen 454655 top level 6141 parent_uuid
> bbf936dd-ca84-f749-9b9b-09f7081879a2 uuid
> 28f1d7db-7341-f545-a2ac-d8819d22a5b5 path test2/.1
> ID 6144 gen 454658 top level 6141 parent_uuid
> bbf936dd-ca84-f749-9b9b-09f7081879a2 uuid
> db9ad2b1-aee1-544c-b368-d698b4a05119 path test2/.2
> ~#
> 
> On the receiving side, .1 is used as parent:
> 
> ~# btrfs subvolume list -uq /var/run/media/antoine/backups/ | grep dest
> ID 298 gen 443 top level 5 parent_uuid - uuid
> 7695cba7-dfbf-2f44-bd79-18c9820fdb2f path dest/.1
> ID 299 gen 443 top level 5 parent_uuid - uuid
> c32c06ec-0a17-cf42-9b04-3804ad72f836 path dest/foo
> ID 300 gen 446 top level 5 parent_uuid
> 7695cba7-dfbf-2f44-bd79-18c9820fdb2f uuid
> 552b1d51-38bf-d546-a47f-4bc667ec4128 path dest/.2
> ~#
> 
> Regards,
> 
> -- 
> Antoine

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: difference between -c and -p for send-receive?

2017-09-20 Thread Antoine Belvire


Hello,

> All snapshots listed in -c options and snapshot that we want to
> transfer must have the same parent uuid, unless -p is explicitly
> provided.

It's rather the same mount point than the same parent uuid, like cp 
--reflink, isn't it?


~# btrfs subvolume create /test2/
Create subvolume '//test2'
~# btrfs subvolume create /test2/foo
Create subvolume '/test2/foo'
~# cd /test2
~# btrfs subvolume snapshot -r . .1
Create a readonly snapshot of '.' in './.1'
~#
~# # a: 40 MiB in /test2/
~# dd if=/dev/urandom of=a bs=4k count=10k
10240+0 records in
10240+0 records out
41943040 bytes (42 MB, 40 MiB) copied, 0.198961 s, 211 MB/s
~#
~# # b: 80 MiB in /test2/foo
~# dd if=/dev/urandom of=foo/b bs=4k count=20k
0480+0 records in
20480+0 records out
83886080 bytes (84 MB, 80 MiB) copied, 0.393823 s, 213 MB/s
~#
~# # copy-clone /test2/foo/b to /test2/b
~# cp --reflink foo/b .
~#
~# btrfs subvolume -s . .2
Create a readonly snapshot of '.' in './.2'
~#
~# # Sending .2 with only .1 as parent (.1 already sent)
~# btrfs send -p .1 .2 | wc -c
At subvol .2
125909258 # 120 Mio = 'a' + 'b'
~#
~# # Sending .2 with .1 and foo as clone sources (.1 and foo already
~# # sent), .1 is automatically picked as parent
~# btrfs property set foo ro true
~# btrfs send -c .1 -c foo .2 | wc -c
At subvol .2
41970349 # 40 Mio, only 'a'
~#

UUIDs on the sending side:

~# btrfs subvolume list -uq / | grep test2
ID 6141 gen 454658 top level 6049 parent_uuid - uuid 
bbf936dd-ca84-f749-9b9b-09f7081879a2 path test2
ID 6142 gen 454658 top level 6141 parent_uuid - uuid 
54a7cdea-6198-424a-9349-8116172d0c17 path test2/foo
ID 6143 gen 454655 top level 6141 parent_uuid 
bbf936dd-ca84-f749-9b9b-09f7081879a2 uuid 
28f1d7db-7341-f545-a2ac-d8819d22a5b5 path test2/.1
ID 6144 gen 454658 top level 6141 parent_uuid 
bbf936dd-ca84-f749-9b9b-09f7081879a2 uuid 
db9ad2b1-aee1-544c-b368-d698b4a05119 path test2/.2

~#

On the receiving side, .1 is used as parent:

~# btrfs subvolume list -uq /var/run/media/antoine/backups/ | grep dest
ID 298 gen 443 top level 5 parent_uuid - uuid 
7695cba7-dfbf-2f44-bd79-18c9820fdb2f path dest/.1
ID 299 gen 443 top level 5 parent_uuid - uuid 
c32c06ec-0a17-cf42-9b04-3804ad72f836 path dest/foo
ID 300 gen 446 top level 5 parent_uuid 
7695cba7-dfbf-2f44-bd79-18c9820fdb2f uuid 
552b1d51-38bf-d546-a47f-4bc667ec4128 path dest/.2

~#

Regards,

--
Antoine
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]

2017-09-20 Thread Rich Rauenzahn



On 9/20/2017 9:58 AM, Rich Rauenzahn wrote:


What's the most direct way to do that?  (Was about to risk breaking 
the mirror and repartitioning!  I'd rather not!)


Hmm -- maybe this worked:

$ sudo btrfs filesystem resize -1m /.MEDIA/
Resize '/.MEDIA/' of '-1m'
No, doesn't seem to have worked.  I used btrfs fi show --raw /.MEDIA to 
see the size in bytes, and then ...


sudo btrfs filesystem resize N:SIZE /.MEDIA/

Where N was the disk number in the show output, and SIZE was the number 
of bytes to set it to that was a multiple of 4k.


Rich
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]

2017-09-20 Thread Rich Rauenzahn




On 9/19/2017 10:39 PM, Qu Wenruo wrote:


In v4.13 kernel, the newly added devices are in fact rounded down.
But existing device doesn't get the round down.

So it's recommended to resize (shrink) your fs for very small size to 
fix it if you don't want to wait for the kernel fix.




What's the most direct way to do that?  (Was about to risk breaking the 
mirror and repartitioning!  I'd rather not!)


Hmm -- maybe this worked:

$ sudo btrfs filesystem resize -1m /.MEDIA/
Resize '/.MEDIA/' of '-1m'


Thanks,
Qu


So I'm wondering if it's caused by added new btrfs device.



Possibly... I've added and removed disks to some of the btrfs fs's.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]

2017-09-20 Thread Rich Rauenzahn

For the warning -- is there anyway to add in the filesystem/disk causing 
the issue?  I didn't see any identifier in the message that told me 
which it was.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD caching an existing btrfs raid1

2017-09-20 Thread Psalle




On 19/09/17 17:47, Austin S. Hemmelgarn wrote:
(...)


A better option if you can afford to remove a single device from that 
array temporarily is to use bcache.  Bcache has one specific advantage 
in this case, multiple backend devices can share the same cache 
device. This means you don't have to carve out dedicated cache space 
for each disk on the SSD and leave some unused space so that you can 
add new devices if needed.  The downside is that you can't convert 
each device in-place, but because you're using BTRFS, you can still 
convert the volume as a whole in-place.  The procedure for doing so 
looks like this:


1. Format the SSD as a bcache cache.
2. Use `btrfs device delete` to remove a single hard drive from the 
array.
3. Set up the drive you just removed as a bcache backing device bound 
to the cache you created in step 1.

4. Add the new bcache device to the array.
5. Repeat from step 2 until the whole array is converted.

A similar procedure can actually be used to do almost any underlying 
storage conversion (for example, switching to whole disk encryption, 
or adding LVM underneath BTRFS) provided all your data can fit on one 
less disk than you have.


Thanks Austin, that's just great. For some reason I had discarded bcache 
thinking that it would force me to rebuild from scratch, but this kind 
of incremental migration is exactly why I hoped was possible. I have 
plenty of space to replace the devices one by one.


I will report back my experience in a few days, I hope.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]




On 2017年09月20日 22:14, nborisov wrote:

On 2017-09-20 08:23, Qu Wenruo wrote:

On 2017年09月20日 14:11, nborisov wrote:

On 2017-09-20 07:39, Qu Wenruo wrote:

On 2017年09月20日 13:10, Qu Wenruo wrote:



On 2017年09月20日 12:59, Qu Wenruo wrote:



On 2017年09月20日 12:49, Rich Rauenzahn wrote:



On 9/19/2017 5:31 PM, Qu Wenruo wrote:

On 2017年09月19日 23:56, Rich Rauenzahn wrote:

[    4.747377] WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559
btrfs_update_device+0x1c5/0x1d0 [btrfs]


Is that line the following WARN_ON()?
---
static inline void btrfs_set_device_total_bytes(struct 
extent_buffer *eb,

    struct btrfs_dev_item *s,
    u64 val)
{
BUILD_BUG_ON(sizeof(u64) !=
 sizeof(((struct btrfs_dev_item *)0))->total_bytes);
WARN_ON(!IS_ALIGNED(val, eb->fs_info->sectorsize)); <<<
btrfs_set_64(eb, s, offsetof(struct btrfs_dev_item, 
total_bytes), val);

}
---

If so, that means your devices size is not aligned to 4K.

Is your block device still using old 512 block size?
AFAIK nowadays most HDDs are using 4K blocksize and it's 
recommended to use it.


It's not a big problem and one can easily remove the WARN_ON().
But I think we'd better fix the caller to do round_down() before 
calling this function.




That's interesting!  I believe I made an effort to align them 
when I set it up years ago, but never knew how to verify.


Well, best verifying if that's the line causing the warning, since 
I don't have the source of RedHat kernel.




I have three mirrored filesystems:


[snip]


Number  Start (sector)    End (sector)  Size   Code Name
    1  40  3907029134   1.8 TiB 8300 BTRFS MEDIA
GPT fdisk (gdisk) version 0.8.6


At least this size is not aligned to 4K.



Partition table scan:

[snip]


.and one is aligned differently!

Could it be /dev/sdd that's the issue?  But it's aligned at 4096 
-- so I'm not sure that's the issue after all.


Its start sector is aligned, but end point is not, so the size is 
not aligned either.


BTW, is /dev/sdd added to btrfs using "btrfs device add"?
In my test, if making btrfs on a unaligned file, it will round 
down to its sectorsize boundary.


Confirmed that "btrfs device add" won't round down the size.
Check the btrfs-debug-tree output:
--
 item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
 devid 1 total_bytes 10737418240 bytes_used 2172649472
 io_align 4096 io_width 4096 sector_size 4096 type 0
 generation 0 start_offset 0 dev_group 0
 seek_speed 0 bandwidth 0
 uuid 243a1117-ca31-4d87-8656-81c5630aafb2
 fsid 6452cde7-14d5-4541-aa07-b265a400bad0
 item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98
 devid 2 total_bytes 1073742336 bytes_used 0
 io_align 4096 io_width 4096 sector_size 4096 type 0
 generation 0 start_offset 0 dev_group 0
 seek_speed 0 bandwidth 0
 uuid 6bb07260-d230-4e22-88b1-1eabb46622ed
 fsid 6452cde7-14d5-4541-aa07-b265a400bad0
--


Sorry, the output is from v4.12.x, so no kernel warning nor the patch
rounding down the value.



Where first device is completely aligned, the 2nd device which is 
just 1G + 512, definitely not aligned.


So if you're using single device purely created by mkfs.btrfs, 
you're OK.
But if any new device added, you're not OK and causing the false 
alert.


Any way, it should not be hard to fix.
Just remove the WARN_ON() and add extra round_down when adding device.


In v4.13 kernel, the newly added devices are in fact rounded down.
But existing device doesn't get the round down.


We got a report internally at Suse of this problem and it prevented a 
filesystem from being mounted due to the
following check failing: 
http://elixir.free-electrons.com/linux/latest/source/fs/btrfs/volumes.c#L6893. 
Hence I added the rounding down fixes. And this warning was put 
specifically to catch future offenders and see
if I had missed a place to patch it. So removing the warning is the 
wrong solution to the problem.


Then at least only enable it for BTRFS_DEBUG.


No, the idea is that if a bug in the code causes such a "corruption" we 
ought to be able to catch it in the first instance and not post factum. 
If a bug in btrfs is introduced in a call path which invokes 
btrfs_update_device with misaligned values we won't see which was the 
culprit. If one then enables BTRFS_DEBUG and starts seeing the warnings 
it will likely yield no useful information since the value will already 
be corrupted.


I totally understand the fix and introduced WARN_ON().

But the truth is, the fix (along with Liu Bo's validation checker) 
doesn't consider existing fs with unaligned bytenr.


No user would be happy to see tons of kernel backtrace just because they 
updated their kernel.






For end user it's just confusing.

I have submitted a

Re: btrfs-progs: suggestion of removing --commit-after option of subvol delete




On 2017年09月20日 22:03, David Sterba wrote:

On Wed, Sep 20, 2017 at 08:22:54AM +0800, Qu Wenruo wrote:

The costly part will be tracking the filesystems of subvolumes.
We must do it for each subvolume and batch them to address the final
transaction commit for each fs.

I didn't see any easy ioctl to get the UUID from fd, meaning (if didn't
miss anything) we need to iterate the path until reaching the mount
boundary and then refer to mountinfo to find the fs.


The FS_INFO ioctl will tell you the fsid from a fd.  Iterating the paths
will not work reliably, due to some potentially very creative use of
mounts that could build up the path components.


The biggest concern is about how to get fsid reliably.
And since FS_INFO ioctl solves this, I'm completely fine.
(I just search "UUID" for ioctls and get no hit, so I assume it may be a 
problem)





Not to mention that the possible softlink in the path may make things
more complex.

Yes, this may be fixed with tons of code, but I don't think the
complexity worthy for a minor feature.


So what if we fix it like that:

- iterate over subvolumes
- check if the fsid is same as for the previous subvolume
   - if it is, continue
   - if not, do the sync


I prefer to maintain a subvolume <-> fs mapping, and only address 
transaction commit after all subvolumes are iterated.




IOW, sync when we leave subvolumes of one filesystem. In the degenerate
case, we can have subvolumes interleaved that we'd sync after each,
effectively --commit-each.


Since we have a subvolume <-> fs mapping, we can commit each fs after 
iterating all subvolumes.


AFAIK "--commit-after" user would like to introduce the minimal commits, 
so each fs only get committed once seems better to me.




The simple tracking should avoid keeping all the filedescriptors open
and I think this won't need tons of code to implement.


Indeed, since there is reliable fs UUID ioctl.

And above subvolume <-> fs mapping can also be used to resolve missing 
fd problems.
Just adding a fds <-> fs mapping, and try to call commit ioctl on first 
good fd.




The typical could be 'btrfs subvol delete -c path/*', ie. deleting from
just one filesystem. If we happen to cross more filesystems, each of
them will get the sync, but possibly more than one. I don't think this
is too serious.


Well, after we implemented above user case, some facility can be reused 
for this case.


Thanks,
Qu

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]

2017-09-20 Thread nborisov


On 2017-09-20 08:23, Qu Wenruo wrote:

On 2017年09月20日 14:11, nborisov wrote:

On 2017-09-20 07:39, Qu Wenruo wrote:

On 2017年09月20日 13:10, Qu Wenruo wrote:



On 2017年09月20日 12:59, Qu Wenruo wrote:



On 2017年09月20日 12:49, Rich Rauenzahn wrote:



On 9/19/2017 5:31 PM, Qu Wenruo wrote:

On 2017年09月19日 23:56, Rich Rauenzahn wrote:

[    4.747377] WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559
btrfs_update_device+0x1c5/0x1d0 [btrfs]


Is that line the following WARN_ON()?
---
static inline void btrfs_set_device_total_bytes(struct 
extent_buffer *eb,

    struct btrfs_dev_item *s,
    u64 val)
{
BUILD_BUG_ON(sizeof(u64) !=
 sizeof(((struct btrfs_dev_item *)0))->total_bytes);
WARN_ON(!IS_ALIGNED(val, eb->fs_info->sectorsize)); <<<
btrfs_set_64(eb, s, offsetof(struct btrfs_dev_item, 
total_bytes), val);

}
---

If so, that means your devices size is not aligned to 4K.

Is your block device still using old 512 block size?
AFAIK nowadays most HDDs are using 4K blocksize and it's 
recommended to use it.


It's not a big problem and one can easily remove the WARN_ON().
But I think we'd better fix the caller to do round_down() before 
calling this function.




That's interesting!  I believe I made an effort to align them when 
I set it up years ago, but never knew how to verify.


Well, best verifying if that's the line causing the warning, since 
I don't have the source of RedHat kernel.




I have three mirrored filesystems:


[snip]


Number  Start (sector)    End (sector)  Size   Code Name
    1  40  3907029134   1.8 TiB 8300 BTRFS 
MEDIA

GPT fdisk (gdisk) version 0.8.6


At least this size is not aligned to 4K.



Partition table scan:

[snip]


.and one is aligned differently!

Could it be /dev/sdd that's the issue?  But it's aligned at 4096 
-- so I'm not sure that's the issue after all.


Its start sector is aligned, but end point is not, so the size is 
not aligned either.


BTW, is /dev/sdd added to btrfs using "btrfs device add"?
In my test, if making btrfs on a unaligned file, it will round down 
to its sectorsize boundary.


Confirmed that "btrfs device add" won't round down the size.
Check the btrfs-debug-tree output:
--
 item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
 devid 1 total_bytes 10737418240 bytes_used 
2172649472

 io_align 4096 io_width 4096 sector_size 4096 type 0
 generation 0 start_offset 0 dev_group 0
 seek_speed 0 bandwidth 0
 uuid 243a1117-ca31-4d87-8656-81c5630aafb2
 fsid 6452cde7-14d5-4541-aa07-b265a400bad0
 item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98
 devid 2 total_bytes 1073742336 bytes_used 0
 io_align 4096 io_width 4096 sector_size 4096 type 0
 generation 0 start_offset 0 dev_group 0
 seek_speed 0 bandwidth 0
 uuid 6bb07260-d230-4e22-88b1-1eabb46622ed
 fsid 6452cde7-14d5-4541-aa07-b265a400bad0
--


Sorry, the output is from v4.12.x, so no kernel warning nor the patch
rounding down the value.



Where first device is completely aligned, the 2nd device which is 
just 1G + 512, definitely not aligned.


So if you're using single device purely created by mkfs.btrfs, 
you're OK.
But if any new device added, you're not OK and causing the false 
alert.


Any way, it should not be hard to fix.
Just remove the WARN_ON() and add extra round_down when adding 
device.


In v4.13 kernel, the newly added devices are in fact rounded down.
But existing device doesn't get the round down.


We got a report internally at Suse of this problem and it prevented a 
filesystem from being mounted due to the
following check failing: 
http://elixir.free-electrons.com/linux/latest/source/fs/btrfs/volumes.c#L6893. 
Hence I added the rounding down fixes. And this warning was put 
specifically to catch future offenders and see
if I had missed a place to patch it. So removing the warning is the 
wrong solution to the problem.


Then at least only enable it for BTRFS_DEBUG.


No, the idea is that if a bug in the code causes such a "corruption" we 
ought to be able to catch it in the first instance and not post factum. 
If a bug in btrfs is introduced in a call path which invokes 
btrfs_update_device with misaligned values we won't see which was the 
culprit. If one then enables BTRFS_DEBUG and starts seeing the warnings 
it will likely yield no useful information since the value will already 
be corrupted.




For end user it's just confusing.

I have submitted a patch to do the check at mounting time, and warning
user to do shrink to fix it.
(Although still removed the WARN_ON)

I think such warning breaks backward compatibility should be as gentle
as possible for end users.


How exactly is it breaking backward compatibility?



Thanks,
Qu


Generally if
balancing kicked

Re: btrfs-progs: suggestion of removing --commit-after option of subvol delete

2017-09-20 Thread David Sterba

On Wed, Sep 20, 2017 at 08:22:54AM +0800, Qu Wenruo wrote:
> The costly part will be tracking the filesystems of subvolumes.
> We must do it for each subvolume and batch them to address the final 
> transaction commit for each fs.
> 
> I didn't see any easy ioctl to get the UUID from fd, meaning (if didn't 
> miss anything) we need to iterate the path until reaching the mount 
> boundary and then refer to mountinfo to find the fs.

The FS_INFO ioctl will tell you the fsid from a fd.  Iterating the paths
will not work reliably, due to some potentially very creative use of
mounts that could build up the path components.

> Not to mention that the possible softlink in the path may make things 
> more complex.
> 
> Yes, this may be fixed with tons of code, but I don't think the 
> complexity worthy for a minor feature.

So what if we fix it like that:

- iterate over subvolumes
- check if the fsid is same as for the previous subvolume
  - if it is, continue
  - if not, do the sync

IOW, sync when we leave subvolumes of one filesystem. In the degenerate
case, we can have subvolumes interleaved that we'd sync after each,
effectively --commit-each.

The simple tracking should avoid keeping all the filedescriptors open
and I think this won't need tons of code to implement.

The typical could be 'btrfs subvol delete -c path/*', ie. deleting from
just one filesystem. If we happen to cross more filesystems, each of
them will get the sync, but possibly more than one. I don't think this
is too serious.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: kernel BUG at fs/btrfs/extent_io.c:1989

2017-09-20 Thread David Sterba

On Tue, Sep 19, 2017 at 10:12:39AM -0600, Liu Bo wrote:
> On Tue, Sep 19, 2017 at 05:07:25PM +0200, David Sterba wrote:
> > On Tue, Sep 19, 2017 at 11:32:46AM +, Paul Jones wrote:
> > > > This 'mirror 0' looks fishy, (as mirror comes from 
> > > > btrfs_io_bio->mirror_num,
> > > > which should be at least 1 if raid1 setup is in use.)
> > > > 
> > > > Not sure if 4.13.2-gentoo made any changes on btrfs, but can you please
> > > > verify with the upstream kernel, say, v4.13?
> > > 
> > > It's basically a vanilla kernel with a handful of unrelated patches.
> > > The filesystem fell apart overnight, there were a few thousand
> > > checksum errors and eventually it went read-only. I tried to remount
> > > it, but got open_ctree failed. Btrfs check segfaulted, lowmem mode
> > > completed with so many errors I gave up and will restore from the
> > > backup.
> > > 
> > > I think I know the problem now - the lvm cache was in writeback mode
> > > (by accident) so during a defrag there would be gigabytes of unwritten
> > > data in memory only, which was all lost when the system crashed
> > > (motherboard failure). No wonder the filesystem didn't quite survive.
> > 
> > Yeah, the caching layer was my first suspicion, and lack of propagating
> > of the barriers. Good that you were able to confirm that as the root cause.
> > 
> > > I must say though, I'm seriously impressed at the data integrity of
> > > BTRFS - there were near 10,000 checksum errors, 4 which were
> > > uncorrectable, and from what I could tell nearly all of the data was
> > > still intact according to rsync checksums.
> > 
> > Yay!
> 
> But still don't get why mirror_num is 0, do you have an idea on how
> does writeback cache make that?

My first idea was that the cached blocks were zeroed, so we'd see the ino
and mirror as 0. But this is not correct as the blocks would not pass
the checksum tests, so the blocks must be from some previous generation.
Ie. the transid verify failure. And all the error reports appear after
that so I'm slightly suspicious about the way it's actually reported.

btrfs_print_data_csum_error takes mirror from either io_bio or
compressed_bio structures, so there might be a case when the structures
are initialized. If the transid check is ok, then the structures are
updated. If the check fails we'd see the initial mirror number. All of
that is just a hypothesis, I haven't checked with the code.

I don't have a theoretical explanation for the ino 0. The inode pointer
that goes to btrfs_print_data_csum_error should be from a properly
initialized inode and we print the number using btrfs_ino. That will use
the vfs i_ino value and we should never get 0 out of that.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2] fs:btrfs: return -ENOMEM on allocation failure.

2017-09-20 Thread Anand Jain




On 09/20/2017 02:17 PM, Allen Pais wrote:

v2: return -ENOMEM from btrfsic_dev_state_alloc() too.

Signed-off-by: Allen Pais 


Reviewed-by: Anand Jain 


---
  fs/btrfs/check-integrity.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 7d5a9b5..9db1e76 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -2913,7 +2913,7 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info,
state = kvzalloc(sizeof(*state), GFP_KERNEL);
if (!state) {
pr_info("btrfs check-integrity: allocation failed!\n");
-   return -1;
+   return -ENOMEM;
}
  
  	if (!btrfsic_is_initialized) {

@@ -2945,7 +2945,7 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info,
if (NULL == ds) {
pr_info("btrfs check-integrity: kmalloc() failed!\n");
mutex_unlock(_mutex);
-   return -1;
+   return -ENOMEM;
}
ds->bdev = device->bdev;
ds->state = state;


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: defragmenting best practice?

2017-09-20 Thread Austin S. Hemmelgarn


On 2017-09-20 02:38, Dave wrote:

On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:

When I do a
btrfs filesystem defragment -r /directory
does it defragment really all files in this directory tree, even if it
contains subvolumes?
The man page does not mention subvolumes on this topic.


No answer so far :-(

But I found another problem in the man-page:

  Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as
  with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4
  will break up the ref-links of COW data (for example files copied with
  cp --reflink, snapshots or de-duplicated data). This may cause
  considerable increase of space usage depending on the broken up
  ref-links.

I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
snapshots.
Therefore, I better should avoid calling "btrfs filesystem defragment -r"?

What is the defragmenting best practice?
Avoid it completly?


My question is the same as the OP in this thread, so I came here to
read the answers before asking. However, it turns out that I still
need to ask something. Should I ask here or start a new thread? (I'll
assume here, since the topic is the same.)

Based on the answers here, it sounds like I should not run defrag at
all. However, I have a performance problem I need to solve, so if I
don't defrag, I need to do something else.

Here's my scenario. Some months ago I built an over-the-top powerful
desktop computer / workstation and I was looking forward to really
fantastic performance improvements over my 6 year old Ubuntu machine.
I installed Arch Linux on BTRFS on the new computer (on an SSD). To my
shock, it was no faster than my old machine. I focused a lot on
Firefox performance because I use Firefox a lot and that was one of
the applications in which I was most looking forward to better
performance.

I tried everything I could think of and everything recommended to me
in various forums (except switching to Windows) and the performance
remained very disappointing.
Switching to Windows won't help any more than switching to ext4 would. 
If you were running Chrome, it might (Chrome actually has better 
performance on Windows than Linux by a small margin last time I 
checked), but Firefox gets pretty much the same performance on both 
platforms.


Then today I read the following:

 Gotchas - btrfs Wiki
 https://btrfs.wiki.kernel.org/index.php/Gotchas

 Fragmentation: Files with a lot of random writes can become
heavily fragmented (1+ extents) causing excessive multi-second
spikes of CPU load on systems with an SSD or large amount a RAM. On
desktops this primarily affects application databases (including
Firefox). Workarounds include manually defragmenting your home
directory using btrfs fi defragment. Auto-defragment (mount option
autodefrag) should solve this problem.

Upon reading that I am wondering if fragmentation in the Firefox
profile is part of my issue. That's one thing I never tested
previously. (BTW, this system has 256 GB of RAM and 20 cores.)
Almost certainly.  Most modern web browsers are brain-dead and insist on 
using SQLite databases (or traditional DB files) for everything, 
including the cache, and the usage for the cache in particular kills 
performance when fragmentation is an issue.


Furthermore, on the same BTRFS Wiki page, it mentions the performance
penalties of many snapshots. I am keeping 30 to 50 snapshots of the
volume that contains the Firefox profile.

Would these two things be enough to turn top-of-the-line hardware into
a mediocre-preforming desktop system? (The system performs fine on
benchmarks -- it's real life usage, particularly with Firefox where it
is disappointing.)
Even ignoring fragmentation and reflink issues (it's reflinks, not 
snapshots that are the issue, snapshots just have tons of reflinks), 
BTRFS is slower than ext4 or XFS simply because of the fact that it's 
doing way more work.  The difference should have limited impact on an 
SSD if you get a handle on the other issues though.


After reading the info here, I am wondering if I should make a new
subvolume just for my Firefox profile(s) and not use COW and/or not
keep snapshots on it and mount it with the autodefrag option.

As part of this strategy, I could send snapshots to another disk using
btrfs send-receive. That way I would have the benefits of snapshots
(which are important to me), but by not keeping any snapshots on the
live subvolume I could avoid the performance problems.

What would you guys do in this situation?
Personally?  Use Chrome or Chromium and turn on the simple cache backend 
(chrome://flags/#enable-simple-cache-backend) which doesn't have issues 
with fragmentation because it doesn't use a database file to store the 
cache and lets the filesystem handle the allocations.  The difference in 
performance in Chrome itself from flipping this switch is pretty amazing 
to be honest.  They're also faster than Firefox in general in my 
experience, but

Re: using fio to test btrfs compression

On Wed, Sep 20, 2017 at 4:00 PM, Timofey Titovets  wrote:
> 2017-09-20 13:13 GMT+03:00 shally verma :
>> On Wed, Sep 20, 2017 at 3:06 PM, Timofey Titovets  
>> wrote:
>>> 2017-09-20 11:59 GMT+03:00 shally verma :
 On Wed, Sep 20, 2017 at 2:25 PM, Timofey Titovets  
 wrote:
> 2017-09-20 11:44 GMT+03:00 shally verma :
>> One more catch... I am initiating fio from non-btrfs filesystem i.e.
>> pwd is ext4 based fs where as mount point is btrfs.
>> Could that make difference?
>>
>>> Thanks
>>> Shally
>
> Only matter are where you store test file =)
> If you store test file on btrfs, pwd does nothing.
>

 Then steps listed in previous mail should work right?  Am listing them
 again here:

 " 
>
> 1. mount -t btrfs -o compress-force=zlib /dev/sdb1 mnt
>
> 2. fio --directory=mnt/ --numjobs=1 --direct=0 --buffered=1 --bs=64k
> --rw=write --iodepth=128 --name=test --size=1G
> --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer
> --ioengine=libaio
>
> 1GN file written uncompressed. Here no compression invoked (though
> compress-force=zlib)
>
> 3. cp mnt/test ./ --> copy back fio generated test file from btrfs
> mount point to local drive
>
> 4. hex dump test file (all FFs) -- confirmed that data is compressible
> no random data.
>
> 5. cp test mnt/  --> now, copy same test again back to mount point
> (reverse of step 3) . Now, here I see during copying compression is
> invoked.
>
> I am using kernel 4.9 and compress-foce is said to be working for
> kernel > 2.13 from wiki ... so I wonder what's so special with cp
> command which is not happening during fio writes???

 "-

 Thanks
 Shally


> --
> Have a nice day,
> Timofey.
>>>
>>> I did try reproduce your problem on 4.14, i'm hard to explain that
>>> happens and why file created by FIO will not use compression.
>>>
>>> Suggest workaround:
>>> rm -v /mnt/test.0.0
>>> dd if=/dev/zero of=/mnt/test.0.0 bs=1M count=1024
>>
>> You mean "compression" using "dd" write.
>>
>>> fio --directory=$HOME/test/ --numjobs=1 --direct=1 --buffered=0
>>> --bs=64k --rw=write --iodepth=128 --name=test --size=1G
>>> --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer
>>> --ioengine=libaio
>>
>> You mean "no compression" during fio writes and you see similar behavior as 
>> I?
>> BTW here --direct=0 and --buffered = 1.
>>
>>>
>>> You can check if data compressed by filefrag -v /mnt/test.0.0 (you
>>> will see encoded extents)
>>>
>> if I do filefrag on mnt/test.0.0, then it shows up like this:
>> ext: logical_offset:physical_offset: length:   expected: flags:
>>0:0..   1:1329922..   1329923:  2:
>>1:2..   3:1329923..   1329924:  2:1329924:
>>2:4..   5:1329924..   1329925:  2:1329925:
>>
>> An new to this, how does it says extents are encoded here?
>>
>> Thanks
>> Shally
>>
>>> --
>>> Have a nice day,
>>> Timofey.
>
> Yes, i was reproduce your problem, new file created by fio not compressed.
> So i write you about workaround, because i don't want to deep dig in
> that problem.
>
> Example of encoded:
> ~ dd if=/dev/zero of=./test.0.0 bs=1M count=1M count=1; sync
> ~ filefrag -v test.0.0
> Filesystem type is: 9123683e
> File size of test.0.0 is 1048576 (256 blocks of 4096 bytes)
> ext: logical_offset:physical_offset: length:   expected: flags:
>   0:0..  31:   72779979..  72780010: 32: encoded
>   1:   32..  63:   72751509..  72751540: 32:   72780011: encoded
>   2:   64..  95:   72755246..  72755277: 32:   72751541: encoded
>   3:   96.. 127:   72755610..  72755641: 32:   72755278: encoded
>   4:  128.. 159:   72755631..  72755662: 32:   72755642: encoded
>   5:  160.. 191:   72755722..  72755753: 32:   72755663: encoded
>   6:  192.. 223:   72757862..  72757893: 32:   72755754: encoded
>   7:  224.. 255:   72769455..  72769486: 32:   72757894:
> last,encoded,eof
> test.0.0: 8 extents found
>
Interesting part is I dont see "encoded" under flags. I couldn't
understand if flags are retrieved from btrfs file metadata info. As
you are running on 4.14 and I am on 4.9

So, am still under doubt, even with dd if files are getting compressed.

What is the filesize shown if you run
btrfs fi du /mnt/test0.0 .. is it less or actual size?

Is there any command that i can run to confirm file has been compressed?

So far, I had my prints enabled in kernel/fs/btrfs/compression.c and
check in dmesg that code jumped to compress_page() func.

Thanks
Shally

> --
> Have a nice day,
> Timofey.
--
To

Re: using fio to test btrfs compression

2017-09-20 13:13 GMT+03:00 shally verma :
> On Wed, Sep 20, 2017 at 3:06 PM, Timofey Titovets  
> wrote:
>> 2017-09-20 11:59 GMT+03:00 shally verma :
>>> On Wed, Sep 20, 2017 at 2:25 PM, Timofey Titovets  
>>> wrote:
 2017-09-20 11:44 GMT+03:00 shally verma :
> One more catch... I am initiating fio from non-btrfs filesystem i.e.
> pwd is ext4 based fs where as mount point is btrfs.
> Could that make difference?
>
>> Thanks
>> Shally

 Only matter are where you store test file =)
 If you store test file on btrfs, pwd does nothing.

>>>
>>> Then steps listed in previous mail should work right?  Am listing them
>>> again here:
>>>
>>> " 

 1. mount -t btrfs -o compress-force=zlib /dev/sdb1 mnt

 2. fio --directory=mnt/ --numjobs=1 --direct=0 --buffered=1 --bs=64k
 --rw=write --iodepth=128 --name=test --size=1G
 --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer
 --ioengine=libaio

 1GN file written uncompressed. Here no compression invoked (though
 compress-force=zlib)

 3. cp mnt/test ./ --> copy back fio generated test file from btrfs
 mount point to local drive

 4. hex dump test file (all FFs) -- confirmed that data is compressible
 no random data.

 5. cp test mnt/  --> now, copy same test again back to mount point
 (reverse of step 3) . Now, here I see during copying compression is
 invoked.

 I am using kernel 4.9 and compress-foce is said to be working for
 kernel > 2.13 from wiki ... so I wonder what's so special with cp
 command which is not happening during fio writes???
>>>
>>> "-
>>>
>>> Thanks
>>> Shally
>>>
>>>
 --
 Have a nice day,
 Timofey.
>>
>> I did try reproduce your problem on 4.14, i'm hard to explain that
>> happens and why file created by FIO will not use compression.
>>
>> Suggest workaround:
>> rm -v /mnt/test.0.0
>> dd if=/dev/zero of=/mnt/test.0.0 bs=1M count=1024
>
> You mean "compression" using "dd" write.
>
>> fio --directory=$HOME/test/ --numjobs=1 --direct=1 --buffered=0
>> --bs=64k --rw=write --iodepth=128 --name=test --size=1G
>> --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer
>> --ioengine=libaio
>
> You mean "no compression" during fio writes and you see similar behavior as I?
> BTW here --direct=0 and --buffered = 1.
>
>>
>> You can check if data compressed by filefrag -v /mnt/test.0.0 (you
>> will see encoded extents)
>>
> if I do filefrag on mnt/test.0.0, then it shows up like this:
> ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   1:1329922..   1329923:  2:
>1:2..   3:1329923..   1329924:  2:1329924:
>2:4..   5:1329924..   1329925:  2:1329925:
>
> An new to this, how does it says extents are encoded here?
>
> Thanks
> Shally
>
>> --
>> Have a nice day,
>> Timofey.

Yes, i was reproduce your problem, new file created by fio not compressed.
So i write you about workaround, because i don't want to deep dig in
that problem.

Example of encoded:
~ dd if=/dev/zero of=./test.0.0 bs=1M count=1M count=1; sync
~ filefrag -v test.0.0
Filesystem type is: 9123683e
File size of test.0.0 is 1048576 (256 blocks of 4096 bytes)
ext: logical_offset:physical_offset: length:   expected: flags:
  0:0..  31:   72779979..  72780010: 32: encoded
  1:   32..  63:   72751509..  72751540: 32:   72780011: encoded
  2:   64..  95:   72755246..  72755277: 32:   72751541: encoded
  3:   96.. 127:   72755610..  72755641: 32:   72755278: encoded
  4:  128.. 159:   72755631..  72755662: 32:   72755642: encoded
  5:  160.. 191:   72755722..  72755753: 32:   72755663: encoded
  6:  192.. 223:   72757862..  72757893: 32:   72755754: encoded
  7:  224.. 255:   72769455..  72769486: 32:   72757894:
last,encoded,eof
test.0.0: 8 extents found

-- 
Have a nice day,
Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: using fio to test btrfs compression

On Wed, Sep 20, 2017 at 3:06 PM, Timofey Titovets  wrote:
> 2017-09-20 11:59 GMT+03:00 shally verma :
>> On Wed, Sep 20, 2017 at 2:25 PM, Timofey Titovets  
>> wrote:
>>> 2017-09-20 11:44 GMT+03:00 shally verma :
 One more catch... I am initiating fio from non-btrfs filesystem i.e.
 pwd is ext4 based fs where as mount point is btrfs.
 Could that make difference?

> Thanks
> Shally
>>>
>>> Only matter are where you store test file =)
>>> If you store test file on btrfs, pwd does nothing.
>>>
>>
>> Then steps listed in previous mail should work right?  Am listing them
>> again here:
>>
>> " 
>>>
>>> 1. mount -t btrfs -o compress-force=zlib /dev/sdb1 mnt
>>>
>>> 2. fio --directory=mnt/ --numjobs=1 --direct=0 --buffered=1 --bs=64k
>>> --rw=write --iodepth=128 --name=test --size=1G
>>> --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer
>>> --ioengine=libaio
>>>
>>> 1GN file written uncompressed. Here no compression invoked (though
>>> compress-force=zlib)
>>>
>>> 3. cp mnt/test ./ --> copy back fio generated test file from btrfs
>>> mount point to local drive
>>>
>>> 4. hex dump test file (all FFs) -- confirmed that data is compressible
>>> no random data.
>>>
>>> 5. cp test mnt/  --> now, copy same test again back to mount point
>>> (reverse of step 3) . Now, here I see during copying compression is
>>> invoked.
>>>
>>> I am using kernel 4.9 and compress-foce is said to be working for
>>> kernel > 2.13 from wiki ... so I wonder what's so special with cp
>>> command which is not happening during fio writes???
>>
>> "-
>>
>> Thanks
>> Shally
>>
>>
>>> --
>>> Have a nice day,
>>> Timofey.
>
> I did try reproduce your problem on 4.14, i'm hard to explain that
> happens and why file created by FIO will not use compression.
>
> Suggest workaround:
> rm -v /mnt/test.0.0
> dd if=/dev/zero of=/mnt/test.0.0 bs=1M count=1024

You mean "compression" using "dd" write.

> fio --directory=$HOME/test/ --numjobs=1 --direct=1 --buffered=0
> --bs=64k --rw=write --iodepth=128 --name=test --size=1G
> --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer
> --ioengine=libaio

You mean "no compression" during fio writes and you see similar behavior as I?
BTW here --direct=0 and --buffered = 1.

>
> You can check if data compressed by filefrag -v /mnt/test.0.0 (you
> will see encoded extents)
>
if I do filefrag on mnt/test.0.0, then it shows up like this:
ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   1:1329922..   1329923:  2:
   1:2..   3:1329923..   1329924:  2:1329924:
   2:4..   5:1329924..   1329925:  2:1329925:

An new to this, how does it says extents are encoded here?

Thanks
Shally

> --
> Have a nice day,
> Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: using fio to test btrfs compression

2017-09-20 11:59 GMT+03:00 shally verma :
> On Wed, Sep 20, 2017 at 2:25 PM, Timofey Titovets  
> wrote:
>> 2017-09-20 11:44 GMT+03:00 shally verma :
>>> One more catch... I am initiating fio from non-btrfs filesystem i.e.
>>> pwd is ext4 based fs where as mount point is btrfs.
>>> Could that make difference?
>>>
 Thanks
 Shally
>>
>> Only matter are where you store test file =)
>> If you store test file on btrfs, pwd does nothing.
>>
>
> Then steps listed in previous mail should work right?  Am listing them
> again here:
>
> " 
>>
>> 1. mount -t btrfs -o compress-force=zlib /dev/sdb1 mnt
>>
>> 2. fio --directory=mnt/ --numjobs=1 --direct=0 --buffered=1 --bs=64k
>> --rw=write --iodepth=128 --name=test --size=1G
>> --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer
>> --ioengine=libaio
>>
>> 1GN file written uncompressed. Here no compression invoked (though
>> compress-force=zlib)
>>
>> 3. cp mnt/test ./ --> copy back fio generated test file from btrfs
>> mount point to local drive
>>
>> 4. hex dump test file (all FFs) -- confirmed that data is compressible
>> no random data.
>>
>> 5. cp test mnt/  --> now, copy same test again back to mount point
>> (reverse of step 3) . Now, here I see during copying compression is
>> invoked.
>>
>> I am using kernel 4.9 and compress-foce is said to be working for
>> kernel > 2.13 from wiki ... so I wonder what's so special with cp
>> command which is not happening during fio writes???
>
> "-
>
> Thanks
> Shally
>
>
>> --
>> Have a nice day,
>> Timofey.

I did try reproduce your problem on 4.14, i'm hard to explain that
happens and why file created by FIO will not use compression.

Suggest workaround:
rm -v /mnt/test.0.0
dd if=/dev/zero of=/mnt/test.0.0 bs=1M count=1024
fio --directory=$HOME/test/ --numjobs=1 --direct=1 --buffered=0
--bs=64k --rw=write --iodepth=128 --name=test --size=1G
--buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer
--ioengine=libaio

You can check if data compressed by filefrag -v /mnt/test.0.0 (you
will see encoded extents)

-- 
Have a nice day,
Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

RFC: crc32c vs xxhash32/64 (ARM)

Hi,
About 3 years ago Liu Bo has try add xxhash support to btrfs[1].
For now (Kernel 4.14) kernel have a native library for xxhash.

Main reason to not add that support is what:
- CRC32C - has hardware support and amazing fast
- CRC32C - Can fix 1 bit error
(Yes that possible, but that not used in btrfs and i hard to find
example of correction in the kernel)
So that a not a valuable reason.

xxhash32/64 did not have a HW support but must outperform CRC32c in SW mode.

I'm not sure that ARM boxes, as example, some NAS have a HW crc32c.
([2] show that CRC32 can have HW on some types of ARMv8).
([3] for ARM v8.1-A CRC hw must be)

May be that make a sense to add support for both xxhash32/64?
And do some type of testing on module load like raid6 have, and use
fastest for newly written data?

I test speed on xxhash32/64 and crc32c, and use fastest function for
new written data?

Thanks!

P.S.
May be some one has ARMv8 boxes?
Can you check which module used by btrfs:
~ dmesg| grep crc
[0.666469] Btrfs loaded, crc32c=crc32c-intel
May be you want/can test xxhash32/64 on arm, i backport kernel version in [4].
(But [4] doesn't have HW CRC32 support, i hard to find example
userspace implementation to stole on github and didn't have ARM box)

Links:
1.
http://linux-btrfs.vger.kernel.narkive.com/pO9lOshJ/patch-2-3-crypto-xxhash-add-tests
2.
https://github.com/torvalds/linux/blob/5518b69b76680a4f2df96b1deca260059db0c2de/arch/arm/crypto/crc32-ce-glue.c
3. https://en.wikipedia.org/wiki/ARM_architecture#ARMv8.1-A
4. https://github.com/Nefelim4ag/CRC32C_AND_8Byte_parity
--
Have a nice day,
Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: using fio to test btrfs compression

On Wed, Sep 20, 2017 at 2:25 PM, Timofey Titovets  wrote:
> 2017-09-20 11:44 GMT+03:00 shally verma :
>> One more catch... I am initiating fio from non-btrfs filesystem i.e.
>> pwd is ext4 based fs where as mount point is btrfs.
>> Could that make difference?
>>
>>> Thanks
>>> Shally
>
> Only matter are where you store test file =)
> If you store test file on btrfs, pwd does nothing.
>

Then steps listed in previous mail should work right?  Am listing them
again here:

" 
>
> 1. mount -t btrfs -o compress-force=zlib /dev/sdb1 mnt
>
> 2. fio --directory=mnt/ --numjobs=1 --direct=0 --buffered=1 --bs=64k
> --rw=write --iodepth=128 --name=test --size=1G
> --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer
> --ioengine=libaio
>
> 1GN file written uncompressed. Here no compression invoked (though
> compress-force=zlib)
>
> 3. cp mnt/test ./ --> copy back fio generated test file from btrfs
> mount point to local drive
>
> 4. hex dump test file (all FFs) -- confirmed that data is compressible
> no random data.
>
> 5. cp test mnt/  --> now, copy same test again back to mount point
> (reverse of step 3) . Now, here I see during copying compression is
> invoked.
>
> I am using kernel 4.9 and compress-foce is said to be working for
> kernel > 2.13 from wiki ... so I wonder what's so special with cp
> command which is not happening during fio writes???

"-

Thanks
Shally


> --
> Have a nice day,
> Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: using fio to test btrfs compression

2017-09-20 11:44 GMT+03:00 shally verma :
> One more catch... I am initiating fio from non-btrfs filesystem i.e.
> pwd is ext4 based fs where as mount point is btrfs.
> Could that make difference?
>
>> Thanks
>> Shally

Only matter are where you store test file =)
If you store test file on btrfs, pwd does nothing.

-- 
Have a nice day,
Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: using fio to test btrfs compression

On Wed, Sep 20, 2017 at 2:06 PM, shally verma
 wrote:
> On Mon, Sep 18, 2017 at 7:11 PM, Timofey Titovets  
> wrote:
>> 2017-09-18 16:28 GMT+03:00 shally verma :
>>> On Mon, Sep 18, 2017 at 1:56 PM, Timofey Titovets  
>>> wrote:
 2017-09-18 10:36 GMT+03:00 shally verma :
> Hi
>
> I wanted to test btrfs compression using fio command but somehow
> during fio writes, I don't see code taking route of compression blocks
> where as If I do a copy to btrfs compression enabled mount point then
> I can easily see code falling through compression.c.
>
> Here's how I do my setup
>
> 1. mkfs.btrfs /dev/sdb1
> 2. mount -t btrfs -o compress=zlib,compress-force /dev/sdb1 /mnt
> 3. cp  /mnt
> 4. dmesg shows print staments from compression.c and zlib.c confirming
> compression routine was invoked during write
> 5. now, copy back from btrfs mount point to home directory also shows
> decompress call invokation
>
> Now, try same with fio commands:
>
> fio command
>
> fio --directory=/mnt/ --numjobs=1 --direct=0 --buffered=1
> --ioengine=libaio --group_reporting --bs=64k --rw=write --iodepth=128
> --name=test --size=10G --runtime=180 --time_based
>
> But it seems to write uncompressed data.
>
> Any help here? what's missing?
>
> Thanks
> Shally
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

 1. mount -t btrfs -o compress=zlib,compress-force -> compress-force=zlib
 2. Tune fio to generate compressible data

>>> How do I "tune" fio to generate data. I had assumed once compression
>>> is enabled on btrfs any system fwrite call will simply compress data
>>> into it .Isn't it so?
>>> Can you share fio command that I can test?
>>> Thanks
>>> Shally

 --
 Have a nice day,
 Timofey.
>>
>> That useless to compress uncompressible data.
>> Also, as you enable compress, not compress-force
>> So after first uncompressible write btrfs just stop trying compress that 
>> file.
>>
>> From man fio:
>> buffer_compress_percentage=int
>>  If this is set, then fio will attempt to provide I/O
>> buffer content (on WRITEs) that compresses to the specified level. Fio
>> does this by providing a mix of random data and a fixed  pattern.  The
>>  fixed  pattern  is  either
>>  zeros,  or the pattern specified by buffer_pattern. If
>> the pattern option is used, it might skew the compression ratio
>> slightly. Note that this is per block size unit, for file/disk wide
>> compression level that matches this
>>  setting, you'll also want to set refill_buffers.
>>
>>   buffer_compress_chunk=int
>>  See buffer_compress_percentage. This setting allows fio
>> to manage how big the ranges of random data and zeroed data is.
>> Without this set, fio will provide buffer_compress_percentage of
>> blocksize random  data,  followed  by
>>  the remaining zeroed. With this set to some chunk size
>> smaller than the block size, fio can alternate random and zeroed data
>> throughout the I/O buffer.
>>
>> Good luck :)
>
> Now. I did following:
>
> 1. mount -t btrfs -o compress-force=zlib /dev/sdb1 mnt
>
> 2. fio --directory=mnt/ --numjobs=1 --direct=0 --buffered=1 --bs=64k
> --rw=write --iodepth=128 --name=test --size=1G
> --buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer
> --ioengine=libaio
>
> 1GN file written uncompressed. Here no compression invoked (though
> compress-force=zlib)
>
> 3. cp mnt/test ./ --> copy back fio generated test file from btrfs
> mount point to local drive
>
> 4. hex dump test file (all FFs) -- confirmed that data is compressible
> no random data.
>
> 5. cp test mnt/  --> now, copy same test again back to mount point
> (reverse of step 3) . Now, here I see during copying compression is
> invoked.
>
> I am using kernel 4.9 and compress-foce is said to be working for
> kernel > 2.13 from wiki ... so I wonder what's so special with cp
> command which is not happening during fio writes???
>
One more catch... I am initiating fio from non-btrfs filesystem i.e.
pwd is ext4 based fs where as mount point is btrfs.
Could that make difference?

> Thanks
> Shally
>
>
>> --
>> Have a nice day,
>> Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: using fio to test btrfs compression

On Mon, Sep 18, 2017 at 7:11 PM, Timofey Titovets  wrote:
> 2017-09-18 16:28 GMT+03:00 shally verma :
>> On Mon, Sep 18, 2017 at 1:56 PM, Timofey Titovets  
>> wrote:
>>> 2017-09-18 10:36 GMT+03:00 shally verma :
 Hi

 I wanted to test btrfs compression using fio command but somehow
 during fio writes, I don't see code taking route of compression blocks
 where as If I do a copy to btrfs compression enabled mount point then
 I can easily see code falling through compression.c.

 Here's how I do my setup

 1. mkfs.btrfs /dev/sdb1
 2. mount -t btrfs -o compress=zlib,compress-force /dev/sdb1 /mnt
 3. cp  /mnt
 4. dmesg shows print staments from compression.c and zlib.c confirming
 compression routine was invoked during write
 5. now, copy back from btrfs mount point to home directory also shows
 decompress call invokation

 Now, try same with fio commands:

 fio command

 fio --directory=/mnt/ --numjobs=1 --direct=0 --buffered=1
 --ioengine=libaio --group_reporting --bs=64k --rw=write --iodepth=128
 --name=test --size=10G --runtime=180 --time_based

 But it seems to write uncompressed data.

 Any help here? what's missing?

 Thanks
 Shally
 --
 To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>> 1. mount -t btrfs -o compress=zlib,compress-force -> compress-force=zlib
>>> 2. Tune fio to generate compressible data
>>>
>> How do I "tune" fio to generate data. I had assumed once compression
>> is enabled on btrfs any system fwrite call will simply compress data
>> into it .Isn't it so?
>> Can you share fio command that I can test?
>> Thanks
>> Shally
>>>
>>> --
>>> Have a nice day,
>>> Timofey.
>
> That useless to compress uncompressible data.
> Also, as you enable compress, not compress-force
> So after first uncompressible write btrfs just stop trying compress that file.
>
> From man fio:
> buffer_compress_percentage=int
>  If this is set, then fio will attempt to provide I/O
> buffer content (on WRITEs) that compresses to the specified level. Fio
> does this by providing a mix of random data and a fixed  pattern.  The
>  fixed  pattern  is  either
>  zeros,  or the pattern specified by buffer_pattern. If
> the pattern option is used, it might skew the compression ratio
> slightly. Note that this is per block size unit, for file/disk wide
> compression level that matches this
>  setting, you'll also want to set refill_buffers.
>
>   buffer_compress_chunk=int
>  See buffer_compress_percentage. This setting allows fio
> to manage how big the ranges of random data and zeroed data is.
> Without this set, fio will provide buffer_compress_percentage of
> blocksize random  data,  followed  by
>  the remaining zeroed. With this set to some chunk size
> smaller than the block size, fio can alternate random and zeroed data
> throughout the I/O buffer.
>
> Good luck :)

Now. I did following:

1. mount -t btrfs -o compress-force=zlib /dev/sdb1 mnt

2. fio --directory=mnt/ --numjobs=1 --direct=0 --buffered=1 --bs=64k
--rw=write --iodepth=128 --name=test --size=1G
--buffer_compress_percentage=100 --buffer_pattern=0xFF --refill_buffer
--ioengine=libaio

1GN file written uncompressed. Here no compression invoked (though
compress-force=zlib)

3. cp mnt/test ./ --> copy back fio generated test file from btrfs
mount point to local drive

4. hex dump test file (all FFs) -- confirmed that data is compressible
no random data.

5. cp test mnt/  --> now, copy same test again back to mount point
(reverse of step 3) . Now, here I see during copying compression is
invoked.

I am using kernel 4.9 and compress-foce is said to be working for
kernel > 2.13 from wiki ... so I wonder what's so special with cp
command which is not happening during fio writes???

Thanks
Shally

> --
> Have a nice day,
> Timofey.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: defragmenting best practice?

2017-09-20 Thread Dmitry Kudriavtsev

I've had a very similar issue with the performance of my laptop dropping to 
very low levels, eventually solved by uninstalling Snapper, deleting snapshots, 
and then defragmenting the drive.

This seems to be a common concern, I also had it happen on my desktop.

Dmitry

---

Thank you,
Dmitry Kudriavtsev

https://dkudriavtsev.xyz
inexpensivecomputers.net

⠀⠀⠀⣸⣧⠀⠀⠀
⠀⠀⣰⣿⣿⣆⠀⠀
⠀⣀⡙⠿⣿⣿⣆⠀Hey, did you hear about that cool new OS? It's called
⣰⣿⣿⣷⣿⣿⣿⣆Arch Linux. I use Arch Linux. Have you ever used Arch
⠀⠀⠀⣰⣿⣿⣿⡿⢿⣿⣿⣿⣆⠀⠀⠀Linux? You should use Arch Linux. Everyone uses Arch!
⠀⠀⣰⣿⣿⣿⡏⠀⠀⢹⣿⣿⠿⡆⠀⠀Check out i3wm too!
⠀⣰⣿⣿⣿⡿⠇⠀⠀⠸⢿⣿⣷⣦⣄⠀
⣼⠿⠛⠉⠉⠛⠿⣦
September 19 2017 11:38 PM, "Dave"  wrote:
>> On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
>>> When I do a
>>> btrfs filesystem defragment -r /directory
>>> does it defragment really all files in this directory tree, even if it
>>> contains subvolumes?
>>> The man page does not mention subvolumes on this topic.
>> 
>> No answer so far :-(
>> 
>> But I found another problem in the man-page:
>> 
>> Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as
>> with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4
>> will break up the ref-links of COW data (for example files copied with
>> cp --reflink, snapshots or de-duplicated data). This may cause
>> considerable increase of space usage depending on the broken up
>> ref-links.
>> 
>> I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
>> snapshots.
>> Therefore, I better should avoid calling "btrfs filesystem defragment -r"?
>> 
>> What is the defragmenting best practice?
>> Avoid it completly?
> 
> My question is the same as the OP in this thread, so I came here to
> read the answers before asking. However, it turns out that I still
> need to ask something. Should I ask here or start a new thread? (I'll
> assume here, since the topic is the same.)
> 
> Based on the answers here, it sounds like I should not run defrag at
> all. However, I have a performance problem I need to solve, so if I
> don't defrag, I need to do something else.
> 
> Here's my scenario. Some months ago I built an over-the-top powerful
> desktop computer / workstation and I was looking forward to really
> fantastic performance improvements over my 6 year old Ubuntu machine.
> I installed Arch Linux on BTRFS on the new computer (on an SSD). To my
> shock, it was no faster than my old machine. I focused a lot on
> Firefox performance because I use Firefox a lot and that was one of
> the applications in which I was most looking forward to better
> performance.
> 
> I tried everything I could think of and everything recommended to me
> in various forums (except switching to Windows) and the performance
> remained very disappointing.
> 
> Then today I read the following:
> 
> Gotchas - btrfs Wiki
> https://btrfs.wiki.kernel.org/index.php/Gotchas
> 
> Fragmentation: Files with a lot of random writes can become
> heavily fragmented (1+ extents) causing excessive multi-second
> spikes of CPU load on systems with an SSD or large amount a RAM. On
> desktops this primarily affects application databases (including
> Firefox). Workarounds include manually defragmenting your home
> directory using btrfs fi defragment. Auto-defragment (mount option
> autodefrag) should solve this problem.
> 
> Upon reading that I am wondering if fragmentation in the Firefox
> profile is part of my issue. That's one thing I never tested
> previously. (BTW, this system has 256 GB of RAM and 20 cores.)
> 
> Furthermore, on the same BTRFS Wiki page, it mentions the performance
> penalties of many snapshots. I am keeping 30 to 50 snapshots of the
> volume that contains the Firefox profile.
> 
> Would these two things be enough to turn top-of-the-line hardware into
> a mediocre-preforming desktop system? (The system performs fine on
> benchmarks -- it's real life usage, particularly with Firefox where it
> is disappointing.)
> 
> After reading the info here, I am wondering if I should make a new
> subvolume just for my Firefox profile(s) and not use COW and/or not
> keep snapshots on it and mount it with the autodefrag option.
> 
> As part of this strategy, I could send snapshots to another disk using
> btrfs send-receive. That way I would have the benefits of snapshots
> (which are important to me), but by not keeping any snapshots on the
> live subvolume I could avoid the performance problems.
> 
> What would you guys do in this situation?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: defragmenting best practice?

2017-09-20 Thread Dave

>On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
>> When I do a
>> btrfs filesystem defragment -r /directory
>> does it defragment really all files in this directory tree, even if it
>> contains subvolumes?
>> The man page does not mention subvolumes on this topic.
>
>No answer so far :-(
>
>But I found another problem in the man-page:
>
>  Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as well as
>  with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or >= 3.13.4
>  will break up the ref-links of COW data (for example files copied with
>  cp --reflink, snapshots or de-duplicated data). This may cause
>  considerable increase of space usage depending on the broken up
>  ref-links.
>
>I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
>snapshots.
>Therefore, I better should avoid calling "btrfs filesystem defragment -r"?
>
>What is the defragmenting best practice?
>Avoid it completly?

My question is the same as the OP in this thread, so I came here to
read the answers before asking. However, it turns out that I still
need to ask something. Should I ask here or start a new thread? (I'll
assume here, since the topic is the same.)

Based on the answers here, it sounds like I should not run defrag at
all. However, I have a performance problem I need to solve, so if I
don't defrag, I need to do something else.

Here's my scenario. Some months ago I built an over-the-top powerful
desktop computer / workstation and I was looking forward to really
fantastic performance improvements over my 6 year old Ubuntu machine.
I installed Arch Linux on BTRFS on the new computer (on an SSD). To my
shock, it was no faster than my old machine. I focused a lot on
Firefox performance because I use Firefox a lot and that was one of
the applications in which I was most looking forward to better
performance.

I tried everything I could think of and everything recommended to me
in various forums (except switching to Windows) and the performance
remained very disappointing.

Then today I read the following:

Gotchas - btrfs Wiki
https://btrfs.wiki.kernel.org/index.php/Gotchas

Fragmentation: Files with a lot of random writes can become
heavily fragmented (1+ extents) causing excessive multi-second
spikes of CPU load on systems with an SSD or large amount a RAM. On
desktops this primarily affects application databases (including
Firefox). Workarounds include manually defragmenting your home
directory using btrfs fi defragment. Auto-defragment (mount option
autodefrag) should solve this problem.

Upon reading that I am wondering if fragmentation in the Firefox
profile is part of my issue. That's one thing I never tested
previously. (BTW, this system has 256 GB of RAM and 20 cores.)

Furthermore, on the same BTRFS Wiki page, it mentions the performance
penalties of many snapshots. I am keeping 30 to 50 snapshots of the
volume that contains the Firefox profile.

Would these two things be enough to turn top-of-the-line hardware into
a mediocre-preforming desktop system? (The system performs fine on
benchmarks -- it's real life usage, particularly with Firefox where it
is disappointing.)

After reading the info here, I am wondering if I should make a new
subvolume just for my Firefox profile(s) and not use COW and/or not
keep snapshots on it and mount it with the autodefrag option.

As part of this strategy, I could send snapshots to another disk using
btrfs send-receive. That way I would have the benefits of snapshots
(which are important to me), but by not keeping any snapshots on the
live subvolume I could avoid the performance problems.

What would you guys do in this situation?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2] btrfs: Make check for unaligned device size more gentle

Commit 7dfb8be11b5d ("btrfs: Round down values which are written for
total_bytes_size") is fixing the unaligned device size caused by
adding/shrinking device.

It added a new WARN_ON() when device size is unaligned.
This is fine for new device added to btrfs using v4.13 kernel, but not
existing device whose total_bytes is already unaligned.

And the WARN_ON() will get triggered every time a block group get
created/removed on the unaligned device.

This patch will only enable the WARN_ON() for CONFIG_BTRFS_DEBUG so
developer won't miss any sanity check, but end user will not be troubled
by tons of warning.
And more useful fix will be output for end users.

Reported-by: Rich Rauenzahn 
Fixes: 7dfb8be11b5d ("btrfs: Round down values which are written for
total_bytes_size")
Signed-off-by: Qu Wenruo 
---
changelog:
v2:
  Use CONFIG_BTRFS_DEBUG to encapsulate the WARN_ON(), so end user won't
  see tons of kernel backtrace and developers can still catch the
  problem.
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/volumes.c | 16 
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5a8933da39a7..19b1a1789c52 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1562,7 +1562,9 @@ static inline void btrfs_set_device_total_bytes(struct 
extent_buffer *eb,
 {
BUILD_BUG_ON(sizeof(u64) !=
 sizeof(((struct btrfs_dev_item *)0))->total_bytes);
+#ifdef CONFIG_BTRFS_DEBUG
WARN_ON(!IS_ALIGNED(val, eb->fs_info->sectorsize));
+#endif
btrfs_set_64(eb, s, offsetof(struct btrfs_dev_item, total_bytes), val);
 }
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0e8f16c305df..afae25df6a8c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6472,15 +6472,23 @@ static int read_one_chunk(struct btrfs_fs_info 
*fs_info, struct btrfs_key *key,
return 0;
 }
 
-static void fill_device_from_item(struct extent_buffer *leaf,
-struct btrfs_dev_item *dev_item,
-struct btrfs_device *device)
+static void fill_device_from_item(struct btrfs_fs_info *fs_info,
+ struct extent_buffer *leaf,
+ struct btrfs_dev_item *dev_item,
+ struct btrfs_device *device)
 {
unsigned long ptr;
 
device->devid = btrfs_device_id(leaf, dev_item);
device->disk_total_bytes = btrfs_device_total_bytes(leaf, dev_item);
device->total_bytes = device->disk_total_bytes;
+   if (!IS_ALIGNED(device->total_bytes, fs_info->sectorsize)) {
+   btrfs_warn(fs_info,
+  "devid %llu has unaligned total bytes %llu",
+  device->devid, device->disk_total_bytes);
+   btrfs_warn(fs_info,
+  "please shrink the device a little and resize back 
to fix it");
+   }
device->commit_total_bytes = device->disk_total_bytes;
device->bytes_used = btrfs_device_bytes_used(leaf, dev_item);
device->commit_bytes_used = device->bytes_used;
@@ -6625,7 +6633,7 @@ static int read_one_dev(struct btrfs_fs_info *fs_info,
return -EINVAL;
}
 
-   fill_device_from_item(leaf, dev_item, device);
+   fill_device_from_item(fs_info, leaf, dev_item, device);
device->in_fs_metadata = 1;
if (device->writeable && !device->is_tgtdev_for_dev_replace) {
device->fs_devices->total_rw_bytes += device->total_bytes;
-- 
2.14.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]




On 2017年09月20日 14:11, nborisov wrote:

On 2017-09-20 07:39, Qu Wenruo wrote:

On 2017年09月20日 13:10, Qu Wenruo wrote:



On 2017年09月20日 12:59, Qu Wenruo wrote:



On 2017年09月20日 12:49, Rich Rauenzahn wrote:



On 9/19/2017 5:31 PM, Qu Wenruo wrote:

On 2017年09月19日 23:56, Rich Rauenzahn wrote:

[    4.747377] WARNING: CPU: 3 PID: 439 at fs/btrfs/ctree.h:1559
btrfs_update_device+0x1c5/0x1d0 [btrfs]


Is that line the following WARN_ON()?
---
static inline void btrfs_set_device_total_bytes(struct 
extent_buffer *eb,

    struct btrfs_dev_item *s,
    u64 val)
{
BUILD_BUG_ON(sizeof(u64) !=
 sizeof(((struct btrfs_dev_item *)0))->total_bytes);
WARN_ON(!IS_ALIGNED(val, eb->fs_info->sectorsize)); <<<
btrfs_set_64(eb, s, offsetof(struct btrfs_dev_item, 
total_bytes), val);

}
---

If so, that means your devices size is not aligned to 4K.

Is your block device still using old 512 block size?
AFAIK nowadays most HDDs are using 4K blocksize and it's 
recommended to use it.


It's not a big problem and one can easily remove the WARN_ON().
But I think we'd better fix the caller to do round_down() before 
calling this function.




That's interesting!  I believe I made an effort to align them when 
I set it up years ago, but never knew how to verify.


Well, best verifying if that's the line causing the warning, since I 
don't have the source of RedHat kernel.




I have three mirrored filesystems:


[snip]


Number  Start (sector)    End (sector)  Size   Code Name
    1  40  3907029134   1.8 TiB 8300 BTRFS MEDIA
GPT fdisk (gdisk) version 0.8.6


At least this size is not aligned to 4K.



Partition table scan:

[snip]


.and one is aligned differently!

Could it be /dev/sdd that's the issue?  But it's aligned at 4096 -- 
so I'm not sure that's the issue after all.


Its start sector is aligned, but end point is not, so the size is 
not aligned either.


BTW, is /dev/sdd added to btrfs using "btrfs device add"?
In my test, if making btrfs on a unaligned file, it will round down 
to its sectorsize boundary.


Confirmed that "btrfs device add" won't round down the size.
Check the btrfs-debug-tree output:
--
 item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
 devid 1 total_bytes 10737418240 bytes_used 2172649472
 io_align 4096 io_width 4096 sector_size 4096 type 0
 generation 0 start_offset 0 dev_group 0
 seek_speed 0 bandwidth 0
 uuid 243a1117-ca31-4d87-8656-81c5630aafb2
 fsid 6452cde7-14d5-4541-aa07-b265a400bad0
 item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98
 devid 2 total_bytes 1073742336 bytes_used 0
 io_align 4096 io_width 4096 sector_size 4096 type 0
 generation 0 start_offset 0 dev_group 0
 seek_speed 0 bandwidth 0
 uuid 6bb07260-d230-4e22-88b1-1eabb46622ed
 fsid 6452cde7-14d5-4541-aa07-b265a400bad0
--


Sorry, the output is from v4.12.x, so no kernel warning nor the patch
rounding down the value.



Where first device is completely aligned, the 2nd device which is 
just 1G + 512, definitely not aligned.


So if you're using single device purely created by mkfs.btrfs, you're 
OK.

But if any new device added, you're not OK and causing the false alert.

Any way, it should not be hard to fix.
Just remove the WARN_ON() and add extra round_down when adding device.


In v4.13 kernel, the newly added devices are in fact rounded down.
But existing device doesn't get the round down.


We got a report internally at Suse of this problem and it prevented a 
filesystem from being mounted due to the
following check failing: 
http://elixir.free-electrons.com/linux/latest/source/fs/btrfs/volumes.c#L6893. 



Hence I added the rounding down fixes. And this warning was put 
specifically to catch future offenders and see
if I had missed a place to patch it. So removing the warning is the 
wrong solution to the problem.


Then at least only enable it for BTRFS_DEBUG.

For end user it's just confusing.

I have submitted a patch to do the check at mounting time, and warning 
user to do shrink to fix it.

(Although still removed the WARN_ON)

I think such warning breaks backward compatibility should be as gentle 
as possible for end users.


Thanks,
Qu


Generally if
balancing kicked in had to resize his disk everything would be back to 
normal. >


So it's recommended to resize (shrink) your fs for very small size to
fix it if you don't want to wait for the kernel fix.

Thanks,
Qu


Thanks for the report,
Qu



So I'm wondering if it's caused by added new btrfs device.

Thanks,
Qu

-- To unsubscribe from this list: send the line "unsubscribe 
linux-btrfs" in

the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- To unsubscribe from this list:

[PATCH] btrfs: Remove WARN_ON for unaligned device created before v4.13 and adds more user friendly output