Re: Disk usage is more than double the snapshots exclusive data

2017-06-15 Thread Duncan
Vianney Stroebel posted on Thu, 15 Jun 2017 11:44:34 +0200 as excerpted:

> On a backup drive for a home computer, disk usage as shown by 'btrfs fi
> show' is more than double the snapshots exclusive data as shown by
> "btrfs qgroup show" (574 GB vs 265 GB).
> 
> I've done a lot of research online and I couldn't find any answer to
> this problem.
> 
> Output of some commands:
> 
> sudo btrfs fi show
> Label: 'btrfs-backup'  uuid: [...]
>   Total devices 1 FS bytes used 573.89GiB
>   devid1 size 698.64GiB used 697.04GiB path /dev/sdb1
> 
> btrfs fi df /mnt/btrfs-backup
> Data, single: total=670.01GiB, used=564.26GiB
> System, DUP: total=8.00MiB, used=112.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, DUP: total=13.50GiB, used=8.64GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=14.95MiB

[Summary of quote-omitted: kernel 4.10.0 ubuntu, progs 4.9.1, subvolume 
total exclusive 265.23GB]

Five points to make here, two explaining the data that you may be 
misunderstanding, and three recommendations.

But first a disclaimer.  Note that I'm a btrfs user and list regular, but 
not a dev.  So if a btrfs dev says different, go with what they say, not 
what I say.  Also, my particular use-case doesn't use either quotas or 
snapshots, so no personal experience with them, with my knowledge there 
being from the list.  But this is list-common knowledge, and my reply 
will mean others don't have to.

The recommendations:

1) Try: btrfs balance start -dusage=0 -musage=0 /mnt/btrfs-backup

That should quickly get rid of the unused single-mode system and metadata 
chunks, which are an artifact from the way older mkfs.btrfs created the 
filesystem.  Newer mkfs.btrfs doesn't create them any more, but you still 
have to remove the ones created by old mkfs.btrfs with a manual balance.

It might or might not free up additional space from the used single data 
and dup metadata chunks as well.

2) See that devid line under btrfs fi show?  That displays how much space 
is actually allocated (see explanation points for discussion), and by 
subtraction, what remains unallocated, on the filesystem.

You only have about a gig and a half actually unallocated, which is 
alarmingly small.  You really want a few GiB unallocated, at least, in 
ordered to allow btrfs balance, should you need to run it, to do its 
thing.

And actually, btrfs global reserve (which is part of metadata, so 
metadata used will never show zero because the global reserve comes from 
it too) is normally not used at all -- that the reported usage there 
isn't zero indicates that btrfs DID run out of normal space at one point 
and had to use its reserves!  That demonstrates the severity of your 
situation!

You're going to need to try to reduce unused allocations.  I'll explain 
how it works in the explanation below, but here's what you're going to 
try to do (N being a percentage, so 0-100):

btrfs balance start -musage=N /mnt/btrfs-backup

btrfs balance start -dusage=N /mnt/btrfs-backup

You'll need to start small, say with 10 in place of the N.  That says 
only try to balance chunks 10% or less full.  The rebalance process 
rewrites chunks, combining partially full ones in the process, so with 
usage=10, if there's any chunks with that little usage to balance, the 
balance will create one new one to write into, and should be able to fit 
at least 10 old ones, 10% full or less, into it.  So you'll free 9 of the 
10. =:^)

Your goal will be at least 10 GiB unallocated before you stop.  Given a 
filesystem size of 698.64 GiB, that means you'll want to try to get the 
devid line usage below 688 GiB.

But of course you may not have that many that are only 10% full, so you 
may have to go higher to free enough space.  Try 20, 30, 50...  Above 50% 
however, things will slow down as 50% full means you rewrite two to 
recover one, 67% means rewriting three to recover one, etc.  And the 
fuller they are the longer they will take to write!  So if you get what 
you need with 10, great, but you may have to go higher and take longer to 
rewrite it all, and once you hit 67%, you'll be taking long enough to 
write and getting little enough back, that it's probably not worth even 
trying anything higher.

If at any point you get an error saying there wasn't enough space, you'll 
need to try something lower.  If usage=10 fails, try usage=5, down to 
usage=1 if necessary.  Then try increasing it again, say 1, 2, 5...  The 
more space you get unallocated, the easier it'll be, until at several 
gigs, you shouldn't have to worry about running out of space any more.  
That's why I recommended a goal of at least 10 GiB unallocated.

If you don't get much beyond those unused single metadata and system 
chunks returned by balance with usage=0, and balance with usage=1 returns 
an error when tried with either metadata (-m) or data (-d), and/or 
usage=1 goes fine but increasing usage= you still get an ENOSPC before 

Re: [PATCH] btrfs/145: Test various btrfs operations rounding behavior

2017-06-15 Thread Eryu Guan
On Thu, Jun 15, 2017 at 04:45:38PM +0300, Nikolay Borisov wrote:
> When changing the size of disks/filesystem we should always be
> rounding down to a multiple of sectorsize. This is a test for the following
> btrfs patche: 
> 
> btrfs: Round up values which are written for total_bytes_size
> 
> Signed-off-by: Nikolay Borisov 

[ cc btrfs list ]

I'd like to use some help from other btrfs developers on this test :)

I'll comment from fstests perspective of view.

> ---
>  tests/btrfs/145 | 147 
> 
>  tests/btrfs/145.out |  20 +++
>  tests/btrfs/group   |   1 +
>  3 files changed, 168 insertions(+)
>  create mode 100755 tests/btrfs/145
>  create mode 100644 tests/btrfs/145.out
> 
> diff --git a/tests/btrfs/145 b/tests/btrfs/145
> new file mode 100755
> index 000..4461b37
> --- /dev/null
> +++ b/tests/btrfs/145
> @@ -0,0 +1,147 @@
> +#! /bin/bash
> +# FS QA Test No. btrfs/145
> +# 

Trailing whitespace issue in above line, and many other lines in this
test, please remove them all.

> +# Test that various code paths which deal with modifying the total size
> +# of devices/superblock correctly round the value to a multiple of 
> +# sector size
> +#
> +#---
> +#
> +# Copyright (C) 2017 SUSE Linux Products GmbH. All Rights Reserved.
> +# Author: Nikolay Borisov 
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + $UMOUNT_PROG $loop_mnt 
> + _destroy_loop_device $loop_dev1
> + _destroy_loop_device $loop_dev2
> + cd /
> + rm -f $tmp.*
> + 
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_loop
> +_require_block_device $SCRATCH_DEV

I see 'btrfs inspect-internal dump-super' is used, I think we need

_require_btrfs_command inspect-internal dump-super

> +
> +rm -f $seqres.full
> +
> +_scratch_mkfs >>$seqres.full 2>&1
> +_scratch_mount
> +
> +
> +#Size of devices which are going to be half a page larger than 
> +#default sectorsize (4kb)

Nitpick, please add a space after '#' for comments.

> +expectedsize=$(( 1 * 1024 * 1024 * 1024 )) 
> +filesize=$(( $expectedsize + 2048 ))
> +loop_mnt=$SCRATCH_MNT/mount
> +fs_img1=$SCRATCH_MNT/disk1.img
> +fs_img2=$SCRATCH_MNT/disk2.img
> +
> +mkdir $loop_mnt
> +
> +#create files to hold dummy devices
> +$XFS_IO_PROG -f -c "falloc 0 $filesize" $fs_img1 &> /dev/null
> +$XFS_IO_PROG -f -c "falloc 0 $filesize" $fs_img2 &> /dev/null

Perhaps we need to make sure $SCRATCH_MNT has enough free space to hold
the two fs imgs.

_require_fs_space $SCRATCH_MNT 

> +
> +loop_dev1=$(_create_loop_device $fs_img1)
> +loop_dev2=$(_create_loop_device $fs_img2) 
> +
> +#create fs only on the first device
> +_mkfs_dev $loop_dev1 
> +_mount $loop_dev1 $loop_mnt
> +
> +echo "Size from mkfs"
> +$BTRFS_UTIL_PROG inspect-internal dump-super /dev/loop0 | grep total
^^ $loop_dev1
> +
> +#resize down to 768mb + 2k
> +$BTRFS_UTIL_PROG fi resize 824182784 $loop_mnt >>$seqres.full 2>&1 

Better to use full subcommand name, i.e. s/fi/filesystem/, I remembered
there was a patch to do this

2fd80ca btrfs: use full subcommand names

> +sync
> +
> +echo ""
> +
> +echo "Size after resizing down"
> +$BTRFS_UTIL_PROG inspect-internal dump-super $loop_dev1 | grep total
> +
> +echo ""
> +
> +#now resize back up to 1 gb
> +$BTRFS_UTIL_PROG fi resize max $loop_mnt >>$seqres.full 2>&1 
> +sync
> +
> +echo "Size after resizing up"
> +$BTRFS_UTIL_PROG inspect-internal dump-super /dev/loop0 | grep total
> +
> +#Add fs load for later balance operations, we need to do this 
> +#before adding a second device
> +$FSSTRESS_PROG -w -n 15000 -p 2 -d $loop_mnt >> $seqres.full 2>&1
> +
> +#add second unaligned device, this checks the btrfs_init_new_device codepath 
> +#device should be aligned
> +$BTRFS_UTIL_PROG device add $loop_dev2 

[PATCH 1/2] btrfs-progs: Fix false alert about EXTENT_DATA shouldn't be hole

2017-06-15 Thread Qu Wenruo
Since incompat feature NO_HOLES still allow us to have explicit hole
file extent, current check is too restrict and will cause false alert
like:

root 5 EXTENT_DATA[257, 0] shouldn't be hole

Fix it by removing the restrict hole file extent check.

Reported-by: Henk Slager 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index c052f66e..7bd57677 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -4841,11 +4841,7 @@ static int check_file_extent(struct btrfs_root *root, 
struct btrfs_key *fkey,
}
 
/* Check EXTENT_DATA hole */
-   if (no_holes && is_hole) {
-   err |= FILE_EXTENT_ERROR;
-   error("root %llu EXTENT_DATA[%llu %llu] shouldn't be hole",
- root->objectid, fkey->objectid, fkey->offset);
-   } else if (!no_holes && *end != fkey->offset) {
+   if (!no_holes && *end != fkey->offset) {
err |= FILE_EXTENT_ERROR;
error("root %llu EXTENT_DATA[%llu %llu] interrupt",
  root->objectid, fkey->objectid, fkey->offset);
-- 
2.13.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs-progs: Add test case to check hole file extents with NO_HOLES flag

2017-06-15 Thread Qu Wenruo
Add test case which we have NO_HOLES incompat flag while still have
hole file extent.

This can be created by enabling NO_HOLES feature on an existing
filesystem, which lowmem mode would cause false alert for it.

Signed-off-by: Qu Wenruo 
---
 tests/fsck-tests/025-file-extents/test.sh | 17 +
 1 file changed, 17 insertions(+)

diff --git a/tests/fsck-tests/025-file-extents/test.sh 
b/tests/fsck-tests/025-file-extents/test.sh
index e4bc4247..66cfbaab 100755
--- a/tests/fsck-tests/025-file-extents/test.sh
+++ b/tests/fsck-tests/025-file-extents/test.sh
@@ -5,6 +5,7 @@ source "$TOP/tests/common"
 
 check_prereq btrfs
 check_prereq mkfs.btrfs
+check_prereq btrfstune
 check_global_prereq dd
 check_global_prereq fallocate
 
@@ -38,5 +39,21 @@ test_compressed_inline_extent()
run_check "$TOP/btrfs" check "$TEST_DEV"
 }
 
+# Hole file extent with NO_HOLES incompat flag
+# Lowmem mode will cause false alert as it doesn't allow any hole file extent
+# exist, while we can set NO_HOLES at anytime we want, it's definitely a false
+# alert
+test_hole_extent_with_no_holes_flag()
+{
+   run_check $SUDO_HELPER "$TOP/mkfs.btrfs" -f "$TEST_DEV"
+   run_check_mount_test_dev
+
+   run_check $SUDO_HELPER truncate -s 16K "$TEST_MNT/tmp"
+   run_check_umount_test_dev
+   run_check $SUDO_HELPER "$TOP/btrfstune" -n "$TEST_DEV"
+   run_check "$TOP/btrfs" check "$TEST_DEV"
+}
+
 test_paritical_write_into_prealloc
 test_compressed_inline_extent
+test_hole_extent_with_no_holes_flag
-- 
2.13.1



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] btrfs: add fs flag to force device flushing

2017-06-15 Thread Anand Jain






diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 59a732a13370..659a3b4645d2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3494,7 +3494,8 @@ static void write_dev_flush(struct btrfs_device 
*device)

  struct request_queue *q = bdev_get_queue(device->bdev);
  struct bio *bio = device->flush_bio;
-if (!test_bit(QUEUE_FLAG_WC, >queue_flags))
+if (!test_bit(BTRFS_FS_FORCE_DEV_FLUSH, >fs_info->flags)
+&& !test_bit(QUEUE_FLAG_WC, >queue_flags))
  return;



  Now I understand what you meant. But the most common case in our test
  set up is a device with write cache. So BTRFS_FS_FORCE_DEV_FLUSH does
  not bring any additional force. IMO.

Thanks, Anand


 Or one another idea is we could remove

   !test_bit(QUEUE_FLAG_WC, >queue_flags)

 which purpose is to only fail early.

 If we remove it there is consistency in our code with or
 with out the write cache.

Thanks, Anand
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/5] btrfs: sysfs: export the force_dev_flush flag

2017-06-15 Thread Anand Jain



On 06/16/2017 12:49 AM, David Sterba wrote:

Add per-filesystem tunable to switch on/off the device barriers,
regardless of the actual device support. This is a debugging feature.

The path to the sysfs file is /sys/fs/btrfs/UUID/force_dev_flush,
allowed values are 0 and 1. Default is 0.



 Anyway the flush command won't touch the device if its write through.
 So unless I am missing something, IMO we don't need something
 like this.

---
generic_make_request_checks(struct bio *bio)
::
/*
 * Filter flush bio's early so that make_request based
 * drivers without flush support don't have to worry
 * about them.
 */
if (op_is_flush(bio->bi_opf) &&
!test_bit(QUEUE_FLAG_WC, >queue_flags)) {
bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA);
if (!nr_sectors) {
err = 0;
goto end_io;
}
}
---



Thanks, Anand



Signed-off-by: David Sterba 
---
  fs/btrfs/sysfs.c | 47 +++
  1 file changed, 47 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index c2d5f3580b4c..197d43911936 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -487,12 +487,59 @@ static ssize_t quota_override_store(struct kobject *kobj,
  
  BTRFS_ATTR_RW(quota_override, quota_override_show, quota_override_store);
  
+static ssize_t force_dev_flush_show(struct kobject *kobj,

+  struct kobj_attribute *a, char *buf)
+{
+   struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+   int force;
+
+   if (!fs_info)
+   return -EPERM;
+
+   force = !!(test_bit(BTRFS_FS_FORCE_DEV_FLUSH, _info->flags));
+   return snprintf(buf, PAGE_SIZE, "%d\n", force);
+}
+
+static ssize_t force_dev_flush_store(struct kobject *kobj,
+   struct kobj_attribute *a,
+   const char *buf, size_t len)
+{
+   struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+   unsigned long force;
+   int err;
+
+   if (!fs_info)
+   return -EPERM;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
+   err = kstrtoul(buf, 10, );
+   if (err)
+   return err;
+   if (force > 1)
+   return -EINVAL;
+
+   if (force) {
+   set_bit(BTRFS_FS_FORCE_DEV_FLUSH, _info->flags);
+   btrfs_info(fs_info, "Forced device flushes enabled");
+   } else {
+   clear_bit(BTRFS_FS_FORCE_DEV_FLUSH, _info->flags);
+   btrfs_info(fs_info, "Forced device flushes disabled");
+   }
+
+   return len;
+}
+
+BTRFS_ATTR_RW(force_dev_flush, force_dev_flush_show, force_dev_flush_store);
+
  static const struct attribute *btrfs_attrs[] = {
BTRFS_ATTR_PTR(label),
BTRFS_ATTR_PTR(nodesize),
BTRFS_ATTR_PTR(sectorsize),
BTRFS_ATTR_PTR(clone_alignment),
BTRFS_ATTR_PTR(quota_override),
+   BTRFS_ATTR_PTR(force_dev_flush),
NULL,
  };
  


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] btrfs: add fs flag to force device flushing

2017-06-15 Thread Anand Jain



On 06/16/2017 12:49 AM, David Sterba wrote:

We need a device capable of device barriers so we can test the flush
code. To aid testing, add a per-filesystem status flag that affects the
barriers regerdless of the device capabilities and obviously does not
give the same guarantees.

It's off by default, sysfs tunable will follow.

Signed-off-by: David Sterba 
---
  fs/btrfs/ctree.h   | 1 +
  fs/btrfs/disk-io.c | 8 +---
  fs/btrfs/volumes.h | 1 +
  3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f0f5f28784b6..dcf4404f7d61 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -716,6 +716,7 @@ struct btrfs_delayed_root;
  #define BTRFS_FS_LOG1_ERR 12
  #define BTRFS_FS_LOG2_ERR 13
  #define BTRFS_FS_QUOTA_OVERRIDE   14
+#define BTRFS_FS_FORCE_DEV_FLUSH   15
  
  /*

   * Indicate that a whole-filesystem exclusive operation is running
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 59a732a13370..659a3b4645d2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3494,7 +3494,8 @@ static void write_dev_flush(struct btrfs_device *device)
struct request_queue *q = bdev_get_queue(device->bdev);
struct bio *bio = device->flush_bio;
  
-	if (!test_bit(QUEUE_FLAG_WC, >queue_flags))

+   if (!test_bit(BTRFS_FS_FORCE_DEV_FLUSH, >fs_info->flags)
+   && !test_bit(QUEUE_FLAG_WC, >queue_flags))
return;



 Now I understand what you meant. But the most common case in our test
 set up is a device with write cache. So BTRFS_FS_FORCE_DEV_FLUSH does
 not bring any additional force. IMO.

Thanks, Anand


bio_reset(bio);
@@ -3505,6 +3506,7 @@ static void write_dev_flush(struct btrfs_device *device)
bio->bi_private = >flush_wait;
  
  	submit_bio(bio);

+   device->flush_bio_sent = 1;
  }
  
  /*

@@ -3512,12 +3514,12 @@ static void write_dev_flush(struct btrfs_device *device)
   */
  static int wait_dev_flush(struct btrfs_device *device)
  {
-   struct request_queue *q = bdev_get_queue(device->bdev);
struct bio *bio = device->flush_bio;
  
-	if (!test_bit(QUEUE_FLAG_WC, >queue_flags))

+   if (!device->flush_bio_sent)
return 0;
  
+	device->flush_bio_sent = 0;

wait_for_completion_io(>flush_wait);
  
  	return bio->bi_error;

diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 35327efecdbb..6f45fd60d15a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -75,6 +75,7 @@ struct btrfs_device {
int can_discard;
int is_tgtdev_for_dev_replace;
int last_flush_error;
+   int flush_bio_sent;
  
  #ifdef __BTRFS_NEED_DEVICE_DATA_ORDERED

seqcount_t data_seqcount;


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/5] btrfs: move dev stats accounting out of wait_dev_flush

2017-06-15 Thread Anand Jain



On 06/16/2017 12:49 AM, David Sterba wrote:

We should really just wait in wait_dev_flush and let the caller decide
what to do with the error value.


 Nice.
 Reviewed-by: Anand Jain 



Signed-off-by: David Sterba 
---
  fs/btrfs/disk-io.c | 11 +++
  1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 05ff81ecb887..59a732a13370 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3512,7 +3512,6 @@ static void write_dev_flush(struct btrfs_device *device)
   */
  static int wait_dev_flush(struct btrfs_device *device)
  {
-   int ret = 0;
struct request_queue *q = bdev_get_queue(device->bdev);
struct bio *bio = device->flush_bio;
  
@@ -3521,13 +3520,7 @@ static int wait_dev_flush(struct btrfs_device *device)
  
  	wait_for_completion_io(>flush_wait);
  
-	if (bio->bi_error) {

-   ret = bio->bi_error;
-   btrfs_dev_stat_inc_and_print(device,
-   BTRFS_DEV_STAT_FLUSH_ERRS);
-   }
-
-   return ret;
+   return bio->bi_error;
  }
  
  static int check_barrier_error(struct btrfs_fs_devices *fsdevs)

@@ -3586,6 +3579,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
ret = wait_dev_flush(dev);
if (ret) {
dev->last_flush_error = ret;
+   btrfs_dev_stat_inc_and_print(dev,
+   BTRFS_DEV_STAT_FLUSH_ERRS);
errors_wait++;
}
}


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/5] btrfs: preallocate device flush bio

2017-06-15 Thread Anand Jain



On 06/16/2017 12:49 AM, David Sterba wrote:

For devices that support flushing, we allocate a bio, submit, wait for
it and then free it. The bio allocation does not fail so ENOMEM is not a
problem but we still may unnecessarily stress the allocation subsystem.

Instead, we can allocate the device at the same time we allocate the
device and reuse it each time we need to flush the barriers. The bio is
reset before each use. Reference counting is simplified to just device
allocation (get) and freeing (put).

Note for write_dev_flush: we check the queue flush status again as we
can't use the existence of bio as before.


 Looks good few items as below..


Signed-off-by: David Sterba 
---
  fs/btrfs/disk-io.c | 24 ++--
  fs/btrfs/volumes.c | 12 
  2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2b00ebff13f8..27d44d6ab775 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3482,9 +3482,7 @@ static int write_dev_supers(struct btrfs_device *device,
   */
  static void btrfs_end_empty_barrier(struct bio *bio)
  {
-   if (bio->bi_private)
-   complete(bio->bi_private);
-   bio_put(bio);
+   complete(bio->bi_private);
  }
  
  /*

@@ -3494,26 +3492,19 @@ static void btrfs_end_empty_barrier(struct bio *bio)
  static void write_dev_flush(struct btrfs_device *device)
  {
struct request_queue *q = bdev_get_queue(device->bdev);
-   struct bio *bio;
+   struct bio *bio = device->flush_bio;
  
  	if (!test_bit(QUEUE_FLAG_WC, >queue_flags))

return;
  
-	/*

-* one reference for us, and we leave it for the
-* caller
-*/
-   device->flush_bio = NULL;
-   bio = btrfs_io_bio_alloc(0);
+   bio_reset(bio);
bio->bi_end_io = btrfs_end_empty_barrier;
bio->bi_bdev = device->bdev;
bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH;
init_completion(>flush_wait);
bio->bi_private = >flush_wait;
-   device->flush_bio = bio;
  
-	bio_get(bio);

-   btrfsic_submit_bio(bio);
+   submit_bio(bio);


 Originally it went through the btrfsic. There is no mention
 of this change if its not an oversight.


  }
  
  /*

@@ -3522,9 +3513,10 @@ static void write_dev_flush(struct btrfs_device *device)
  static int wait_dev_flush(struct btrfs_device *device)
  {
int ret = 0;
+   struct request_queue *q = bdev_get_queue(device->bdev);
struct bio *bio = device->flush_bio;
  
-	if (!bio)

+   if (!test_bit(QUEUE_FLAG_WC, >queue_flags))
return 0;


 It returns here if its write through. Which can be toggled
 after write_dev_flush() has been called such as..

  echo "write back" > /sys/block/sdd/queue/write_cache
  write_dev_flush(sdd)
  echo "write through" > /sys/block/sdd/queue/write_cache
  wait_dev_flush(sdd)

 So it would fails to check error.



wait_for_completion(>flush_wait);
@@ -3535,10 +3527,6 @@ static int wait_dev_flush(struct btrfs_device *device)
BTRFS_DEV_STAT_FLUSH_ERRS);
}
  
-	/* drop the reference from the wait == 0 run */

-   bio_put(bio);
-   device->flush_bio = NULL;
-
return ret;
  }
  
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c

index 8bb1f4e5905a..251ae81e4363 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -242,6 +242,17 @@ static struct btrfs_device *__alloc_device(void)
if (!dev)
return ERR_PTR(-ENOMEM);
  
+	/*

+* Preallocate a bio that's always going to be used for flushing device
+* barriers and matches the device lifespan
+*/
+   dev->flush_bio = bio_alloc_bioset(GFP_KERNEL, 0, NULL);


  Nice.

Thanks, Anand



+   if (!dev->flush_bio) {
+   kfree(dev);
+   return ERR_PTR(-ENOMEM);
+   }
+   bio_get(dev->flush_bio);
+
INIT_LIST_HEAD(>dev_list);
INIT_LIST_HEAD(>dev_alloc_list);
INIT_LIST_HEAD(>resized_list);
@@ -838,6 +849,7 @@ static void __free_device(struct work_struct *work)
  
  	device = container_of(work, struct btrfs_device, rcu_work);

rcu_string_free(device->name);
+   bio_put(device->flush_bio);
kfree(device);
  }
  


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: use GFP_KERNEL in btrfs_init_dev_replace_tgtdev

2017-06-15 Thread Anand Jain



On 06/15/2017 11:15 PM, David Sterba wrote:

The function is called from ioctl context and we don't hold any locks
that take part in writeback. Right now it's only fs_info::volume_mutex.


 yeah volume_mutex is fine.

Reviewed-by: Anand Jain 


Signed-off-by: David Sterba 
---
  fs/btrfs/dev-replace.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index c6cd1d599c36..f7df85509125 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -242,7 +242,7 @@ static int btrfs_init_dev_replace_tgtdev(struct 
btrfs_fs_info *fs_info,
goto error;
}
  
-	name = rcu_string_strdup(device_path, GFP_NOFS);

+   name = rcu_string_strdup(device_path, GFP_KERNEL);
if (!name) {
kfree(device);
ret = -ENOMEM;


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: use GFP_KERNEL in btrfs_calc_avail_data_space

2017-06-15 Thread Anand Jain



On 06/15/2017 09:07 PM, David Sterba wrote:

We don't hold any locks here. Inidirectly called from statfs.


 Reviewed-by: Anand Jain 


Signed-off-by: David Sterba 
---
  fs/btrfs/super.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4f1cdd5058f1..6f2e68b36362 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1927,7 +1927,7 @@ static int btrfs_calc_avail_data_space(struct 
btrfs_fs_info *fs_info,
}
  
  	devices_info = kmalloc_array(nr_devices, sizeof(*devices_info),

-  GFP_NOFS);
+  GFP_KERNEL);
if (!devices_info)
return -ENOMEM;
  


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3 v2.1] btrfs: write_dev_flush does not return ENOMEM anymore

2017-06-15 Thread Anand Jain





Given that we probably don't have great testing
coverage of devices with the flush capabilities,


   Any hints on what type of test coverage ?


Devices used for normal testing, any sort of test basically, that do not
have the flush capabilities will never exercise the in write_dev_flush.


 Right. Same with -o nobarrier as well.


So any reference counts or failures will not get covered.


 hm. Still I didn't understand your point of view, sorry.
 But the contrary is also true that when device cache is enabled,
 the write_dev_flush() gets exercised, so we are fine.

 I toggled the device write cache manually for testing this set.
# echo "write through" > /sys/block/sdd/queue/write_cache
# echo "write back" > /sys/block/sdd/queue/write_cache


Thanks, Anand


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: cleanup duplicate return value in insert_inline_extent

2017-06-15 Thread David Sterba
The pattern when err is used for function exit and ret is used for
return values of callees is not used here.

Signed-off-by: David Sterba 
---
 fs/btrfs/inode.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a21a984d84d9..e837713d0d05 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -177,7 +177,6 @@ static int insert_inline_extent(struct btrfs_trans_handle 
*trans,
char *kaddr;
unsigned long ptr;
struct btrfs_file_extent_item *ei;
-   int err = 0;
int ret;
size_t cur_size = size;
unsigned long offset;
@@ -199,10 +198,8 @@ static int insert_inline_extent(struct btrfs_trans_handle 
*trans,
path->leave_spinning = 1;
ret = btrfs_insert_empty_item(trans, root, path, ,
  datasize);
-   if (ret) {
-   err = ret;
+   if (ret)
goto fail;
-   }
}
leaf = path->nodes[0];
ei = btrfs_item_ptr(leaf, path->slots[0],
@@ -257,9 +254,8 @@ static int insert_inline_extent(struct btrfs_trans_handle 
*trans,
BTRFS_I(inode)->disk_i_size = inode->i_size;
ret = btrfs_update_inode(trans, root, inode);
 
-   return ret;
 fail:
-   return err;
+   return ret;
 }
 
 
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: move fs_info::fs_frozen to the flags

2017-06-15 Thread David Sterba
We can keep the state among the other fs_info flags, there's no reason
why fs_frozen would need to be separate.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.h   | 5 ++---
 fs/btrfs/disk-io.c | 1 -
 fs/btrfs/super.c   | 6 --
 fs/btrfs/transaction.c | 3 ++-
 4 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f0f5f28784b6..6375e57a5a69 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -716,6 +716,8 @@ struct btrfs_delayed_root;
 #define BTRFS_FS_LOG1_ERR  12
 #define BTRFS_FS_LOG2_ERR  13
 #define BTRFS_FS_QUOTA_OVERRIDE14
+/* Used to record internally whether fs has been frozen */
+#define BTRFS_FS_FROZEN15
 
 /*
  * Indicate that a whole-filesystem exclusive operation is running
@@ -1107,9 +1109,6 @@ struct btrfs_fs_info {
 */
struct list_head pinned_chunks;
 
-   /* Used to record internally whether fs has been frozen */
-   int fs_frozen;
-
/* Cached block sizes */
u32 nodesize;
u32 sectorsize;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2b00ebff13f8..2ac0a35f4450 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2661,7 +2661,6 @@ int open_ctree(struct super_block *sb,
atomic_set(_info->qgroup_op_seq, 0);
atomic_set(_info->reada_works_cnt, 0);
atomic64_set(_info->tree_mod_seq, 0);
-   fs_info->fs_frozen = 0;
fs_info->sb = sb;
fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
fs_info->metadata_ratio = 0;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 3371213924bd..1138301f0cab 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2242,7 +2242,7 @@ static int btrfs_freeze(struct super_block *sb)
struct btrfs_fs_info *fs_info = btrfs_sb(sb);
struct btrfs_root *root = fs_info->tree_root;
 
-   fs_info->fs_frozen = 1;
+   set_bit(BTRFS_FS_FROZEN, _info->flags);
/*
 * We don't need a barrier here, we'll wait for any transaction that
 * could be in progress on other threads (and do delayed iputs that
@@ -2261,7 +2261,9 @@ static int btrfs_freeze(struct super_block *sb)
 
 static int btrfs_unfreeze(struct super_block *sb)
 {
-   btrfs_sb(sb)->fs_frozen = 0;
+   struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+
+   clear_bit(BTRFS_FS_FROZEN, _info->flags);
return 0;
 }
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index ab030fb22530..97e33513b195 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2314,7 +2314,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
 * it'll result in deadlock about SB_FREEZE_FS.
 */
if (current != fs_info->transaction_kthread &&
-   current != fs_info->cleaner_kthread && !fs_info->fs_frozen)
+   current != fs_info->cleaner_kthread &&
+   !test_bit(BTRFS_FS_FROZEN, _info->flags))
btrfs_run_delayed_iputs(fs_info);
 
return ret;
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/5] btrfs: sysfs: export the force_dev_flush flag

2017-06-15 Thread David Sterba
Add per-filesystem tunable to switch on/off the device barriers,
regardless of the actual device support. This is a debugging feature.

The path to the sysfs file is /sys/fs/btrfs/UUID/force_dev_flush,
allowed values are 0 and 1. Default is 0.

Signed-off-by: David Sterba 
---
 fs/btrfs/sysfs.c | 47 +++
 1 file changed, 47 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index c2d5f3580b4c..197d43911936 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -487,12 +487,59 @@ static ssize_t quota_override_store(struct kobject *kobj,
 
 BTRFS_ATTR_RW(quota_override, quota_override_show, quota_override_store);
 
+static ssize_t force_dev_flush_show(struct kobject *kobj,
+  struct kobj_attribute *a, char *buf)
+{
+   struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+   int force;
+
+   if (!fs_info)
+   return -EPERM;
+
+   force = !!(test_bit(BTRFS_FS_FORCE_DEV_FLUSH, _info->flags));
+   return snprintf(buf, PAGE_SIZE, "%d\n", force);
+}
+
+static ssize_t force_dev_flush_store(struct kobject *kobj,
+   struct kobj_attribute *a,
+   const char *buf, size_t len)
+{
+   struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+   unsigned long force;
+   int err;
+
+   if (!fs_info)
+   return -EPERM;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
+   err = kstrtoul(buf, 10, );
+   if (err)
+   return err;
+   if (force > 1)
+   return -EINVAL;
+
+   if (force) {
+   set_bit(BTRFS_FS_FORCE_DEV_FLUSH, _info->flags);
+   btrfs_info(fs_info, "Forced device flushes enabled");
+   } else {
+   clear_bit(BTRFS_FS_FORCE_DEV_FLUSH, _info->flags);
+   btrfs_info(fs_info, "Forced device flushes disabled");
+   }
+
+   return len;
+}
+
+BTRFS_ATTR_RW(force_dev_flush, force_dev_flush_show, force_dev_flush_store);
+
 static const struct attribute *btrfs_attrs[] = {
BTRFS_ATTR_PTR(label),
BTRFS_ATTR_PTR(nodesize),
BTRFS_ATTR_PTR(sectorsize),
BTRFS_ATTR_PTR(clone_alignment),
BTRFS_ATTR_PTR(quota_override),
+   BTRFS_ATTR_PTR(force_dev_flush),
NULL,
 };
 
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/5] btrfs: add fs flag to force device flushing

2017-06-15 Thread David Sterba
We need a device capable of device barriers so we can test the flush
code. To aid testing, add a per-filesystem status flag that affects the
barriers regerdless of the device capabilities and obviously does not
give the same guarantees.

It's off by default, sysfs tunable will follow.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.h   | 1 +
 fs/btrfs/disk-io.c | 8 +---
 fs/btrfs/volumes.h | 1 +
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f0f5f28784b6..dcf4404f7d61 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -716,6 +716,7 @@ struct btrfs_delayed_root;
 #define BTRFS_FS_LOG1_ERR  12
 #define BTRFS_FS_LOG2_ERR  13
 #define BTRFS_FS_QUOTA_OVERRIDE14
+#define BTRFS_FS_FORCE_DEV_FLUSH   15
 
 /*
  * Indicate that a whole-filesystem exclusive operation is running
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 59a732a13370..659a3b4645d2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3494,7 +3494,8 @@ static void write_dev_flush(struct btrfs_device *device)
struct request_queue *q = bdev_get_queue(device->bdev);
struct bio *bio = device->flush_bio;
 
-   if (!test_bit(QUEUE_FLAG_WC, >queue_flags))
+   if (!test_bit(BTRFS_FS_FORCE_DEV_FLUSH, >fs_info->flags)
+   && !test_bit(QUEUE_FLAG_WC, >queue_flags))
return;
 
bio_reset(bio);
@@ -3505,6 +3506,7 @@ static void write_dev_flush(struct btrfs_device *device)
bio->bi_private = >flush_wait;
 
submit_bio(bio);
+   device->flush_bio_sent = 1;
 }
 
 /*
@@ -3512,12 +3514,12 @@ static void write_dev_flush(struct btrfs_device *device)
  */
 static int wait_dev_flush(struct btrfs_device *device)
 {
-   struct request_queue *q = bdev_get_queue(device->bdev);
struct bio *bio = device->flush_bio;
 
-   if (!test_bit(QUEUE_FLAG_WC, >queue_flags))
+   if (!device->flush_bio_sent)
return 0;
 
+   device->flush_bio_sent = 0;
wait_for_completion_io(>flush_wait);
 
return bio->bi_error;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 35327efecdbb..6f45fd60d15a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -75,6 +75,7 @@ struct btrfs_device {
int can_discard;
int is_tgtdev_for_dev_replace;
int last_flush_error;
+   int flush_bio_sent;
 
 #ifdef __BTRFS_NEED_DEVICE_DATA_ORDERED
seqcount_t data_seqcount;
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/5] btrfs: account as waiting for IO, while waiting fot the flush bio completion

2017-06-15 Thread David Sterba
Similar to what submit_bio_wait does, we should account for IO while
waiting for a bio completion. This has marginal visible effects, flush
bio is short-lived.

Signed-off-by: David Sterba 
---
 fs/btrfs/disk-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 27d44d6ab775..05ff81ecb887 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3519,7 +3519,7 @@ static int wait_dev_flush(struct btrfs_device *device)
if (!test_bit(QUEUE_FLAG_WC, >queue_flags))
return 0;
 
-   wait_for_completion(>flush_wait);
+   wait_for_completion_io(>flush_wait);
 
if (bio->bi_error) {
ret = bio->bi_error;
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/5] btrfs: move dev stats accounting out of wait_dev_flush

2017-06-15 Thread David Sterba
We should really just wait in wait_dev_flush and let the caller decide
what to do with the error value.

Signed-off-by: David Sterba 
---
 fs/btrfs/disk-io.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 05ff81ecb887..59a732a13370 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3512,7 +3512,6 @@ static void write_dev_flush(struct btrfs_device *device)
  */
 static int wait_dev_flush(struct btrfs_device *device)
 {
-   int ret = 0;
struct request_queue *q = bdev_get_queue(device->bdev);
struct bio *bio = device->flush_bio;
 
@@ -3521,13 +3520,7 @@ static int wait_dev_flush(struct btrfs_device *device)
 
wait_for_completion_io(>flush_wait);
 
-   if (bio->bi_error) {
-   ret = bio->bi_error;
-   btrfs_dev_stat_inc_and_print(device,
-   BTRFS_DEV_STAT_FLUSH_ERRS);
-   }
-
-   return ret;
+   return bio->bi_error;
 }
 
 static int check_barrier_error(struct btrfs_fs_devices *fsdevs)
@@ -3586,6 +3579,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
ret = wait_dev_flush(dev);
if (ret) {
dev->last_flush_error = ret;
+   btrfs_dev_stat_inc_and_print(dev,
+   BTRFS_DEV_STAT_FLUSH_ERRS);
errors_wait++;
}
}
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/5] Preallocate flush bio, sysfs tunable

2017-06-15 Thread David Sterba
This patchset follows the updates in the write_dev_flush function. The flush
bio can be preallocated at the device creation time, so we avoid repeated
alloc/free.

Next, there's a new sysfs tunable to enable forced dev flushes for devices that
do not support the barriers. This helps to test the new code but is not meant
for any non-debugging use.

I've tested lightly with some workloads and toggled the sysfs knob during that,
all fine.

David Sterba (5):
  btrfs: preallocate device flush bio
  btrfs: account as waiting for IO, while waiting fot the flush bio
completion
  btrfs: move dev stats accounting out of wait_dev_flush
  btrfs: add fs flag to force device flushing
  btrfs: sysfs: export the force_dev_flush flag

 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/disk-io.c | 41 +
 fs/btrfs/sysfs.c   | 47 +++
 fs/btrfs/volumes.c | 12 
 fs/btrfs/volumes.h |  1 +
 5 files changed, 74 insertions(+), 28 deletions(-)

-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/5] btrfs: preallocate device flush bio

2017-06-15 Thread David Sterba
For devices that support flushing, we allocate a bio, submit, wait for
it and then free it. The bio allocation does not fail so ENOMEM is not a
problem but we still may unnecessarily stress the allocation subsystem.

Instead, we can allocate the device at the same time we allocate the
device and reuse it each time we need to flush the barriers. The bio is
reset before each use. Reference counting is simplified to just device
allocation (get) and freeing (put).

Note for write_dev_flush: we check the queue flush status again as we
can't use the existence of bio as before.

Signed-off-by: David Sterba 
---
 fs/btrfs/disk-io.c | 24 ++--
 fs/btrfs/volumes.c | 12 
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2b00ebff13f8..27d44d6ab775 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3482,9 +3482,7 @@ static int write_dev_supers(struct btrfs_device *device,
  */
 static void btrfs_end_empty_barrier(struct bio *bio)
 {
-   if (bio->bi_private)
-   complete(bio->bi_private);
-   bio_put(bio);
+   complete(bio->bi_private);
 }
 
 /*
@@ -3494,26 +3492,19 @@ static void btrfs_end_empty_barrier(struct bio *bio)
 static void write_dev_flush(struct btrfs_device *device)
 {
struct request_queue *q = bdev_get_queue(device->bdev);
-   struct bio *bio;
+   struct bio *bio = device->flush_bio;
 
if (!test_bit(QUEUE_FLAG_WC, >queue_flags))
return;
 
-   /*
-* one reference for us, and we leave it for the
-* caller
-*/
-   device->flush_bio = NULL;
-   bio = btrfs_io_bio_alloc(0);
+   bio_reset(bio);
bio->bi_end_io = btrfs_end_empty_barrier;
bio->bi_bdev = device->bdev;
bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH;
init_completion(>flush_wait);
bio->bi_private = >flush_wait;
-   device->flush_bio = bio;
 
-   bio_get(bio);
-   btrfsic_submit_bio(bio);
+   submit_bio(bio);
 }
 
 /*
@@ -3522,9 +3513,10 @@ static void write_dev_flush(struct btrfs_device *device)
 static int wait_dev_flush(struct btrfs_device *device)
 {
int ret = 0;
+   struct request_queue *q = bdev_get_queue(device->bdev);
struct bio *bio = device->flush_bio;
 
-   if (!bio)
+   if (!test_bit(QUEUE_FLAG_WC, >queue_flags))
return 0;
 
wait_for_completion(>flush_wait);
@@ -3535,10 +3527,6 @@ static int wait_dev_flush(struct btrfs_device *device)
BTRFS_DEV_STAT_FLUSH_ERRS);
}
 
-   /* drop the reference from the wait == 0 run */
-   bio_put(bio);
-   device->flush_bio = NULL;
-
return ret;
 }
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8bb1f4e5905a..251ae81e4363 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -242,6 +242,17 @@ static struct btrfs_device *__alloc_device(void)
if (!dev)
return ERR_PTR(-ENOMEM);
 
+   /*
+* Preallocate a bio that's always going to be used for flushing device
+* barriers and matches the device lifespan
+*/
+   dev->flush_bio = bio_alloc_bioset(GFP_KERNEL, 0, NULL);
+   if (!dev->flush_bio) {
+   kfree(dev);
+   return ERR_PTR(-ENOMEM);
+   }
+   bio_get(dev->flush_bio);
+
INIT_LIST_HEAD(>dev_list);
INIT_LIST_HEAD(>dev_alloc_list);
INIT_LIST_HEAD(>resized_list);
@@ -838,6 +849,7 @@ static void __free_device(struct work_struct *work)
 
device = container_of(work, struct btrfs_device, rcu_work);
rcu_string_free(device->name);
+   bio_put(device->flush_bio);
kfree(device);
 }
 
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3 v2.1] btrfs: write_dev_flush does not return ENOMEM anymore

2017-06-15 Thread David Sterba
On Wed, Jun 14, 2017 at 05:01:11PM +0800, Anand Jain wrote:
> > I have a prototype for the
> > preallocated flush bio but was waiting until these cleanups are in
> > before snding it.
> 
>   oh. sorry for the delay if any. -ENOMEM was only purpose I had/have
>   in mind for peralloc, but looks like there is another purpose which
>   I am not aware of.

The bio preallocation is a slight optimization, we don't have to
allocate and free the bio all the time.

> > I've noticed that threre are two more "if (!device->bdev)" checks under
> > the device lock in write_all_supers, that might be worth removing.
> 
> > However, a NULL bdev and device->missing are related and I think there
> > are some dark corners in dev replace where the invariant can be
> > temporarily broken.
> 
>   Ok. Thanks for the hints. Trying to dig.
> 
> > Given that we probably don't have great testing
> > coverage of devices with the flush capabilities,
> 
>   Any hints on what type of test coverage ?

Devices used for normal testing, any sort of test basically, that do not
have the flush capabilities will never exercise the in write_dev_flush.
So any reference counts or failures will not get covered.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] Ensure size values are rounded down

2017-06-15 Thread David Sterba
On Thu, Jun 15, 2017 at 03:52:33PM +0300, Nikolay Borisov wrote:
> We got an internal report about a file system not wanting to mount 
> following 99e3ecfcb9f4 ("Btrfs: add more validation checks for superblock"). 
> 
> BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with
> fs_devices total_rw_bytes 1000203820544
> 
> Substracting the numbers we get a difference of less than a 4kb. Upon closer 
> inspection it became apparent that mkfs actually rounds down the size of the 
> device to a multiple of sector size. However, the same cannot be said for 
> various functions which modify the total size and are called from 
> btrfs_balanace
> as well as when adding a new device. 
> 
> The first patch in the series re-implements the btrfs_device_total_bytes
> getter/setters manually so that we can add a WARN_ON to catch future 
> offenders. 
> I haven't implemented all 4 function that the macros do since we only ever use
> the getter and setter. 
> 
> The second patch ensures that all the values passed to the setter are rounded 
> down to a multiple of sectorsize. I also have an xfstest patch which passes
> after this series is applied. 
> 
> 
> Nikolay Borisov (2):
>   btrfs: Manually implement device_total_bytes getter/setter
>   btrfs: Round up values which are written for total_bytes_size

The fixes look good, but I have a comment about the changelogs. The
first patch just extends the helpers, that's a minor update and should
not contain description of the bug, as it's not the real fix. OTOH, the
2nd patch is the real fix and does not contain any bug description at
all (because it was in another patch and cover letter).

Why do I care about that? Say I'm reading the code and find that some of
the superblock values are rounded down and want to find out why. I look
up the patch in history and land in ... a patch that has no changelog or
even pointers where to look for more information. I'll naturally try to
find a patch that's close in the history but could also fail to find the
right one.

The first patch should be just expansion of the SETGET macro, without
any changes, as the added WARN_ON changes the semantics a bit.

The first three paragraphs of the cover letter hold the information that
should go to the second patch because that actually fixes the problem.
Now it's also good time to extend the setter/getter with the checks.

Also please fix the over >80 lines.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/9] btrfs: check name_len before read name

2017-06-15 Thread David Sterba
On Tue, Jun 06, 2017 at 05:56:59PM +0800, Su Yue wrote:
> When reading out name from inode_ref, dir_item, it's possible that
> corrupted name_len leads to read beyond boundary.
> Since there are already patches for btrfs-progs, this patchset is
> for btrfs.
> 
> Introduce 'btrfs_is_name_len_valid' to make check name_len with
> item boundary.
> If read name from dir_item, use 'verify_dir_item' to do more strict
> check. Otherwise, use 'btrfs_is_name_len_valid'.
> 
> It's unnessary to do check before every read/memcmp_extent_buffer name.
> Checking name_len when read name for the first time in the call graph is
> enough.
> 
> Changlog:
> v2:
>   1.Change 'btrfs_check_namelen' to 'btrfs_is_namelen_valid'.
>   2.Split patches according call graph.
> v3:
>   1.Add cases about BTRFS_ROOT_REF_KEY and BTRFS_ROOT_BACKREF_KEY.
>   2.Add more comments about how/where extent_buffer is to be read
>   for the first time.
>   3.Change 'namelen' to 'name_len' in function and changelog.

I still see some fstests fail, eg. btrfs/048

trfs/048 30s ...   [18:05:48] [18:06:18] - output mismatch (see 
/root/test/mmtests/work/testdisk/sources/xfstests-git-installed/results//btrfs/048.out.bad)
--- tests/btrfs/048.out 2017-04-24 16:56:02.268104570 +0200
+++ 
/root/test/mmtests/work/testdisk/sources/xfstests-git-installed/results//btrfs/048.out.bad
  2017-06-15 18:06:18.996635908 +0200
@@ -30,22 +30,13 @@
 ERROR: failed to set compression for SCRATCH_MNT/testdir/file1: Invalid 
argument
 ***
 compression=lzo
-compression=lzo
-compression=zlib
 ***
-compression=lzo
...

in syslog with several lines

[ 2156.970285] BTRFS critical (device sdb6): invalid dir item name len: 17

and

[ 2157.220219] BTRFS error (device sdb6): error loading props for ino 259 (root 
5): -5

similarly btrfs/059 fails. Have you seen such errors during your testing?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Use btrfs_space_info_used instead of opencoding it

2017-06-15 Thread David Sterba
On Wed, Jun 14, 2017 at 11:35:34AM +0300, Nikolay Borisov wrote:
> Signed-off-by: Nikolay Borisov 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Fwd: Questions about how BtrFS works.

2017-06-15 Thread Hy Che
Hi,

On Wed, Jun 14, 2017 at 8:46 AM, Qu Wenruo  wrote:
> That's why I recommend to start with btrfs on-disk data, which is static and
> you don't ever need to read much code.
> And we have more or less good enough doc for it:
> https://btrfs.wiki.kernel.org/index.php/Btree_Items
>
> Furthermore, AFAIK btrfs has the best tool to show how btrfs metadata and
> data is located on disk.
> (much better than Xfs and ext tools, and you can make it better easily)
>
> Not only which space is used (if you understand extent tree) but also what's
> inside each btrfs tree block.
yes, btrfs-show-super and btrfs-debug-tree help me most while learning
about btrfs.

> So my recommended study plan is:
> 1) Understand btrfs on-disk data
> 1.1) chunk tree and dev-extent tree
>  The very basic btrfs logical <-> device address mapping.
>  As almost all btrfs address space is logical space, without knowing
>  how to map it device, you can't go further
> 1.2) fs and subvolume tree
>  Understand how btrfs arrange its files and dirs.
> 1.3) root tree
>  Understand how btrfs arrange its subvolumes and other trees
> 1.4) extent tree
>  One of the most complicated tree, and quite a lot of items are not
>  easy to produce.
> 1.5) other trees
>  Not as common as above essential trees.
>
> 2) Try doing contribution to btrfs-progs
>Just plain C codes without too much new facilities, and is a quite
>small subset of kernel code.
>It's small and (more or less) easy to read, and is mostly focused on
>btrfs tree operations (for offline tools like fsck)
>
> 3) Understanding kernel code
>That's quite a hard work, not only you need to understand some new
>stuff bond to fs, like page cache, kernel memory management, block
>layer API.
>It will take you a long long time to just understand btrfs part.
>But with a solid understand of btrfs btree operation, you could start
>by checking how btrfs kernel modules manipulate its btree.
>
>
> You should first understand some basic info despite above btrfs ondisk data,
> like btrfs_path, btrfs_root and extent_buffer.
> They are the basic elements to manipulate btrfs btree.
>
> Then btrfs_search_slot() in *btrfs-progs* is your best starting point.
> The reason why you should start from btrfs-progs is:
> 1) It doesn't need to care about extra functions in kernel
>A lot of on-line function like balance or scrub can affect btrfs
>btree operation.
>While in btrfs-progs we don't need to worry about that.
>
> 2) No need to worry about lock
>
> 3) Number of lines
>And size-wise, ctree.c in btrfs-progs is less than 3000 lines while
>in kernel it's near 6000 lines.
>
> So I recommend you to start from btrfs_search_slot() with cow=0 and
> ins_len=0 case.
> Then with cow=1 and ins_len=0 case.
> Finally with cow=1 ins_len=1 case.
>
> With that, you would have a basic idea how btrfs btree is manipulated, other
> related functions will be quite easy to understand, like
> btrfs_insert_empty_items().
Thanks for the guidelines, I will start reading btrfs_progs code first
and it is very easy to read as you said :)

>>
>>>
>>>

 4. How BtrFS handle transactions ?
 Correctly me if I'm wrong, the transaction collect all requests in 30
 seconds and then write back to disk. The transid increments when new
 request appeared and genid is asigned to this one.
>>>
>>>
>>> I don't think there is anything written per-se. You'd again have to
>>> resort to reading the
>>> code
>>
>>
>> I need a rough idea before reading code because it would be taking lots of
>> time.
>
>
> Indeed, transaction in btrfs is without much explain.
>
> But digging in btrfs-progs would provide you an overall view of it, but the
> behavior is still quite different from kernel.
> (BTW, 30 sec is just the commit interval which can be tuned by mount option)
>
> I'm not completely familiar with btrfs transaction, so I can be wrong and
> any comment is welcomed.
> Below is my understanding:
>
>
> A transaction is the time window in which we could modify btrfs metadata
> (tree blocks).
>
> Each transaction (not trans handler, as we can share one transaction with
> different trans handler) will increase the generation, all modified metadata
> inside the same transaction will have the same generation.
>
> And after a transaction is committed, all the on-disk tree blocks should be
> in a consistent stat.
>
> The life cycle of a transaction would be:
> \|/
> btrfs_commit_transaction()  ---<- previous trans is committed
>   and finished
>
>gen: X
> btrfs_start_transaction()   ---<- new transaction is started
> |- get trans handler A  /|\   as no running trans
> |- modify some tree blocks   |
>  |
> btrfs_start_transaction()| <- Another progress start a trans
> |- get trans handler B   |

[PATCH] btrfs: use GFP_KERNEL in btrfs_init_dev_replace_tgtdev

2017-06-15 Thread David Sterba
The function is called from ioctl context and we don't hold any locks
that take part in writeback. Right now it's only fs_info::volume_mutex.

Signed-off-by: David Sterba 
---
 fs/btrfs/dev-replace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index c6cd1d599c36..f7df85509125 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -242,7 +242,7 @@ static int btrfs_init_dev_replace_tgtdev(struct 
btrfs_fs_info *fs_info,
goto error;
}
 
-   name = rcu_string_strdup(device_path, GFP_NOFS);
+   name = rcu_string_strdup(device_path, GFP_KERNEL);
if (!name) {
kfree(device);
ret = -ENOMEM;
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 12/20] fs: add a new fstype flag to indicate how writeback errors are tracked

2017-06-15 Thread Jeff Layton
On Thu, 2017-06-15 at 07:57 -0700, Christoph Hellwig wrote:
> On Thu, Jun 15, 2017 at 06:42:12AM -0400, Jeff Layton wrote:
> > Correct.
> > 
> > But if there is a data writeback error, should we report an error on all
> > open fds at that time (like we will for fsync)?
> 
> We should in theory, but I don't see how to properly do it.  In addition
> sync_file_range just can't be used for data integrity to start with, so
> I don't think it's worth it.  At some point we should add a proper
> fsync_range syscall, though.

filemap_report_wb_err will always return 0 if the inode never has
mapping_set_error called on it. So, I think we should be able to do it
there once we get all of the fs' converted over. That'll have to happen
at the end of the series however.

-- 
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 12/20] fs: add a new fstype flag to indicate how writeback errors are tracked

2017-06-15 Thread Christoph Hellwig
On Thu, Jun 15, 2017 at 06:42:12AM -0400, Jeff Layton wrote:
> Correct.
> 
> But if there is a data writeback error, should we report an error on all
> open fds at that time (like we will for fsync)?

We should in theory, but I don't see how to properly do it.  In addition
sync_file_range just can't be used for data integrity to start with, so
I don't think it's worth it.  At some point we should add a proper
fsync_range syscall, though.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/10 v11] No wait AIO

2017-06-15 Thread Christoph Hellwig
On Thu, Jun 15, 2017 at 12:11:58PM +0100, Al Viro wrote:
> Which flags are you talking about?  aio ones?  AFAICS, it's the same
> kind of thing as "can we lseek?" or "can we read/pread?", etc.
> What would that field look like?  Note that some of those might depend
> upon the flags passed to open(), so shoving them into file_operations
> might mean splitting file_operations instances...  Details, please.

Yeah, I'll take my suggestion back.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: obsolete and remove mount option alloc_start

2017-06-15 Thread David Sterba
The mount option alloc_start was used in the past for debugging and
stressing the chunk allocator. Not meant to be used by users, so we're
not breaking anybody's setup.

There was some added complexity handling changes of the value and when
it was not same as default. Such code has likely been untested and I
think it's better to remove it.

This patch kills all use of alloc_start, and by doing that also fixes
a bug when alloc_size is set, potentially called from statfs:

in btrfs_calc_avail_data_space, traversing the list in RCU, the RCU
protection is temporarily dropped so btrfs_account_dev_extents_size can
be called and then RCU is locked again! Doing that inside
list_for_each_entry_rcu is just asking for trouble, but unlikely to be
observed in practice.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.h   | 12 +--
 fs/btrfs/super.c   | 61 +-
 fs/btrfs/volumes.c |  4 +---
 3 files changed, 7 insertions(+), 70 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f0f5f28784b6..55180c5f746e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -797,17 +797,7 @@ struct btrfs_fs_info {
 * so it is also safe.
 */
u64 max_inline;
-   /*
-* Protected by ->chunk_mutex and sb->s_umount.
-*
-* The reason that we use two lock to protect it is because only
-* remount and mount operations can change it and these two operations
-* are under sb->s_umount, but the read side (chunk allocation) can not
-* acquire sb->s_umount or the deadlock would happen. So we use two
-* locks to protect it. On the write side, we must acquire two locks,
-* and on the read side, we just need acquire one of them.
-*/
-   u64 alloc_start;
+
struct btrfs_transaction *running_transaction;
wait_queue_head_t transaction_throttle;
wait_queue_head_t transaction_wait;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 3371213924bd..eb355dc5afc5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -601,18 +601,8 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
*options,
}
break;
case Opt_alloc_start:
-   num = match_strdup([0]);
-   if (num) {
-   mutex_lock(>chunk_mutex);
-   info->alloc_start = memparse(num, NULL);
-   mutex_unlock(>chunk_mutex);
-   kfree(num);
-   btrfs_info(info, "allocations start at %llu",
-  info->alloc_start);
-   } else {
-   ret = -ENOMEM;
-   goto out;
-   }
+   btrfs_info(info,
+   "option alloc_start is obsolete, ignored");
break;
case Opt_acl:
 #ifdef CONFIG_BTRFS_FS_POSIX_ACL
@@ -1232,8 +1222,6 @@ static int btrfs_show_options(struct seq_file *seq, 
struct dentry *dentry)
seq_puts(seq, ",nobarrier");
if (info->max_inline != BTRFS_DEFAULT_MAX_INLINE)
seq_printf(seq, ",max_inline=%llu", info->max_inline);
-   if (info->alloc_start != 0)
-   seq_printf(seq, ",alloc_start=%llu", info->alloc_start);
if (info->thread_pool_size !=  min_t(unsigned long,
 num_online_cpus() + 2, 8))
seq_printf(seq, ",thread_pool=%d", info->thread_pool_size);
@@ -1716,7 +1704,6 @@ static int btrfs_remount(struct super_block *sb, int 
*flags, char *data)
unsigned long old_opts = fs_info->mount_opt;
unsigned long old_compress_type = fs_info->compress_type;
u64 old_max_inline = fs_info->max_inline;
-   u64 old_alloc_start = fs_info->alloc_start;
int old_thread_pool_size = fs_info->thread_pool_size;
unsigned int old_metadata_ratio = fs_info->metadata_ratio;
int ret;
@@ -1855,9 +1842,6 @@ static int btrfs_remount(struct super_block *sb, int 
*flags, char *data)
fs_info->mount_opt = old_opts;
fs_info->compress_type = old_compress_type;
fs_info->max_inline = old_max_inline;
-   mutex_lock(_info->chunk_mutex);
-   fs_info->alloc_start = old_alloc_start;
-   mutex_unlock(_info->chunk_mutex);
btrfs_resize_thread_pool(fs_info,
old_thread_pool_size, fs_info->thread_pool_size);
fs_info->metadata_ratio = old_metadata_ratio;
@@ -1904,11 +1888,9 @@ static int btrfs_calc_avail_data_space(struct 
btrfs_fs_info *fs_info,
u64 skip_space;
u64 type;
u64 avail_space;
-   u64 used_space;
u64 min_stripe_size;
int min_stripes = 1, num_stripes = 1;
int i = 0, nr_devices;
-   

[PATCH] btrfs: use GFP_KERNEL in btrfs_calc_avail_data_space

2017-06-15 Thread David Sterba
We don't hold any locks here. Inidirectly called from statfs.

Signed-off-by: David Sterba 
---
 fs/btrfs/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4f1cdd5058f1..6f2e68b36362 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1927,7 +1927,7 @@ static int btrfs_calc_avail_data_space(struct 
btrfs_fs_info *fs_info,
}
 
devices_info = kmalloc_array(nr_devices, sizeof(*devices_info),
-  GFP_NOFS);
+  GFP_KERNEL);
if (!devices_info)
return -ENOMEM;
 
-- 
2.13.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] btrfs: Manually implement device_total_bytes getter/setter

2017-06-15 Thread Nikolay Borisov
The device->total_bytes member needs to always be rounded down to sectorsize
so that it corresponds to the value of super->total_bytes. However, there are
multiple places where the setter is fed a value which is not rounded which
can cause a fs to be unmountable due to the check introduced in
99e3ecfcb9f4 ("Btrfs: add more validation checks for superblock")

This patch adds the necessary WARN_ON in the setter to catch future offenders.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/ctree.h | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0334452a7be1..c64b16a2b28d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1552,8 +1552,25 @@ static inline void btrfs_set_##name(type *s, u##bits 
val)\
s->member = cpu_to_le##bits(val);   \
 }
 
+
+static inline u64 btrfs_device_total_bytes(struct extent_buffer *eb,
+  struct btrfs_dev_item *s)
+{
+   BUILD_BUG_ON(sizeof(u64) !=
+sizeof(((struct btrfs_dev_item *)0))->total_bytes);
+   return btrfs_get_64(eb, s, offsetof(struct btrfs_dev_item, 
total_bytes));
+}
+static inline void btrfs_set_device_total_bytes(struct extent_buffer *eb,
+   struct btrfs_dev_item *s,
+   u64 val)
+{
+   BUILD_BUG_ON(sizeof(u64) != sizeof(((struct btrfs_dev_item 
*)0))->total_bytes);
+   WARN_ON(!IS_ALIGNED(val, eb->fs_info->sectorsize));
+   btrfs_set_64(eb, s, offsetof(struct btrfs_dev_item, total_bytes), val);
+}
+
+
 BTRFS_SETGET_FUNCS(device_type, struct btrfs_dev_item, type, 64);
-BTRFS_SETGET_FUNCS(device_total_bytes, struct btrfs_dev_item, total_bytes, 64);
 BTRFS_SETGET_FUNCS(device_bytes_used, struct btrfs_dev_item, bytes_used, 64);
 BTRFS_SETGET_FUNCS(device_io_align, struct btrfs_dev_item, io_align, 32);
 BTRFS_SETGET_FUNCS(device_io_width, struct btrfs_dev_item, io_width, 32);
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs: Round up values which are written for total_bytes_size

2017-06-15 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/volumes.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e37f95976443..adf32f46a73f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2387,7 +2387,8 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *device_path
device->io_width = fs_info->sectorsize;
device->io_align = fs_info->sectorsize;
device->sector_size = fs_info->sectorsize;
-   device->total_bytes = i_size_read(bdev->bd_inode);
+   device->total_bytes = round_down(i_size_read(bdev->bd_inode),
+fs_info->sectorsize);
device->disk_total_bytes = device->total_bytes;
device->commit_total_bytes = device->total_bytes;
device->fs_info = fs_info;
@@ -2424,7 +2425,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, 
const char *device_path
 
tmp = btrfs_super_total_bytes(fs_info->super_copy);
btrfs_set_super_total_bytes(fs_info->super_copy,
-   tmp + device->total_bytes);
+   round_down(tmp + device->total_bytes, fs_info->sectorsize));
 
tmp = btrfs_super_num_devices(fs_info->super_copy);
btrfs_set_super_num_devices(fs_info->super_copy, tmp + 1);
@@ -2687,6 +2688,8 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
if (!device->writeable)
return -EACCES;
 
+   new_size = round_down(new_size, fs_info->sectorsize);
+
mutex_lock(_info->chunk_mutex);
old_total = btrfs_super_total_bytes(super_copy);
diff = new_size - device->total_bytes;
@@ -2699,7 +2702,8 @@ int btrfs_grow_device(struct btrfs_trans_handle *trans,
 
fs_devices = fs_info->fs_devices;
 
-   btrfs_set_super_total_bytes(super_copy, old_total + diff);
+   btrfs_set_super_total_bytes(super_copy,
+   round_down(old_total + diff, fs_info->sectorsize));
device->fs_devices->total_rw_bytes += diff;
 
btrfs_device_set_total_bytes(device, new_size);
@@ -4389,7 +4393,10 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 
new_size)
struct btrfs_super_block *super_copy = fs_info->super_copy;
u64 old_total = btrfs_super_total_bytes(super_copy);
u64 old_size = btrfs_device_get_total_bytes(device);
-   u64 diff = old_size - new_size;
+   u64 diff;
+
+   new_size = round_down(new_size, fs_info->sectorsize);
+   diff = old_size - new_size;
 
if (device->is_tgtdev_for_dev_replace)
return -EINVAL;
@@ -4516,7 +4523,8 @@ int btrfs_shrink_device(struct btrfs_device *device, u64 
new_size)
  _info->fs_devices->resized_devices);
 
WARN_ON(diff > old_total);
-   btrfs_set_super_total_bytes(super_copy, old_total - diff);
+   btrfs_set_super_total_bytes(super_copy,
+   round_down(old_total - diff, fs_info->sectorsize));
mutex_unlock(_info->chunk_mutex);
 
/* Now btrfs_update_device() will change the on-disk size. */
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] Ensure size values are rounded down

2017-06-15 Thread Nikolay Borisov
We got an internal report about a file system not wanting to mount 
following 99e3ecfcb9f4 ("Btrfs: add more validation checks for superblock"). 

BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with
fs_devices total_rw_bytes 1000203820544

Substracting the numbers we get a difference of less than a 4kb. Upon closer 
inspection it became apparent that mkfs actually rounds down the size of the 
device to a multiple of sector size. However, the same cannot be said for 
various functions which modify the total size and are called from btrfs_balanace
as well as when adding a new device. 

The first patch in the series re-implements the btrfs_device_total_bytes
getter/setters manually so that we can add a WARN_ON to catch future offenders. 
I haven't implemented all 4 function that the macros do since we only ever use
the getter and setter. 

The second patch ensures that all the values passed to the setter are rounded 
down to a multiple of sectorsize. I also have an xfstest patch which passes
after this series is applied. 


Nikolay Borisov (2):
  btrfs: Manually implement device_total_bytes getter/setter
  btrfs: Round up values which are written for total_bytes_size

 fs/btrfs/ctree.h   | 19 ++-
 fs/btrfs/volumes.c | 18 +-
 2 files changed, 31 insertions(+), 6 deletions(-)

-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/10 v11] No wait AIO

2017-06-15 Thread Al Viro
On Mon, Jun 12, 2017 at 11:14:31PM -0700, Christoph Hellwig wrote:
> On Mon, Jun 12, 2017 at 05:38:13PM -0500, Goldwyn Rodrigues wrote:
> > We had FS_NOWAIT in filesystem type flags (in v3), but retracted it
> > later in v4.
> 
> A per-fs flag is wrong as file_operation may have different
> capabilities.
> 
> > I will work on adding FMODE_AIO_NOWAIT in the meantime.
> 
> If Al prefers that let's go with it for now.  One other thing I'd
> love is a supported flags field in struct file_operations that we
> can compare the flags against.  Al, does that sound good as well?

Which flags are you talking about?  aio ones?  AFAICS, it's the same
kind of thing as "can we lseek?" or "can we read/pread?", etc.
What would that field look like?  Note that some of those might depend
upon the flags passed to open(), so shoving them into file_operations
might mean splitting file_operations instances...  Details, please.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: inode.c convert local flags int -> bool

2017-06-15 Thread Nikolay Borisov


On 15.06.2017 13:48, Timofey Titovets wrote:
> Just conversion from int to bool where it applicable
> Only change function local flags to not break external
> function / structures
> 
> Reasons:
> 1. Separate variables like:
>- ret = 1 - ret value with special handling from caller
>- insert = 1 - three-state flag
>- reserve = 1 - just a flag
>- reserved = 0 - just a flag
>- scanned = 0 - counter
> 2. Sync usage of bool for flags with other code
> for decrease blending int/bool
> 
> Signed-off-by: Timofey Titovets 
While I myself am a fan of bool and hate it when people use an int to
signify a boolean variable patches such as this one create needless
churn. If you have a fix, or a major refactoring to contribute and you
can also manage to squeeze in such a fixup then yes, but this - no.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: inode.c convert local flags int -> bool

2017-06-15 Thread Timofey Titovets
Just conversion from int to bool where it applicable
Only change function local flags to not break external
function / structures

Reasons:
1. Separate variables like:
   - ret = 1 - ret value with special handling from caller
   - insert = 1 - three-state flag
   - reserve = 1 - just a flag
   - reserved = 0 - just a flag
   - scanned = 0 - counter
2. Sync usage of bool for flags with other code
for decrease blending int/bool

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/inode.c | 93 +++-
 1 file changed, 45 insertions(+), 48 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 76e8cfa8aca1..f786e22a5a0e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -458,16 +458,16 @@ static noinline void compress_file_range(struct inode 
*inode,
unsigned long total_compressed = 0;
unsigned long total_in = 0;
int i;
-   int will_compress;
+   bool will_compress;
int compress_type = fs_info->compress_type;
-   int redirty = 0;
+   bool redirty = false;

inode_should_defrag(BTRFS_I(inode), start, end, end - start + 1,
SZ_16K);

actual_end = min_t(u64, isize, end + 1);
 again:
-   will_compress = 0;
+   will_compress = false;
nr_pages = (end >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
BUILD_BUG_ON((BTRFS_MAX_COMPRESSED % PAGE_SIZE) != 0);
nr_pages = min_t(unsigned long, nr_pages,
@@ -529,7 +529,7 @@ static noinline void compress_file_range(struct inode 
*inode,
 * dirty again later on.
 */
extent_range_clear_dirty_for_io(inode, start, end);
-   redirty = 1;
+   redirty = true;
ret = btrfs_compress_pages(compress_type,
   inode->i_mapping, start,
   pages,
@@ -552,7 +552,7 @@ static noinline void compress_file_range(struct inode 
*inode,
   PAGE_SIZE - offset);
kunmap_atomic(kaddr);
}
-   will_compress = 1;
+   will_compress = true;
}
}
 cont:
@@ -1279,8 +1279,8 @@ static noinline int run_delalloc_nocow(struct inode 
*inode,
int extent_type;
int ret, err;
int type;
-   int nocow;
-   int check_prev = 1;
+   bool nocow;
+   bool check_prev = true;
bool nolock;
u64 ino = btrfs_ino(BTRFS_I(inode));

@@ -1314,7 +1314,7 @@ static noinline int run_delalloc_nocow(struct inode 
*inode,
found_key.type == BTRFS_EXTENT_DATA_KEY)
path->slots[0]--;
}
-   check_prev = 0;
+   check_prev = false;
 next_slot:
leaf = path->nodes[0];
if (path->slots[0] >= btrfs_header_nritems(leaf)) {
@@ -1326,7 +1326,7 @@ static noinline int run_delalloc_nocow(struct inode 
*inode,
leaf = path->nodes[0];
}

-   nocow = 0;
+   nocow = false;
disk_bytenr = 0;
num_bytes = 0;
btrfs_item_key_to_cpu(leaf, _key, path->slots[0]);
@@ -1407,7 +1407,7 @@ static noinline int run_delalloc_nocow(struct inode 
*inode,
btrfs_end_write_no_snapshoting(root);
goto out_check;
}
-   nocow = 1;
+   nocow = true;
} else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
extent_end = found_key.offset +
btrfs_file_extent_inline_len(leaf,
@@ -3290,7 +3290,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info = btrfs_sb(inode->vfs_inode.i_sb);
struct btrfs_root *root = inode->root;
struct btrfs_block_rsv *block_rsv = NULL;
-   int reserve = 0;
+   bool reserve = false;
int insert = 0;
int ret;

@@ -3328,7 +3328,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans,

if (!test_and_set_bit(BTRFS_INODE_ORPHAN_META_RESERVED,
  >runtime_flags))
-   reserve = 1;
+   reserve = true;
spin_unlock(>orphan_lock);

/* grab metadata reservation from transaction handle */
@@ -3386,18 +3386,18 @@ static int btrfs_orphan_del(struct btrfs_trans_handle 
*trans,
struct btrfs_inode *inode)
 {
struct btrfs_root *root = inode->root;
-   int delete_item = 0;
-   int release_rsv = 0;
+   bool delete_item = false;
+   bool release_rsv = false;
int ret = 0;

spin_lock(>orphan_lock);
if (test_and_clear_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,

Re: [PATCH v6 12/20] fs: add a new fstype flag to indicate how writeback errors are tracked

2017-06-15 Thread Jeff Layton
On Thu, 2017-06-15 at 01:22 -0700, Christoph Hellwig wrote:
> On Wed, Jun 14, 2017 at 01:24:43PM -0400, Jeff Layton wrote:
> > In this smaller set, it's only really used for DAX.
> 
> DAX only is implemented by three filesystems, please just fix them
> up in one go.
> 

Ok.

> > sync_file_range: ->fsync isn't called directly there, and I think we
> > probably want similar semantics to fsync() for it
> 
> sync_file_range is only supposed to sync data, so it should not call
> ->fsync.
> 

Correct.

But if there is a data writeback error, should we report an error on all
open fds at that time (like we will for fsync)?

I think we probably do want to do that, but like you say...there is no
file op for sync_file_range. It'll need some way to figure out what sort
of error tracking is in play.

> > JBD2: will try to re-set the error after clearing it with
> > filemap_fdatawait. That's problematic with the new infrastructure so we
> > need some way to avoid it.
> 
> JBD2 only has two users, please fix them up in one go.

I came up with a fix yesterday that makes the flag unnecessary there.

Thanks,
-- 
Jeff Layton 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Disk usage is more than double the snapshots exclusive data

2017-06-15 Thread Vianney Stroebel
On a backup drive for a home computer, disk usage as shown by 'btrfs fi show' 
is more than double the snapshots exclusive data as shown by "btrfs qgroup 
show" (574 GB vs 265 GB).


I've done a lot of research online and I couldn't find any answer to this 
problem.

Output of some commands:

uname -a
Linux viybel-pc 4.10.0-22-generic #24-Ubuntu SMP Mon May 22 17:43:20 UTC 2017 
x86_64 x86_64 x86_64 GNU/Linux


btrfs --version
btrfs-progs v4.9.1

sudo btrfs fi show
Label: 'btrfs-backup'  uuid: 35905dc5-1400-4687-8be7-cf87d6ad0980
Total devices 1 FS bytes used 573.89GiB
devid1 size 698.64GiB used 697.04GiB path /dev/sdb1

btrfs fi df /mnt/btrfs-backup
Data, single: total=670.01GiB, used=564.26GiB
System, DUP: total=8.00MiB, used=112.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=13.50GiB, used=8.64GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=14.95MiB

From https://github.com/agronick/btrfs-size (i.e. more readable "btrfs qgroup 
show")


btrfs-size /mnt/btrfs-backup
==
Snapshot / Subvolume   ID   Total 
  Exclusive Data

==
10726 gen 216807 top level 5 path full/2016-06-19_12-32-01 10726 
208.68GB 208.68GB
21512 gen 216853 top level 5 path full/2016-12-14_16-21-34 21512 
166.98GB 40.36GB
23054 gen 216853 top level 5 path full/2017-03-03_08-47-00 23054 
154.79GB 7.53GB
25451 gen 216856 top level 5 path full/2017-04-14_21-54-25 25451 
123.48GB 3.07GB
26514 gen 216862 top level 5 path full/2017-05-02_14-58-09 26514 
123.70GB 5.03GB
28499 gen 218095 top level 5 path full/2017-06-11_19-29-16 28499 
154.65GB 169.78MB
28556 gen 218094 top level 5 path full/2017-06-13_03-15-00 28556 
154.88GB 403.89MB

==
Exclusive 
Total: 265.23GB


Feel free to ask me any question.

Vianney


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs: add mount umount logs

2017-06-15 Thread Anand Jain
From: Anand Jain 

By looking at the logs we should be able to know when FS was
mounted and unmounted and the options used, so to help forensic
investigations.

Signed-off-by: Anand Jain 
---
Hi David,

 As you suggested VFS is definitly a better place to have this
 and its patch is in the ML. But I am not too sure what's their
 plan is.  So can we integrate this into BTRFS ? as we can always
 remove it once VFS provides that feature.

Thanks, Anand

v2: Use ro,rw instead of rdonly and directly print it.

 fs/btrfs/super.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 3371213924bd..2cb367a106e3 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1373,6 +1373,13 @@ static char *setup_root_args(char *args)
return buf;
 }
 
+static void print_mount_info(struct btrfs_fs_info *info, int flag, char *opt,
+   char *prefix)
+{
+   btrfs_notice(info, "%s: flags=%s opt=%s\n",
+   prefix, flag & MS_RDONLY ? "ro":"rw", opt);
+}
+
 static struct dentry *mount_subvol(const char *subvol_name, u64 
subvol_objectid,
   int flags, const char *device_name,
   char *data)
@@ -1467,6 +1474,8 @@ static struct dentry *mount_subvol(const char 
*subvol_name, u64 subvol_objectid,
dput(root);
root = ERR_PTR(ret);
deactivate_locked_super(s);
+   } else {
+   print_mount_info(fs_info, flags, data, "mount");
}
}
 
@@ -1845,6 +1854,9 @@ static int btrfs_remount(struct super_block *sb, int 
*flags, char *data)
 out:
wake_up_process(fs_info->transaction_kthread);
btrfs_remount_cleanup(fs_info, old_opts);
+
+   print_mount_info(fs_info, *flags, data, "remount");
+
return 0;
 
 restore:
@@ -2174,6 +2186,7 @@ static int btrfs_statfs(struct dentry *dentry, struct 
kstatfs *buf)
 static void btrfs_kill_super(struct super_block *sb)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+   btrfs_notice(fs_info, "%s\n", "unmount");
kill_anon_super(sb);
free_fs_info(fs_info);
 }
-- 
2.7.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6 12/20] fs: add a new fstype flag to indicate how writeback errors are tracked

2017-06-15 Thread Christoph Hellwig
On Wed, Jun 14, 2017 at 01:24:43PM -0400, Jeff Layton wrote:
> In this smaller set, it's only really used for DAX.

DAX only is implemented by three filesystems, please just fix them
up in one go.

> sync_file_range: ->fsync isn't called directly there, and I think we
> probably want similar semantics to fsync() for it

sync_file_range is only supposed to sync data, so it should not call
->fsync.

> JBD2: will try to re-set the error after clearing it with
> filemap_fdatawait. That's problematic with the new infrastructure so we
> need some way to avoid it.

JBD2 only has two users, please fix them up in one go.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: csum failed root -9

2017-06-15 Thread Qu Wenruo



At 06/14/2017 09:39 PM, Henk Slager wrote:

On Tue, Jun 13, 2017 at 12:47 PM, Henk Slager  wrote:

On Tue, Jun 13, 2017 at 7:24 AM, Kai Krakow  wrote:

Am Mon, 12 Jun 2017 11:00:31 +0200
schrieb Henk Slager :


Hi all,

there is 1-block corruption a 8TB filesystem that showed up several
months ago. The fs is almost exclusively a btrfs receive target and
receives monthly sequential snapshots from two hosts but 1 received
uuid. I do not know exactly when the corruption has happened but it
must have been roughly 3 to 6 months ago. with monthly updated
kernel+progs on that host.

Some more history:
- fs was created in november 2015 on top of luks
- initially bcache between the 2048-sector aligned partition and luks.
Some months ago I removed 'the bcache layer' by making sure that cache
was clean and then zeroing 8K bytes at start of partition in an
isolated situation. Then setting partion offset to 2064 by
delete-recreate in gdisk.
- in december 2016 there were more scrub errors, but related to the
monthly snapshot of december2016. I have removed that snapshot this
year and now only this 1-block csum error is the only issue.
- brand/type is seagate 8TB SMR. At least since kernel 4.4+ that
includes some SMR related changes in the blocklayer this disk works
fine with btrfs.
- the smartctl values show no error so far but I will run an extended
test this week after another btrfs check which did not show any error
earlier with the csum fail being there
- I have noticed that the board that has the disk attached has been
rebooted due to power-failures many times (unreliable power switch and
power dips from energy company) and the 150W powersupply is broken and
replaced since then. Also due to this, I decided to remove bcache
(which has been in write-through and write-around only).

Some btrfs inpect-internal exercise shows that the problem is in a
directory in the root that contains most of the data and snapshots.
But an  rsync -c  with an identical other clone snapshot shows no
difference (no writes to an rw snapshot of that clone). So the fs is
still OK as file-level backup, but btrfs replace/balance will fatal
error on just this 1 csum error. It looks like that this is not a
media/disk error but some HW induced error or SW/kernel issue.
Relevant btrfs commands + dmesg info, see below.

Any comments on how to fix or handle this without incrementally
sending all snapshots to a new fs (6+ TiB of data, assuming this won't
fail)?


# uname -r
4.11.3-1-default
# btrfs --version
btrfs-progs v4.10.2+20170406


There's btrfs-progs v4.11 available...


I started:
# btrfs check -p --readonly /dev/mapper/smr
but it stopped with printing 'Killed' while checking extents. The
board has 8G RAM, no swap (yet), so I just started lowmem mode:
# btrfs check -p --mode lowmem --readonly /dev/mapper/smr

Now after a 1 day 77 lines like this are printed:
ERROR: extent[5365470154752, 81920] referencer count mismatch (root:
6310, owner: 1771130, offset: 33243062272) wanted: 1, have: 2

It is still running, hopefully it will finish within 2 days. But
lateron I can compile/use latest progs from git. Same for kernel,
maybe with some tweaks/patches, but I think I will also plug the disk
into a faster machine then ( i7-4770 instead of the J1900 ).


fs profile is dup for system+meta, single for data

# btrfs scrub start /local/smr


What looks strange to me is that the parameters of the error reports
seem to be rotated by one... See below:


[27609.626555] BTRFS error (device dm-0): parent transid verify failed
on 6350718500864 wanted 23170 found 23076
[27609.685416] BTRFS info (device dm-0): read error corrected: ino 1
off 6350718500864 (dev /dev/mapper/smr sector 11681212672)
[27609.685928] BTRFS info (device dm-0): read error corrected: ino 1
off 6350718504960 (dev /dev/mapper/smr sector 11681212680)
[27609.686160] BTRFS info (device dm-0): read error corrected: ino 1
off 6350718509056 (dev /dev/mapper/smr sector 11681212688)
[27609.687136] BTRFS info (device dm-0): read error corrected: ino 1
off 6350718513152 (dev /dev/mapper/smr sector 11681212696)
[37663.606455] BTRFS error (device dm-0): parent transid verify failed
on 6350453751808 wanted 23170 found 23075
[37663.685158] BTRFS info (device dm-0): read error corrected: ino 1
off 6350453751808 (dev /dev/mapper/smr sector 11679647008)
[37663.685386] BTRFS info (device dm-0): read error corrected: ino 1
off 6350453755904 (dev /dev/mapper/smr sector 11679647016)
[37663.685587] BTRFS info (device dm-0): read error corrected: ino 1
off 635045376 (dev /dev/mapper/smr sector 11679647024)
[37663.685798] BTRFS info (device dm-0): read error corrected: ino 1
off 6350453764096 (dev /dev/mapper/smr sector 11679647032)


Why does it say "ino 1"? Does it mean devid 1?


On a 3-disk btrfs raid1 fs I see in the journal also "read error
corrected: ino 1" lines for all 3 disks. This was with a 4.10.x
kernel, ATM I don't know if this is 

Re: csum failed root -9

2017-06-15 Thread Kai Krakow
Am Wed, 14 Jun 2017 15:39:50 +0200
schrieb Henk Slager :

> On Tue, Jun 13, 2017 at 12:47 PM, Henk Slager 
> wrote:
> > On Tue, Jun 13, 2017 at 7:24 AM, Kai Krakow 
> > wrote:  
> >> Am Mon, 12 Jun 2017 11:00:31 +0200
> >> schrieb Henk Slager :
> >>  
>  [...]  
> >>
> >> There's btrfs-progs v4.11 available...  
> >
> > I started:
> > # btrfs check -p --readonly /dev/mapper/smr
> > but it stopped with printing 'Killed' while checking extents. The
> > board has 8G RAM, no swap (yet), so I just started lowmem mode:
> > # btrfs check -p --mode lowmem --readonly /dev/mapper/smr
> >
> > Now after a 1 day 77 lines like this are printed:
> > ERROR: extent[5365470154752, 81920] referencer count mismatch (root:
> > 6310, owner: 1771130, offset: 33243062272) wanted: 1, have: 2
> >
> > It is still running, hopefully it will finish within 2 days. But
> > lateron I can compile/use latest progs from git. Same for kernel,
> > maybe with some tweaks/patches, but I think I will also plug the
> > disk into a faster machine then ( i7-4770 instead of the J1900 ).
> >  
>  [...]  
> >>
> >> What looks strange to me is that the parameters of the error
> >> reports seem to be rotated by one... See below:
> >>  
>  [...]  
> >>
> >> Why does it say "ino 1"? Does it mean devid 1?  
> >
> > On a 3-disk btrfs raid1 fs I see in the journal also "read error
> > corrected: ino 1" lines for all 3 disks. This was with a 4.10.x
> > kernel, ATM I don't know if this is right or wrong.
> >  
>  [...]  
> >>
> >> And why does it say "root -9"? Shouldn't it be "failed -9 root 257
> >> ino 515567616"? In that case the "off" value would be completely
> >> missing...
> >>
> >> Those "rotations" may mess up with where you try to locate the
> >> error on disk...  
> >
> > I hadn't looked at the numbers like that, but as you indicate, I
> > also think that the 1-block csum fail location is bogus because the
> > kernel calculates that based on some random corruption in critical
> > btrfs structures, also looking at the 77 referencer count
> > mismatches. A negative root ID is already a sort of red flag. When
> > I can mount the fs again after the check is finished, I can
> > hopefully use the output of the check to get clearer how big the
> > 'damage' is.  
> 
> The btrfs lowmem mode check ends with:
> 
> ERROR: root 7331 EXTENT_DATA[928390 3506176] shouldn't be hole
> ERROR: errors found in fs roots
> found 6968612982784 bytes used, error(s) found
> total csum bytes: 6786376404
> total tree bytes: 25656016896
> total fs tree bytes: 14857535488
> total extent tree bytes: 3237216256
> btree space waste bytes: 3072362630
> file data blocks allocated: 38874881994752
>  referenced 36477629964288
> 
> In total 2000+ of those "shouldn't be hole" lines.
> 
> A non-lowmem check, now done with kernel 4.11.4 and progs v4.11 and
> 16G swap added ends with 'noerrors found'

Don't trust lowmem mode too much. The developer of lowmem mode may tell
you more about specific edge cases.

> W.r.t. holes, maybe it is woth to mention the super-flags:
> incompat_flags  0x369
> ( MIXED_BACKREF |
>   COMPRESS_LZO |
>   BIG_METADATA |
>   EXTENDED_IREF |
>   SKINNY_METADATA |
>   NO_HOLES )

I think it's not worth to follow up on this holes topic: I guess it was
a false report of lowmem mode, and it was fixed with 4.11 btrfs progs.

> The fs has received snapshots from source fs that had NO_HOLES enabled
> for some time, but after registed this bug:
> https://bugzilla.kernel.org/show_bug.cgi?id=121321
> I put back that NO_HOLES flag to zero on the source fs. It seems I
> forgot to do that on the 8TB target/backup fs. But I don't know if
> there is a relation between this flag flipping and the btrfs check
> error messages.
> 
> I think I leave it as is for the time being, unless there is some news
> how to fix things with low risk (or maybe via a temp overlay snapshot
> with DM). But the lowmem check took 2 days, that's not really fun.
> The goal for the 8TB fs is to have an up to 7 year snapshot history at
> sometime, now the oldest snapshot is from early 2014, so almost
> halfway :)

Btrfs is still much too unstable to trust 7 years worth of backup to
it. You will probably loose it at some point, especially while many
snapshots are still such a huge performance breaker in btrfs. I suggest
trying out also other alternatives like borg backup for such a project.


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: Use btrfs_space_info_used instead of opencoding it

2017-06-15 Thread Nikolay Borisov


On 15.06.2017 04:17, kbuild test robot wrote:
> Hi Nikolay,
> 
> [auto build test ERROR on v4.9-rc8]
> [cannot apply to btrfs/next kdave/for-next next-20170614]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
> 
> url:
> https://github.com/0day-ci/linux/commits/Nikolay-Borisov/btrfs-Use-btrfs_space_info_used-instead-of-opencoding-it/20170615-084940
> config: x86_64-randconfig-x012-201724 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=x86_64 
> 
> All errors (new ones prefixed by >>):
> 
>fs/btrfs/extent-tree.c: In function 'btrfs_calc_reclaim_metadata_size':
>>> fs/btrfs/extent-tree.c:4918:9: error: implicit declaration of function 
>>> 'btrfs_space_info_used' [-Werror=implicit-function-declaration]
>  used = btrfs_space_info_used(space_info, true);
> ^
>cc1: some warnings being treated as errors

This patch builds cleanly on latest linux master and it's based off that
tree. I don't see why it's being applied to 4.9-rc8.


> 
> vim +/btrfs_space_info_used +4918 fs/btrfs/extent-tree.c
> 
>   4912
>   4913to_reclaim = min_t(u64, num_online_cpus() * SZ_1M, 
> SZ_16M);
>   4914if (can_overcommit(root, space_info, to_reclaim,
>   4915   BTRFS_RESERVE_FLUSH_ALL))
>   4916return 0;
>   4917
>> 4918 used = btrfs_space_info_used(space_info, true);
>   4919
>   4920if (can_overcommit(root, space_info, SZ_1M, 
> BTRFS_RESERVE_FLUSH_ALL))
>   4921expected = 
> div_factor_fine(space_info->total_bytes, 95);
> 
> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html