Re: [PATCH] btrfs: only run delayed refs if we're committing

2018-11-21 Thread Nikolay Borisov



On 21.11.18 at 21:10, Josef Bacik wrote:
> I noticed in a giant dbench run that we spent a lot of time on lock
> contention while running transaction commit.  This is because dbench
> results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
> they all run the delayed refs first thing, so they all contend with
> each other.  This leads to seconds of 0 throughput.  Change this to only
> run the delayed refs if we're the ones committing the transaction.  This
> makes the latency go away and we get no more lock contention.
> 
> Reviewed-by: Omar Sandoval 
> Signed-off-by: Josef Bacik 
Wohooo, we no longer run delayed refs at arbitrary points during
transaction commit.

Reviewed-by: Nikolay Borisov 
> ---
>  fs/btrfs/transaction.c | 24 +---
>  1 file changed, 9 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 3c1be9db897c..41cc96cc59a3 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1918,15 +1918,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
>   btrfs_trans_release_metadata(trans);
>   trans->block_rsv = NULL;
>  
> - /* make a pass through all the delayed refs we have so far
> -  * any runnings procs may add more while we are here
> -  */
> - ret = btrfs_run_delayed_refs(trans, 0);
> - if (ret) {
> - btrfs_end_transaction(trans);
> - return ret;
> - }
> -
>   cur_trans = trans->transaction;
>  
>   /*
> @@ -1938,12 +1929,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
>  
>   btrfs_create_pending_block_groups(trans);
>  
> - ret = btrfs_run_delayed_refs(trans, 0);
> - if (ret) {
> - btrfs_end_transaction(trans);
> - return ret;
> - }
> -
>   if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) {
>   int run_it = 0;
>  
> @@ -2014,6 +1999,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
> *trans)
>   spin_unlock(&fs_info->trans_lock);
>   }
>  
> + /*
> +  * We are now the only one in the commit area, we can run delayed refs
> +  * without hitting a bunch of lock contention from a lot of people
> +  * trying to commit the transaction at once.
> +  */
> + ret = btrfs_run_delayed_refs(trans, 0);
> + if (ret)
> + goto cleanup_transaction;
> +
>   extwriter_counter_dec(cur_trans, trans->type);
>  
>   ret = btrfs_start_delalloc_flush(fs_info);
> 


[PATCH v12] Add cli and ioctl to forget scanned device(s)

2018-11-21 Thread Anand Jain
v12:
  Fixed coding style: leave a space before and after ':'.

v11:
 btrfs-progs: Bring the code into the else part of if(forget).
  Use strerror to print the error instead of ret.

v10:
 Make the btrfs-progs changes more readable.
 Keep the known bug [1] as it is.
  [1]
   The cli 'btrfs device scan --all /dev/sdb', which should scan
only one device, ends up scanning all the devices. I am not trying
to fix this bug in this patch because:
  . -d|--all is marked as deprecated, so I hope the -d option will go away
  . for now some scripts might be using this bug as a feature, and fixing
it might lead to mount failures.

v9:
 Make forget a 'btrfs device scan' option.
 Use forget in the fstests; now you can run fstests with btrfs as rootfs,
  which helps to exercise the uuid_mutex lock.

v8:
 Change log update in the kernel patch.

v7:
 Use struct btrfs_ioctl_vol_args (instead of struct
  btrfs_ioctl_vol_args_v2) as it is in line with the other
  btrfs-control ioctls.
 The CLI usage/features remain the same. However, internally the ioctl
  flag is not required to delete all the unmounted devices; instead leave
  btrfs_ioctl_vol_args::name NULL.

v6:
 Use the changed fn name btrfs_free_stale_devices().

 Change in title:
 Old v5:
 Cover-letter:
  [PATCH v5] Add cli and ioctl to ignore a scanned device
 Kernel:
  [PATCH v5] btrfs: introduce feature to ignore a btrfs device
 Progs:
  [PATCH v5] btrfs-progs: add 'btrfs device ignore' cli

v5:
  Adds feature to delete all stale devices
  Reuses btrfs_free_stale_devices() fn and so depends on the
patch-set [1] in the ML.
  Uses struct btrfs_ioctl_vol_args_v2 instead of
struct btrfs_ioctl_vol_args as arg
  Does the device path matching instead of btrfs_device matching
(we won't delete the mounted device as btrfs_free_stale_devices()
checks for it)
v4:
  No change, but resent as the ML thread may be confusing.
v3:
  No change. Sent to the correct ML.
v2:
  Accepts review from Nikolay, details are in the specific patch.
  Patch 1/2 is renamed from
[PATCH 1/2] btrfs: refactor btrfs_free_stale_device() to get device list 
delete
  to
[PATCH 1/2] btrfs: add function to device list delete

Adds cli and ioctl to forget a scanned device or forget all stale
devices in the kernel.

Anand Jain (1):
  btrfs: introduce feature to forget a btrfs device

 fs/btrfs/super.c   | 3 +++
 fs/btrfs/volumes.c | 9 +
 fs/btrfs/volumes.h | 1 +
 include/uapi/linux/btrfs.h | 2 ++
 4 files changed, 15 insertions(+)

Anand Jain (1):
  btrfs-progs: add cli to forget one or all scanned devices

 cmds-device.c | 63 ++-
 ioctl.h   |  2 ++
 2 files changed, 56 insertions(+), 9 deletions(-)

[1]
Anand Jain (1):
  fstests: btrfs use forget if not reload

 common/btrfs| 20 
 tests/btrfs/124 |  6 +++---
 tests/btrfs/125 |  6 +++---
 tests/btrfs/154 |  6 +++---
 tests/btrfs/164 |  4 ++--
 5 files changed, 31 insertions(+), 11 deletions(-)

-- 
1.8.3.1


[PATCH] fstests: btrfs use forget if not reload

2018-11-21 Thread Anand Jain
[I will send this to the fstests ML after the kernel and progs patches
 have been integrated].

btrfs reload was introduced to cleanup the device list inside the btrfs
kernel module.

The problem with the reload approach is that you can't run btrfs test
cases 124, 125, 154 and 164 on a system with btrfs as the root fs.

Now that we are introducing the btrfs forget feature as a btrfs device
scan option [1], here are the corresponding changes in fstests.

These changes ensure we use the forget feature where available and
fall back to the reload approach where it is not.

[1]
btrfs-progs: add cli to forget one or all scanned devices
btrfs: introduce feature to forget a btrfs device

Signed-off-by: Anand Jain 
---
 common/btrfs| 20 
 tests/btrfs/124 |  6 +++---
 tests/btrfs/125 |  6 +++---
 tests/btrfs/154 |  6 +++---
 tests/btrfs/164 |  4 ++--
 5 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/common/btrfs b/common/btrfs
index 26dc0bb9600f..c7fbec11c8c1 100644
--- a/common/btrfs
+++ b/common/btrfs
@@ -374,3 +374,23 @@ _scratch_btrfs_sectorsize()
$BTRFS_UTIL_PROG inspect-internal dump-super $SCRATCH_DEV |\
grep sectorsize | awk '{print $2}'
 }
+
+_btrfs_supports_forget()
+{
+   $BTRFS_UTIL_PROG device scan --help | grep -wq forget && \
+   $BTRFS_UTIL_PROG device scan --forget > /dev/null 2>&1
+}
+
+_require_btrfs_forget_if_not_fs_loadable()
+{
+   _btrfs_supports_forget && return
+
+   _require_loadable_fs_module "btrfs"
+}
+
+_btrfs_forget_if_not_fs_reload()
+{
+   _btrfs_supports_forget && return
+
+   _reload_fs_module "btrfs"
+}
diff --git a/tests/btrfs/124 b/tests/btrfs/124
index ce3ad6aa3a58..edbeff1443f5 100755
--- a/tests/btrfs/124
+++ b/tests/btrfs/124
@@ -51,7 +51,7 @@ _supported_fs btrfs
 _supported_os Linux
 _require_scratch_dev_pool 2
 _test_unmount
-_require_loadable_fs_module "btrfs"
+_require_btrfs_forget_if_not_fs_loadable
 
 _scratch_dev_pool_get 2
 
@@ -86,7 +86,7 @@ echo "clean btrfs ko" >> $seqres.full
 _scratch_unmount
 
 # un-scan the btrfs devices
-_reload_fs_module "btrfs"
+_btrfs_forget_if_not_fs_reload
 
 echo >> $seqres.full
 echo "-Write degraded mount fill upto $max_fs_sz bytes-" >> 
$seqres.full
@@ -125,7 +125,7 @@ echo
 echo "Mount degraded with the other dev"
 _scratch_unmount
 # un-scan the btrfs devices
-_reload_fs_module "btrfs"
+_btrfs_forget_if_not_fs_reload
 _mount -o degraded $dev2 $SCRATCH_MNT >>$seqres.full 2>&1
 _run_btrfs_util_prog filesystem show
 checkpoint3=`md5sum $SCRATCH_MNT/tf2`
diff --git a/tests/btrfs/125 b/tests/btrfs/125
index e38de264b28e..9161a2ddeaad 100755
--- a/tests/btrfs/125
+++ b/tests/btrfs/125
@@ -50,7 +50,7 @@ _supported_fs btrfs
 _supported_os Linux
 _require_scratch_dev_pool 3
 _test_unmount
-_require_loadable_fs_module "btrfs"
+_require_btrfs_forget_if_not_fs_loadable
 
 _scratch_dev_pool_get 3
 
@@ -102,7 +102,7 @@ echo "unmount" >> $seqres.full
 _scratch_unmount
 echo "clean btrfs ko" >> $seqres.full
 # un-scan the btrfs devices
-_reload_fs_module "btrfs"
+_btrfs_forget_if_not_fs_reload
 _mount -o degraded,device=$dev2 $dev1 $SCRATCH_MNT >>$seqres.full 2>&1
 dd if=/dev/zero of="$SCRATCH_MNT"/tf2 bs=$bs count=$count \
>>$seqres.full 2>&1
@@ -138,7 +138,7 @@ echo "Mount degraded but with other dev"
 
 _scratch_unmount
 # un-scan the btrfs devices
-_reload_fs_module "btrfs"
+_btrfs_forget_if_not_fs_reload
 
 _mount -o degraded,device=${dev2} $dev3 $SCRATCH_MNT >>$seqres.full 2>&1
 
diff --git a/tests/btrfs/154 b/tests/btrfs/154
index 99ea232aba4c..01745d20e9fc 100755
--- a/tests/btrfs/154
+++ b/tests/btrfs/154
@@ -36,7 +36,7 @@ rm -f $seqres.full
 _supported_fs btrfs
 _supported_os Linux
 _require_scratch_dev_pool 2
-_require_loadable_fs_module "btrfs"
+_require_btrfs_forget_if_not_fs_loadable
 
 _scratch_dev_pool_get 2
 
@@ -90,7 +90,7 @@ degrade_mount_write()
 
echo "clean btrfs ko" >> $seqres.full
# un-scan the btrfs devices
-   _reload_fs_module "btrfs"
+   _btrfs_forget_if_not_fs_reload
_mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 2>&1
cnt=$(( $COUNT/10 ))
dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \
@@ -142,7 +142,7 @@ verify()
echo "unmount" >> $seqres.full
 
_scratch_unmount
-   _reload_fs_module "btrfs"
+   _btrfs_forget_if_not_fs_reload
_mount -o degraded $DEV2 $SCRATCH_MNT >>$seqres.full 2>&1
verify_checkpoint1=`md5sum $SCRATCH_MNT/tf1`
verify_checkpoint2=`md5sum $SCRATCH_MNT/tf2`
diff --git a/tests/btrfs/164 b/tests/btrfs/164
index 097191a0e493..55042c4035e0 100755
--- a/tests/btrfs/164
+++ b/tests/btrfs/164
@@ -36,7 +36,7 @@ rm -f $seqres.full
 # Modify as appropriate.
 _supported_fs btrfs
 _supported_os Linux
-_require_loadable_fs_module "btrfs"
+_require_btrfs_forget_if_not_fs_loadable
 _require_scratch_dev_pool 2
 
 _scratch_dev_pool_get 2
@@ -69,7 +69,7 @@ delete_seed()
 

[PATCH v12] btrfs: introduce feature to forget a btrfs device

2018-11-21 Thread Anand Jain
Support for a new command 'btrfs dev forget [dev]' is proposed here
to undo the effects of 'btrfs dev scan [dev]'. For this purpose
this patch proposes to use ioctl #5 as it was empty:
_IOW(BTRFS_IOCTL_MAGIC, 5, ..)
This patch adds the new ioctl BTRFS_IOC_FORGET_DEV, which can be sent to
/dev/btrfs-control to forget one or all devices (devices which are
not mounted) from the btrfs kernel module.

The argument it takes is struct btrfs_ioctl_vol_args; ::name can be
set to specify the device path, and all unmounted devices are removed
from the kernel if no device path is provided.

Again, devices are removed only if the relevant fsid is not mounted.

This new cli can provide:
 . Release of unwanted btrfs_fs_devices and btrfs_device memory if the
   device is not going to be mounted.
 . Ability to mount a device in degraded mode when one of the other
   devices is corrupted, as in a split-brain raid1.
 . Running test cases which require a btrfs.ko reload when the rootfs
   is btrfs.
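For illustration only (not part of this patch): a minimal userspace sketch of
driving the proposed ioctl, mirroring what the btrfs-progs patch in this series
does. The helper name and the fallback define are mine; BTRFS_IOC_FORGET_DEV
only exists once the uapi hunk below is applied, and passing no path (an empty
::name) forgets all unmounted devices, matching the strlen() check in
btrfs_forget_devices().

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

#ifndef BTRFS_IOC_FORGET_DEV
/* Same definition as the uapi hunk of this patch. */
#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \
				  struct btrfs_ioctl_vol_args)
#endif

/* Forget one scanned device, or all unmounted devices when path is NULL. */
static int forget_device(const char *path)
{
	struct btrfs_ioctl_vol_args args;
	int fd, ret = 0;

	fd = open("/dev/btrfs-control", O_RDWR);
	if (fd < 0)
		return -errno;

	memset(&args, 0, sizeof(args));
	if (path)
		strncpy(args.name, path, BTRFS_PATH_NAME_MAX);
	if (ioctl(fd, BTRFS_IOC_FORGET_DEV, &args) < 0)
		ret = -errno;

	close(fd);
	return ret;
}

int main(int argc, char **argv)
{
	int ret = forget_device(argc > 1 ? argv[1] : NULL);

	if (ret)
		fprintf(stderr, "forget failed: %s\n", strerror(-ret));
	return ret ? 1 : 0;
}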

Signed-off-by: Anand Jain 
Reviewed-by: Nikolay Borisov 
---
v11->v12: fix coding style, add spacing before and after ':'.
v1->v11: Please refer to the cover letter (sorry about that).

 fs/btrfs/super.c   | 3 +++
 fs/btrfs/volumes.c | 9 +
 fs/btrfs/volumes.h | 1 +
 include/uapi/linux/btrfs.h | 2 ++
 4 files changed, 15 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d3c6bbc0aa3a..eba2966913ae 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2247,6 +2247,9 @@ static long btrfs_control_ioctl(struct file *file, 
unsigned int cmd,
ret = PTR_ERR_OR_ZERO(device);
mutex_unlock(&uuid_mutex);
break;
+   case BTRFS_IOC_FORGET_DEV:
+   ret = btrfs_forget_devices(vol->name);
+   break;
case BTRFS_IOC_DEVICES_READY:
mutex_lock(&uuid_mutex);
device = btrfs_scan_one_device(vol->name, FMODE_READ,
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8e36cbb355df..48415a9edd46 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1397,6 +1397,15 @@ static int btrfs_read_disk_super(struct block_device 
*bdev, u64 bytenr,
return 0;
 }
 
+int btrfs_forget_devices(const char *path)
+{
+   mutex_lock(&uuid_mutex);
+   btrfs_free_stale_devices(strlen(path) ? path : NULL, NULL);
+   mutex_unlock(&uuid_mutex);
+
+   return 0;
+}
+
 /*
  * Look for a btrfs signature on a device. This may be called out of the mount 
path
  * and we are not allowed to call set_blocksize during the scan. The superblock
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1d936ce282c3..a4a89d5050f5 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -414,6 +414,7 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
   fmode_t flags, void *holder);
 struct btrfs_device *btrfs_scan_one_device(const char *path,
   fmode_t flags, void *holder);
+int btrfs_forget_devices(const char *path);
 int btrfs_close_devices(struct btrfs_fs_devices *fs_devices);
 void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step);
 void btrfs_assign_next_active_device(struct btrfs_device *device,
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index e0763bc4158e..c195896d478f 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -837,6 +837,8 @@ enum btrfs_err_code {
   struct btrfs_ioctl_vol_args)
 #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \
   struct btrfs_ioctl_vol_args)
+#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \
+  struct btrfs_ioctl_vol_args)
 /* trans start and trans end are dangerous, and only for
  * use by applications that know how to avoid the
  * resulting deadlocks
-- 
1.8.3.1



[PATCH v12] btrfs-progs: add cli to forget one or all scanned devices

2018-11-21 Thread Anand Jain
This patch adds the cli
  btrfs device forget [dev]
to remove the given device structure in the kernel if the device
is unmounted. If no argument is given, it removes all stale devices
(devices which are not mounted) from the kernel.

Signed-off-by: Anand Jain 
Reviewed-by: Nikolay Borisov 
---
v11->v12: none.
v1->v12: please refer to the cover letter.

 cmds-device.c | 72 ---
 ioctl.h   |  2 ++
 2 files changed, 61 insertions(+), 13 deletions(-)

diff --git a/cmds-device.c b/cmds-device.c
index 2a05f70a76a9..280d6f555377 100644
--- a/cmds-device.c
+++ b/cmds-device.c
@@ -254,10 +254,32 @@ static int cmd_device_delete(int argc, char **argv)
return _cmd_device_remove(argc, argv, cmd_device_delete_usage);
 }
 
+static int btrfs_forget_devices(char *path)
+{
+   struct btrfs_ioctl_vol_args args;
+   int ret;
+   int fd;
+
+   fd = open("/dev/btrfs-control", O_RDWR);
+   if (fd < 0)
+   return -errno;
+
+   memset(&args, 0, sizeof(args));
+   if (path)
+   strncpy_null(args.name, path);
+   ret = ioctl(fd, BTRFS_IOC_FORGET_DEV, &args);
+   if (ret)
+   ret = -errno;
+   close(fd);
+   return ret;
+}
+
 static const char * const cmd_device_scan_usage[] = {
-   "btrfs device scan [(-d|--all-devices)|<device> [<device>...]]",
-   "Scan devices for a btrfs filesystem",
+   "btrfs device scan [(-d|--all-devices)|(-u|--forget)|<device> "\
+   "[<device>...]]",
+   "Scan or forget (deregister) devices for a btrfs filesystem",
" -d|--all-devices (deprecated)",
+   " -u|--forget [<device> ..]",
NULL
 };
 
@@ -267,37 +289,53 @@ static int cmd_device_scan(int argc, char **argv)
int devstart;
int all = 0;
int ret = 0;
+   int forget = 0;
 
optind = 0;
while (1) {
int c;
static const struct option long_options[] = {
{ "all-devices", no_argument, NULL, 'd'},
+   { "forget", no_argument, NULL, 'u'},
{ NULL, 0, NULL, 0}
};
 
-   c = getopt_long(argc, argv, "d", long_options, NULL);
+   c = getopt_long(argc, argv, "du", long_options, NULL);
if (c < 0)
break;
switch (c) {
case 'd':
all = 1;
break;
+   case 'u':
+   forget = 1;
+   break;
default:
usage(cmd_device_scan_usage);
}
}
devstart = optind;
 
+   if (all && forget)
+   usage(cmd_device_scan_usage);
+
if (all && check_argc_max(argc - optind, 1))
usage(cmd_device_scan_usage);
 
if (all || argc - optind == 0) {
-   printf("Scanning for Btrfs filesystems\n");
-   ret = btrfs_scan_devices();
-   error_on(ret, "error %d while scanning", ret);
-   ret = btrfs_register_all_devices();
-   error_on(ret, "there are %d errors while registering devices", 
ret);
+   if (forget) {
+   ret = btrfs_forget_devices(NULL);
+   error_on(ret, "'%s', forget failed",
+strerror(-ret));
+   } else {
+   printf("Scanning for Btrfs filesystems\n");
+   ret = btrfs_scan_devices();
+   error_on(ret, "error %d while scanning", ret);
+   ret = btrfs_register_all_devices();
+   error_on(ret,
+   "there are %d errors while registering devices",
+   ret);
+   }
goto out;
}
 
@@ -315,11 +353,19 @@ static int cmd_device_scan(int argc, char **argv)
ret = 1;
goto out;
}
-   printf("Scanning for Btrfs filesystems in '%s'\n", path);
-   if (btrfs_register_one_device(path) != 0) {
-   ret = 1;
-   free(path);
-   goto out;
+   if (forget) {
+   ret = btrfs_forget_devices(path);
+   if (ret)
+   error("Can't forget '%s': %s",
+   path, strerror(-ret));
+   } else {
+   printf("Scanning for Btrfs filesystems in '%s'\n",
+   path);
+   if (btrfs_register_one_device(path) != 0) {
+   ret = 1;
+   free(path);
+   goto out;
+   }

Re: [PATCH 2/6] btrfs: add cleanup_ref_head_accounting helper

2018-11-21 Thread Qu Wenruo


On 2018/11/22 2:59 AM, Josef Bacik wrote:
> From: Josef Bacik 
> 
> We were missing some quota cleanups in check_ref_cleanup, so break the
> ref head accounting cleanup into a helper and call that from both
> check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
> we don't screw up accounting in the future for other things that we add.
> 
> Reviewed-by: Omar Sandoval 
> Reviewed-by: Liu Bo 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 67 
> +-
>  1 file changed, 39 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index c36b3a42f2bb..e3ed3507018d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle 
> *trans,
>   return ret ? ret : 1;
>  }
>  
> +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> + struct btrfs_delayed_ref_head *head)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_delayed_ref_root *delayed_refs =
> + &trans->transaction->delayed_refs;
> +
> + if (head->total_ref_mod < 0) {
> + struct btrfs_space_info *space_info;
> + u64 flags;
> +
> + if (head->is_data)
> + flags = BTRFS_BLOCK_GROUP_DATA;
> + else if (head->is_system)
> + flags = BTRFS_BLOCK_GROUP_SYSTEM;
> + else
> + flags = BTRFS_BLOCK_GROUP_METADATA;
> + space_info = __find_space_info(fs_info, flags);
> + ASSERT(space_info);
> + percpu_counter_add_batch(&space_info->total_bytes_pinned,
> +-head->num_bytes,
> +BTRFS_TOTAL_BYTES_PINNED_BATCH);
> +
> + if (head->is_data) {
> + spin_lock(&delayed_refs->lock);
> + delayed_refs->pending_csums -= head->num_bytes;
> + spin_unlock(&delayed_refs->lock);
> + }
> + }
> +
> + /* Also free its reserved qgroup space */
> + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> +   head->qgroup_reserved);

This part will be removed in patch "[PATCH] btrfs: qgroup: Move reserved
data account from btrfs_delayed_ref_head to btrfs_qgroup_extent_record".

So there is one less thing to worry about in delayed ref head.

Thanks,
Qu

> +}
> +
>  static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>   struct btrfs_delayed_ref_head *head)
>  {
> @@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(&head->lock);
>   spin_unlock(&delayed_refs->lock);
>  
> - trace_run_delayed_ref_head(fs_info, head, 0);
> -
> - if (head->total_ref_mod < 0) {
> - struct btrfs_space_info *space_info;
> - u64 flags;
> -
> - if (head->is_data)
> - flags = BTRFS_BLOCK_GROUP_DATA;
> - else if (head->is_system)
> - flags = BTRFS_BLOCK_GROUP_SYSTEM;
> - else
> - flags = BTRFS_BLOCK_GROUP_METADATA;
> - space_info = __find_space_info(fs_info, flags);
> - ASSERT(space_info);
> - percpu_counter_add_batch(&space_info->total_bytes_pinned,
> --head->num_bytes,
> -BTRFS_TOTAL_BYTES_PINNED_BATCH);
> -
> - if (head->is_data) {
> - spin_lock(&delayed_refs->lock);
> - delayed_refs->pending_csums -= head->num_bytes;
> - spin_unlock(&delayed_refs->lock);
> - }
> - }
> -
>   if (head->must_insert_reserved) {
>   btrfs_pin_extent(fs_info, head->bytenr,
>head->num_bytes, 1);
> @@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   }
>   }
>  
> - /* Also free its reserved qgroup space */
> - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> -   head->qgroup_reserved);
> + cleanup_ref_head_accounting(trans, head);
> +
> + trace_run_delayed_ref_head(fs_info, head, 0);
>   btrfs_delayed_ref_unlock(head);
>   btrfs_put_delayed_ref_head(head);
>   return 0;
> @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (head->must_insert_reserved)
>   ret = 1;
>  
> + cleanup_ref_head_accounting(trans, head);
>   mutex_unlock(&head->mutex);
>   btrfs_put_delayed_ref_head(head);
>   return ret;
> 



signature.asc
Description: OpenPGP digital signature


[RFC][PATCH v4 09/09] btrfs: use common file type conversion

2018-11-21 Thread Phillip Potter
Deduplicate the btrfs file type conversion implementation - file systems
that use the same file types as defined by POSIX do not need to define
their own versions and can use the common helper functions declared in
fs_types.h and implemented in fs_types.c.
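As a side note, the compile-time guarantee this relies on (the BUILD_BUG_ON()
checks added to btrfs_inode_type() below) can be sketched in plain userspace C
with _Static_assert. The enum values follow the 0..7 assignment documented in
btrfs_tree.h; the names and the lookup table are illustrative only, not the
actual fs_types.h API.

#include <stdio.h>

enum common_ft {
	FT_UNKNOWN, FT_REG_FILE, FT_DIR, FT_CHRDEV,
	FT_BLKDEV, FT_FIFO, FT_SOCK, FT_SYMLINK, FT_MAX
};

enum btrfs_ft {
	BTRFS_FT_UNKNOWN, BTRFS_FT_REG_FILE, BTRFS_FT_DIR, BTRFS_FT_CHRDEV,
	BTRFS_FT_BLKDEV, BTRFS_FT_FIFO, BTRFS_FT_SOCK, BTRFS_FT_SYMLINK
};

/*
 * The deduplication is only safe because the two numberings agree;
 * these asserts fail the build if they ever drift apart.
 */
_Static_assert(BTRFS_FT_UNKNOWN == FT_UNKNOWN, "file type mismatch");
_Static_assert(BTRFS_FT_DIR == FT_DIR, "file type mismatch");
_Static_assert(BTRFS_FT_SYMLINK == FT_SYMLINK, "file type mismatch");

/* One shared lookup table can then replace each filesystem's private copy. */
static const char * const ft_names[FT_MAX] = {
	"unknown", "regular", "directory", "chardev",
	"blockdev", "fifo", "socket", "symlink",
};

int main(void)
{
	printf("BTRFS_FT_DIR -> %s\n", ft_names[BTRFS_FT_DIR]);
	return 0;
}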

Acked-by: David Sterba 
Signed-off-by: Amir Goldstein 
Signed-off-by: Phillip Potter 
---
 fs/btrfs/btrfs_inode.h  |  2 --
 fs/btrfs/delayed-inode.c|  2 +-
 fs/btrfs/inode.c| 32 +++-
 include/uapi/linux/btrfs_tree.h |  2 ++
 4 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 97d91e55b70a..bb01c804485f 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -196,8 +196,6 @@ struct btrfs_inode {
struct inode vfs_inode;
 };
 
-extern unsigned char btrfs_filetype_table[];
-
 static inline struct btrfs_inode *BTRFS_I(const struct inode *inode)
 {
return container_of(inode, struct btrfs_inode, vfs_inode);
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index c669f250d4a0..e61947f5eb76 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1692,7 +1692,7 @@ int btrfs_readdir_delayed_dir_index(struct dir_context 
*ctx,
name = (char *)(di + 1);
name_len = btrfs_stack_dir_name_len(di);
 
-   d_type = btrfs_filetype_table[di->type];
+   d_type = fs_ftype_to_dtype(di->type);
btrfs_disk_key_to_cpu(&location, &di->location);
 
over = !dir_emit(ctx, name, name_len,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9ea4c6f0352f..8b7b1b29e2ad 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -72,17 +72,6 @@ struct kmem_cache *btrfs_trans_handle_cachep;
 struct kmem_cache *btrfs_path_cachep;
 struct kmem_cache *btrfs_free_space_cachep;
 
-#define S_SHIFT 12
-static const unsigned char btrfs_type_by_mode[S_IFMT >> S_SHIFT] = {
-   [S_IFREG >> S_SHIFT]= BTRFS_FT_REG_FILE,
-   [S_IFDIR >> S_SHIFT]= BTRFS_FT_DIR,
-   [S_IFCHR >> S_SHIFT]= BTRFS_FT_CHRDEV,
-   [S_IFBLK >> S_SHIFT]= BTRFS_FT_BLKDEV,
-   [S_IFIFO >> S_SHIFT]= BTRFS_FT_FIFO,
-   [S_IFSOCK >> S_SHIFT]   = BTRFS_FT_SOCK,
-   [S_IFLNK >> S_SHIFT]= BTRFS_FT_SYMLINK,
-};
-
 static int btrfs_setsize(struct inode *inode, struct iattr *attr);
 static int btrfs_truncate(struct inode *inode, bool skip_writeback);
 static int btrfs_finish_ordered_io(struct btrfs_ordered_extent 
*ordered_extent);
@@ -5793,10 +5782,6 @@ static struct dentry *btrfs_lookup(struct inode *dir, 
struct dentry *dentry,
return d_splice_alias(inode, dentry);
 }
 
-unsigned char btrfs_filetype_table[] = {
-   DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
-};
-
 /*
  * All this infrastructure exists because dir_emit can fault, and we are 
holding
  * the tree lock when doing readdir.  For now just allocate a buffer and copy
@@ -5935,7 +5920,7 @@ static int btrfs_real_readdir(struct file *file, struct 
dir_context *ctx)
name_ptr = (char *)(entry + 1);
read_extent_buffer(leaf, name_ptr, (unsigned long)(di + 1),
   name_len);
-   put_unaligned(btrfs_filetype_table[btrfs_dir_type(leaf, di)],
+   put_unaligned(fs_ftype_to_dtype(btrfs_dir_type(leaf, di)),
&entry->type);
btrfs_dir_item_key_to_cpu(leaf, di, &location);
put_unaligned(location.objectid, &entry->ino);
@@ -6340,7 +6325,20 @@ static struct inode *btrfs_new_inode(struct 
btrfs_trans_handle *trans,
 
 static inline u8 btrfs_inode_type(struct inode *inode)
 {
-   return btrfs_type_by_mode[(inode->i_mode & S_IFMT) >> S_SHIFT];
+   /*
+* compile-time asserts that generic FT_x types still match
+* BTRFS_FT_x types
+*/
+   BUILD_BUG_ON(BTRFS_FT_UNKNOWN != FT_UNKNOWN);
+   BUILD_BUG_ON(BTRFS_FT_REG_FILE != FT_REG_FILE);
+   BUILD_BUG_ON(BTRFS_FT_DIR != FT_DIR);
+   BUILD_BUG_ON(BTRFS_FT_CHRDEV != FT_CHRDEV);
+   BUILD_BUG_ON(BTRFS_FT_BLKDEV != FT_BLKDEV);
+   BUILD_BUG_ON(BTRFS_FT_FIFO != FT_FIFO);
+   BUILD_BUG_ON(BTRFS_FT_SOCK != FT_SOCK);
+   BUILD_BUG_ON(BTRFS_FT_SYMLINK != FT_SYMLINK);
+
+   return fs_umode_to_ftype(inode->i_mode);
 }
 
 /*
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index aff1356c2bb8..4126ed7ee89a 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -307,6 +307,8 @@
  *
  * Used by:
  * struct btrfs_dir_item.type
+ *
+ * Values 0..7 must match common file type values in fs_types.h.
  */
 #define BTRFS_FT_UNKNOWN   0
 #define BTRFS_FT_REG_FILE  1
-- 
2.19.1



[PATCH 1/3] btrfs: run delayed iputs before committing

2018-11-21 Thread Josef Bacik
Delayed iputs mean we can have final iputs of deleted inodes in the
queue, which could potentially generate a lot of pinned space that could
be freed.  So before we decide to commit the transaction for ENOSPC
reasons, run the delayed iputs so that any potential space is freed up.
If there was some and we freed enough, we can then commit the transaction
and potentially be able to make our reservation.

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
---
 fs/btrfs/extent-tree.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 90423b6749b7..3a90dc1d6b31 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4833,6 +4833,15 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
if (!bytes)
return 0;
 
+   /*
+* If we have pending delayed iputs then we could free up a bunch of
+* pinned space, so make sure we run the iputs before we do our pinned
+* bytes check below.
+*/
+   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
+   btrfs_run_delayed_iputs(fs_info);
+   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+
trans = btrfs_join_transaction(fs_info->extent_root);
if (IS_ERR(trans))
return PTR_ERR(trans);
-- 
2.14.3



[PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-11-21 Thread Josef Bacik
The cleaner thread usually takes care of delayed iputs, with the
exception of the btrfs_end_transaction_throttle path.  The cleaner
thread only gets woken up every 30 seconds, so instead wake it up to do
its work so that we can free up that space as quickly as possible.

Reviewed-by: Filipe Manana 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3da9ac463344..3c42d8887183 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3264,6 +3264,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
ASSERT(list_empty(&binode->delayed_iput));
list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);
spin_unlock(&fs_info->delayed_iput_lock);
+   wake_up_process(fs_info->cleaner_kthread);
 }
 
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
-- 
2.14.3



[PATCH] btrfs: only run delayed refs if we're committing

2018-11-21 Thread Josef Bacik
I noticed in a giant dbench run that we spent a lot of time on lock
contention while running transaction commit.  This is because dbench
results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
they all run the delayed refs first thing, so they all contend with
each other.  This leads to seconds of 0 throughput.  Change this to only
run the delayed refs if we're the ones committing the transaction.  This
makes the latency go away and we get no more lock contention.

Reviewed-by: Omar Sandoval 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/transaction.c | 24 +---
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 3c1be9db897c..41cc96cc59a3 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1918,15 +1918,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
btrfs_trans_release_metadata(trans);
trans->block_rsv = NULL;
 
-   /* make a pass through all the delayed refs we have so far
-* any runnings procs may add more while we are here
-*/
-   ret = btrfs_run_delayed_refs(trans, 0);
-   if (ret) {
-   btrfs_end_transaction(trans);
-   return ret;
-   }
-
cur_trans = trans->transaction;
 
/*
@@ -1938,12 +1929,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
 
btrfs_create_pending_block_groups(trans);
 
-   ret = btrfs_run_delayed_refs(trans, 0);
-   if (ret) {
-   btrfs_end_transaction(trans);
-   return ret;
-   }
-
if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) {
int run_it = 0;
 
@@ -2014,6 +1999,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
spin_unlock(&fs_info->trans_lock);
}
 
+   /*
+* We are now the only one in the commit area, we can run delayed refs
+* without hitting a bunch of lock contention from a lot of people
+* trying to commit the transaction at once.
+*/
+   ret = btrfs_run_delayed_refs(trans, 0);
+   if (ret)
+   goto cleanup_transaction;
+
extwriter_counter_dec(cur_trans, trans->type);
 
ret = btrfs_start_delalloc_flush(fs_info);
-- 
2.14.3



[PATCH 0/3] Delayed iput fixes

2018-11-21 Thread Josef Bacik
Here are some delayed iput fixes.  Delayed iputs can hold reservations for a
while and there's no really good way to make sure they are gone for good, which
means we could return ENOSPC early when, if we had just waited for the iputs,
we would have had plenty of space.  So fix this up by making us wait for delayed
iputs when deciding if we need to commit for enospc flushing, and then clean up
and rework how we run delayed iputs to make it more straightforward to wait on
them and make sure we're all done using them.  Thanks,

Josef


[PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue

2018-11-21 Thread Josef Bacik
The throttle path doesn't take cleaner_delayed_iput_mutex, which means
we could think we're done flushing iputs in the data space reservation
path when we could have a throttler doing an iput.  There's no real
reason to serialize the delayed iput flushing, so instead of taking the
cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
replace it with an atomic counter and a waitqueue.  This removes the
short (or long depending on how big the inode is) window where we think
there are no more pending iputs when there really are some.
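The shape of that change (a pending-work counter plus a wait queue instead of a
mutex held around the whole flush) can be sketched in userspace with pthreads;
the names below are illustrative and not the kernel API:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t drained = PTHREAD_COND_INITIALIZER;
static int nr_pending;			/* analogue of nr_delayed_iputs */

static void add_pending(void)		/* analogue of btrfs_add_delayed_iput() */
{
	pthread_mutex_lock(&lock);
	nr_pending++;
	pthread_mutex_unlock(&lock);
}

static void complete_one(void)		/* analogue of running one delayed iput */
{
	pthread_mutex_lock(&lock);
	if (--nr_pending == 0)
		pthread_cond_broadcast(&drained);
	pthread_mutex_unlock(&lock);
}

static void wait_until_drained(void)	/* analogue of btrfs_wait_on_delayed_iputs() */
{
	pthread_mutex_lock(&lock);
	while (nr_pending > 0)
		pthread_cond_wait(&drained, &lock);
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	add_pending();
	complete_one();
	wait_until_drained();
	printf("all pending work drained\n");
	return 0;
}

A throttler that queues an iput bumps the counter before any waiter can observe
zero, which is what closes the window described above.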

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h   |  4 +++-
 fs/btrfs/disk-io.c |  5 ++---
 fs/btrfs/extent-tree.c |  9 +
 fs/btrfs/inode.c   | 21 +
 4 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 709de7471d86..a835fe7076eb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -912,7 +912,8 @@ struct btrfs_fs_info {
 
spinlock_t delayed_iput_lock;
struct list_head delayed_iputs;
-   struct mutex cleaner_delayed_iput_mutex;
+   atomic_t nr_delayed_iputs;
+   wait_queue_head_t delayed_iputs_wait;
 
/* this protects tree_mod_seq_list */
spinlock_t tree_mod_seq_lock;
@@ -3237,6 +3238,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root);
 int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size);
 void btrfs_add_delayed_iput(struct inode *inode);
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info);
+int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info);
 int btrfs_prealloc_file_range(struct inode *inode, int mode,
  u64 start, u64 num_bytes, u64 min_size,
  loff_t actual_len, u64 *alloc_hint);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c5918ff8241b..3f81dfaefa32 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1692,9 +1692,7 @@ static int cleaner_kthread(void *arg)
goto sleep;
}
 
-   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
btrfs_run_delayed_iputs(fs_info);
-   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
 
again = btrfs_clean_one_deleted_snapshot(root);
mutex_unlock(_info->cleaner_mutex);
@@ -2651,7 +2649,6 @@ int open_ctree(struct super_block *sb,
mutex_init(&fs_info->delete_unused_bgs_mutex);
mutex_init(&fs_info->reloc_mutex);
mutex_init(&fs_info->delalloc_root_mutex);
-   mutex_init(&fs_info->cleaner_delayed_iput_mutex);
seqlock_init(&fs_info->profiles_lock);

INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
@@ -2673,6 +2670,7 @@ int open_ctree(struct super_block *sb,
atomic_set(&fs_info->defrag_running, 0);
atomic_set(&fs_info->qgroup_op_seq, 0);
atomic_set(&fs_info->reada_works_cnt, 0);
+   atomic_set(&fs_info->nr_delayed_iputs, 0);
atomic64_set(&fs_info->tree_mod_seq, 0);
fs_info->sb = sb;
fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE;
@@ -2750,6 +2748,7 @@ int open_ctree(struct super_block *sb,
init_waitqueue_head(&fs_info->transaction_wait);
init_waitqueue_head(&fs_info->transaction_blocked_wait);
init_waitqueue_head(&fs_info->async_submit_wait);
+   init_waitqueue_head(&fs_info->delayed_iputs_wait);

INIT_LIST_HEAD(&fs_info->pinned_chunks);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3a90dc1d6b31..36f43876be22 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4272,8 +4272,9 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode 
*inode, u64 bytes)
 * operations. Wait for it to finish so that
 * more space is released.
 */
-   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
-   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+   ret = btrfs_wait_on_delayed_iputs(fs_info);
+   if (ret)
+   return ret;
goto again;
} else {
btrfs_end_transaction(trans);
@@ -4838,9 +4839,9 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
 * pinned space, so make sure we run the iputs before we do our pinned
 * bytes check below.
 */
-   mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
btrfs_run_delayed_iputs(fs_info);
-   mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+   wait_event(fs_info->delayed_iputs_wait,
+  atomic_read(&fs_info->nr_delayed_iputs) == 0);
 
trans = btrfs_join_transaction(fs_info->extent_root);
if (IS_ERR(trans))
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3c42d8887183..57bf514a90eb 100644
--- a/fs/btrfs/inode.c
+++ 

[PATCH 4/7] btrfs: call btrfs_create_pending_block_groups unconditionally

2018-11-21 Thread Josef Bacik
The first thing btrfs_create_pending_block_groups() does is loop through
the list, so this

if (!list_empty(&trans->new_bgs))
	btrfs_create_pending_block_groups();

thing is just wasted space.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 3 +--
 fs/btrfs/transaction.c | 6 ++
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a0f8880ee5dd..b9b829c8825c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3004,8 +3004,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle 
*trans,
}
 
if (run_all) {
-   if (!list_empty(&trans->new_bgs))
-   btrfs_create_pending_block_groups(trans);
+   btrfs_create_pending_block_groups(trans);
 
spin_lock(&delayed_refs->lock);
node = rb_first_cached(&delayed_refs->href_root);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index a4682a696fb6..826a15a07fce 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -845,8 +845,7 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
btrfs_trans_release_metadata(trans);
trans->block_rsv = NULL;
 
-   if (!list_empty(&trans->new_bgs))
-   btrfs_create_pending_block_groups(trans);
+   btrfs_create_pending_block_groups(trans);
 
btrfs_trans_release_chunk_metadata(trans);
 
@@ -1937,8 +1936,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
cur_trans->delayed_refs.flushing = 1;
smp_wmb();
 
-   if (!list_empty(&trans->new_bgs))
-   btrfs_create_pending_block_groups(trans);
+   btrfs_create_pending_block_groups(trans);
 
ret = btrfs_run_delayed_refs(trans, 0);
if (ret) {
-- 
2.14.3



[PATCH 6/7] btrfs: cleanup pending bgs on transaction abort

2018-11-21 Thread Josef Bacik
We may abort the transaction during a commit and not have a chance to
run the pending bgs stuff, which will leave block groups on our list and
cause us accounting issues and leaked memory.  Fix this by running the
pending bgs when we cleanup a transaction.

Reviewed-by: Omar Sandoval 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/transaction.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 826a15a07fce..3c1be9db897c 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2269,6 +2269,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
btrfs_scrub_continue(fs_info);
 cleanup_transaction:
btrfs_trans_release_metadata(trans);
+   /* This cleans up the pending block groups list properly. */
+   if (!trans->aborted)
+   trans->aborted = ret;
+   btrfs_create_pending_block_groups(trans);
btrfs_trans_release_chunk_metadata(trans);
trans->block_rsv = NULL;
btrfs_warn(fs_info, "Skipping commit of aborted transaction.");
-- 
2.14.3



[PATCH 5/7] btrfs: just delete pending bgs if we are aborted

2018-11-21 Thread Josef Bacik
We still need to do all of the accounting cleanup for pending block
groups if we abort.  So set the ret to trans->aborted so if we aborted
the cleanup happens and everybody is happy.

Reviewed-by: Omar Sandoval 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b9b829c8825c..90423b6749b7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10500,11 +10500,17 @@ void btrfs_create_pending_block_groups(struct 
btrfs_trans_handle *trans)
struct btrfs_root *extent_root = fs_info->extent_root;
struct btrfs_block_group_item item;
struct btrfs_key key;
-   int ret = 0;
+   int ret;
 
if (!trans->can_flush_pending_bgs)
return;
 
+   /*
+* If we aborted the transaction with pending bg's we need to just
+* cleanup the list and carry on.
+*/
+   ret = trans->aborted;
+
while (!list_empty(&trans->new_bgs)) {
block_group = list_first_entry(&trans->new_bgs,
   struct btrfs_block_group_cache,
-- 
2.14.3



[PATCH 7/7] btrfs: wait on ordered extents on abort cleanup

2018-11-21 Thread Josef Bacik
If we flip read-only before we initiate writeback on all dirty pages for
ordered extents we've created then we'll have ordered extents left over
on umount, which results in all sorts of bad things happening.  Fix this
by making sure we wait on ordered extents if we have to do the aborted
transaction cleanup stuff.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8e7926c91e35..c5918ff8241b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4171,6 +4171,14 @@ static void btrfs_destroy_all_ordered_extents(struct 
btrfs_fs_info *fs_info)
spin_lock(&fs_info->ordered_root_lock);
}
spin_unlock(&fs_info->ordered_root_lock);
+
+   /*
+* We need this here because if we've been flipped read-only we won't
+* get sync() from the umount, so we need to make sure any ordered
+* extents that haven't had their dirty pages IO start writeout yet
+* actually get run and error out properly.
+*/
+   btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 }
 
 static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
-- 
2.14.3



[PATCH 0/7] Abort cleanup fixes

2018-11-21 Thread Josef Bacik
A new xfstest that really hammers on transaction aborts (generic/495 I think?)
uncovered a lot of random issues.  Some of these were introduced with the new
delayed refs rsv patches, others were just exposed by them, such as the pending
bg stuff.  With these patches in place I stopped getting all the random
leftovers and WARN_ON()'s when running whichever xfstest that was and things are
much smoother now.  Thanks,

Josef


[PATCH 3/7] btrfs: handle delayed ref head accounting cleanup in abort

2018-11-21 Thread Josef Bacik
We weren't doing any of the accounting cleanup when we aborted
transactions.  Fix this by making cleanup_ref_head_accounting global and
calling it from the abort code, this fixes the issue where our
accounting was all wrong after the fs aborts.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h   |  5 +
 fs/btrfs/disk-io.c |  1 +
 fs/btrfs/extent-tree.c | 13 ++---
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8ccc5019172b..709de7471d86 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -35,6 +35,7 @@
 struct btrfs_trans_handle;
 struct btrfs_transaction;
 struct btrfs_pending_snapshot;
+struct btrfs_delayed_ref_root;
 extern struct kmem_cache *btrfs_trans_handle_cachep;
 extern struct kmem_cache *btrfs_bit_radix_cachep;
 extern struct kmem_cache *btrfs_path_cachep;
@@ -2643,6 +2644,10 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle 
*trans,
   unsigned long count);
 int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info,
 unsigned long count, u64 transid, int wait);
+void
+btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
+ struct btrfs_delayed_ref_root *delayed_refs,
+ struct btrfs_delayed_ref_head *head);
 int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 
len);
 int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info, u64 bytenr,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7d02748cf3f6..8e7926c91e35 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4223,6 +4223,7 @@ static int btrfs_destroy_delayed_refs(struct 
btrfs_transaction *trans,
if (pin_bytes)
btrfs_pin_extent(fs_info, head->bytenr,
 head->num_bytes, 1);
+   btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
btrfs_put_delayed_ref_head(head);
cond_resched();
spin_lock(&delayed_refs->lock);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e31980d451c2..a0f8880ee5dd 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2457,12 +2457,11 @@ static int run_and_cleanup_extent_op(struct 
btrfs_trans_handle *trans,
return ret ? ret : 1;
 }
 
-static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
-   struct btrfs_delayed_ref_head *head)
+void
+btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
+ struct btrfs_delayed_ref_root *delayed_refs,
+ struct btrfs_delayed_ref_head *head)
 {
-   struct btrfs_fs_info *fs_info = trans->fs_info;
-   struct btrfs_delayed_ref_root *delayed_refs =
-   &trans->transaction->delayed_refs;
int nr_items = 1;
 
if (head->total_ref_mod < 0) {
@@ -2540,7 +2539,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
}
}
 
-   cleanup_ref_head_accounting(trans, head);
+   btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
 
trace_run_delayed_ref_head(fs_info, head, 0);
btrfs_delayed_ref_unlock(head);
@@ -7218,7 +7217,7 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (head->must_insert_reserved)
ret = 1;
 
-   cleanup_ref_head_accounting(trans, head);
+   btrfs_cleanup_ref_head_accounting(trans->fs_info, delayed_refs, head);
mutex_unlock(&head->mutex);
btrfs_put_delayed_ref_head(head);
return ret;
-- 
2.14.3



[PATCH 2/7] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head

2018-11-21 Thread Josef Bacik
Instead of open coding this stuff, use the helper.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f062fb0487cd..7d02748cf3f6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4215,12 +4215,7 @@ static int btrfs_destroy_delayed_refs(struct 
btrfs_transaction *trans,
if (head->must_insert_reserved)
pin_bytes = true;
btrfs_free_delayed_extent_op(head->extent_op);
-   delayed_refs->num_heads--;
-   if (head->processing == 0)
-   delayed_refs->num_heads_ready--;
-   atomic_dec(&delayed_refs->num_entries);
-   rb_erase_cached(&head->href_node, &delayed_refs->href_root);
-   RB_CLEAR_NODE(&head->href_node);
+   btrfs_delete_ref_head(delayed_refs, head);
spin_unlock(&head->lock);
spin_unlock(&delayed_refs->lock);
mutex_unlock(&head->mutex);
-- 
2.14.3



[PATCH 1/7] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock

2018-11-21 Thread Josef Bacik
We have this open coded in btrfs_destroy_delayed_refs, use the helper
instead.

Reviewed-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index beaa58e742b5..f062fb0487cd 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4197,16 +4197,9 @@ static int btrfs_destroy_delayed_refs(struct 
btrfs_transaction *trans,
 
head = rb_entry(node, struct btrfs_delayed_ref_head,
href_node);
-   if (!mutex_trylock(&head->mutex)) {
-   refcount_inc(&head->refs);
-   spin_unlock(&delayed_refs->lock);
-
-   mutex_lock(&head->mutex);
-   mutex_unlock(&head->mutex);
-   btrfs_put_delayed_ref_head(head);
-   spin_lock(&delayed_refs->lock);
+   if (btrfs_delayed_ref_lock(delayed_refs, head))
continue;
-   }
+
spin_lock(&head->lock);
while ((n = rb_first_cached(&head->ref_tree)) != NULL) {
ref = rb_entry(n, struct btrfs_delayed_ref_node,
-- 
2.14.3



[PATCH 3/8] btrfs: don't use global rsv for chunk allocation

2018-11-21 Thread Josef Bacik
We've done this forever because of the voodoo around knowing how much
space we have.  However we have better ways of doing this now, and on
normal file systems we'll easily have a global reserve of 512MiB, and
since metadata chunks are usually 1GiB that means we'll allocate
metadata chunks more readily.  Instead use the actual used amount when
determining if we need to allocate a chunk or not.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7a30fbc05e5e..a91b3183dcae 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4388,21 +4388,12 @@ static inline u64 calc_global_rsv_need_space(struct 
btrfs_block_rsv *global)
 static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
  struct btrfs_space_info *sinfo, int force)
 {
-   struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
u64 bytes_used = btrfs_space_info_used(sinfo, false);
u64 thresh;
 
if (force == CHUNK_ALLOC_FORCE)
return 1;
 
-   /*
-* We need to take into account the global rsv because for all intents
-* and purposes it's used space.  Don't worry about locking the
-* global_rsv, it doesn't change except when the transaction commits.
-*/
-   if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA)
-   bytes_used += calc_global_rsv_need_space(global_rsv);
-
/*
 * in limited mode, we want to have some free space up to
 * about 1% of the FS size.
-- 
2.14.3



[PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code

2018-11-21 Thread Josef Bacik
With my change to no longer take into account the global reserve for
metadata allocation chunks we have this side-effect for mixed block
group fs'es where we are no longer allocating enough chunks for the
data/metadata requirements.  To deal with this add an ALLOC_CHUNK_FORCE
step to the flushing state machine.  This will only get used if we've
already made a full loop through the flushing machinery and tried
committing the transaction.  If we have then we can try and force a
chunk allocation since we likely need it to make progress.  This
resolves the issues I was seeing with the mixed bg tests in xfstests
with my previous patch.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h |  3 ++-
 fs/btrfs/extent-tree.c   | 18 +-
 include/trace/events/btrfs.h |  1 +
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0c6d589c8ce4..8ccc5019172b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2750,7 +2750,8 @@ enum btrfs_flush_state {
FLUSH_DELALLOC  =   5,
FLUSH_DELALLOC_WAIT =   6,
ALLOC_CHUNK =   7,
-   COMMIT_TRANS=   8,
+   ALLOC_CHUNK_FORCE   =   8,
+   COMMIT_TRANS=   9,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a91b3183dcae..e6bb6ce23c84 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4927,6 +4927,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
btrfs_end_transaction(trans);
break;
case ALLOC_CHUNK:
+   case ALLOC_CHUNK_FORCE:
trans = btrfs_join_transaction(root);
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
@@ -4934,7 +4935,9 @@ static void flush_space(struct btrfs_fs_info *fs_info,
}
ret = do_chunk_alloc(trans,
 btrfs_metadata_alloc_profile(fs_info),
-CHUNK_ALLOC_NO_FORCE);
+(state == ALLOC_CHUNK) ?
+CHUNK_ALLOC_NO_FORCE :
+CHUNK_ALLOC_FORCE);
btrfs_end_transaction(trans);
if (ret > 0 || ret == -ENOSPC)
ret = 0;
@@ -5070,6 +5073,19 @@ static void btrfs_async_reclaim_metadata_space(struct 
work_struct *work)
commit_cycles--;
}
 
+   /*
+* We don't want to force a chunk allocation until we've tried
+* pretty hard to reclaim space.  Think of the case where we
+* free'd up a bunch of space and so have a lot of pinned space
+* to reclaim.  We would rather use that than possibly create a
+* underutilized metadata chunk.  So if this is our first run
+* through the flushing state machine skip ALLOC_CHUNK_FORCE and
+* commit the transaction.  If nothing has changed the next go
+* around then we can force a chunk allocation.
+*/
+   if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
+   flush_state++;
+
if (flush_state > COMMIT_TRANS) {
commit_cycles++;
if (commit_cycles > 2) {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 63d1f9d8b8c7..dd0e6f8d6b6e 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush,
{ FLUSH_DELAYED_REFS_NR,"FLUSH_DELAYED_REFS_NR"},   
\
{ FLUSH_DELAYED_REFS,   "FLUSH_ELAYED_REFS"},   
\
{ ALLOC_CHUNK,  "ALLOC_CHUNK"}, 
\
+   { ALLOC_CHUNK_FORCE,"ALLOC_CHUNK_FORCE"},   
\
{ COMMIT_TRANS, "COMMIT_TRANS"})
 
 TRACE_EVENT(btrfs_flush_space,
-- 
2.14.3



[PATCH 5/8] btrfs: don't enospc all tickets on flush failure

2018-11-21 Thread Josef Bacik
With the introduction of the per-inode block_rsv it became possible to
have really really large reservation requests made because of data
fragmentation.  Since the ticket stuff assumed that we'd always have
relatively small reservation requests it just killed all tickets if we
were unable to satisfy the current request.  However this is generally
not the case anymore.  So fix this logic to instead see if we had a
ticket that we were able to give some reservation to, and if we were
continue the flushing loop again.  Likewise we make the tickets use the
space_info_add_old_bytes() method of returning what reservation they did
receive in hopes that it could satisfy reservations down the line.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 45 +
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e6bb6ce23c84..983d086fa768 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4791,6 +4791,7 @@ static void shrink_delalloc(struct btrfs_fs_info 
*fs_info, u64 to_reclaim,
 }
 
 struct reserve_ticket {
+   u64 orig_bytes;
u64 bytes;
int error;
struct list_head list;
@@ -5012,7 +5013,7 @@ static inline int need_do_async_reclaim(struct 
btrfs_fs_info *fs_info,
!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
 }
 
-static void wake_all_tickets(struct list_head *head)
+static bool wake_all_tickets(struct list_head *head)
 {
struct reserve_ticket *ticket;
 
@@ -5021,7 +5022,10 @@ static void wake_all_tickets(struct list_head *head)
list_del_init(&ticket->list);
ticket->error = -ENOSPC;
wake_up(&ticket->wait);
+   if (ticket->bytes != ticket->orig_bytes)
+   return true;
}
+   return false;
 }
 
 /*
@@ -5089,8 +5093,12 @@ static void btrfs_async_reclaim_metadata_space(struct 
work_struct *work)
if (flush_state > COMMIT_TRANS) {
commit_cycles++;
if (commit_cycles > 2) {
-   wake_all_tickets(&space_info->tickets);
-   space_info->flush = 0;
+   if (wake_all_tickets(&space_info->tickets)) {
+   flush_state = FLUSH_DELAYED_ITEMS_NR;
+   commit_cycles--;
+   } else {
+   space_info->flush = 0;
+   }
} else {
flush_state = FLUSH_DELAYED_ITEMS_NR;
}
@@ -5142,10 +5150,11 @@ static void priority_reclaim_metadata_space(struct 
btrfs_fs_info *fs_info,
 
 static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
   struct btrfs_space_info *space_info,
-  struct reserve_ticket *ticket, u64 orig_bytes)
+  struct reserve_ticket *ticket)
 
 {
DEFINE_WAIT(wait);
+   u64 reclaim_bytes = 0;
int ret = 0;
 
spin_lock(&space_info->lock);
@@ -5166,14 +5175,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info 
*fs_info,
ret = ticket->error;
if (!list_empty(&ticket->list))
list_del_init(&ticket->list);
-   if (ticket->bytes && ticket->bytes < orig_bytes) {
-   u64 num_bytes = orig_bytes - ticket->bytes;
-   update_bytes_may_use(space_info, -num_bytes);
-   trace_btrfs_space_reservation(fs_info, "space_info",
- space_info->flags, num_bytes, 0);
-   }
+   if (ticket->bytes && ticket->bytes < ticket->orig_bytes)
+   reclaim_bytes = ticket->orig_bytes - ticket->bytes;
spin_unlock(&space_info->lock);
 
+   if (reclaim_bytes)
+   space_info_add_old_bytes(fs_info, space_info, reclaim_bytes);
return ret;
 }
 
@@ -5199,6 +5206,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info 
*fs_info,
 {
struct reserve_ticket ticket;
u64 used;
+   u64 reclaim_bytes = 0;
int ret = 0;
 
ASSERT(orig_bytes);
@@ -5234,6 +5242,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info 
*fs_info,
 * the list and we will do our own flushing further down.
 */
if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
+   ticket.orig_bytes = orig_bytes;
ticket.bytes = orig_bytes;
ticket.error = 0;
init_waitqueue_head(&ticket.wait);
@@ -5274,25 +5283,21 @@ static int __reserve_metadata_bytes(struct 
btrfs_fs_info *fs_info,
return ret;
 
if (flush == BTRFS_RESERVE_FLUSH_ALL)
-   return wait_reserve_ticket(fs_info, space_info, &ticket,
-  orig_bytes);
+   return wait_reserve_ticket(fs_info, 

[PATCH 7/8] btrfs: be more explicit about allowed flush states

2018-11-21 Thread Josef Bacik
For FLUSH_LIMIT flushers we really can only allocate chunks and flush
delayed inode items, everything else is problematic.  I added a bunch of
new states and it led to weirdness in the FLUSH_LIMIT case because I
forgot about how it worked.  So instead explicitly declare the states
that are ok for flushing with FLUSH_LIMIT and use that for our state
machine.  Then as we add new things that are safe we can just add them
to this list.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0e9ba77e5316..e31980d451c2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5112,12 +5112,18 @@ void btrfs_init_async_reclaim_work(struct work_struct 
*work)
INIT_WORK(work, btrfs_async_reclaim_metadata_space);
 }
 
+static const enum btrfs_flush_state priority_flush_states[] = {
+   FLUSH_DELAYED_ITEMS_NR,
+   FLUSH_DELAYED_ITEMS,
+   ALLOC_CHUNK,
+};
+
 static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
struct btrfs_space_info *space_info,
struct reserve_ticket *ticket)
 {
u64 to_reclaim;
-   int flush_state = FLUSH_DELAYED_ITEMS_NR;
+   int flush_state = 0;
 
spin_lock(&space_info->lock);
to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info,
@@ -5129,7 +5135,8 @@ static void priority_reclaim_metadata_space(struct 
btrfs_fs_info *fs_info,
spin_unlock(&space_info->lock);
 
do {
-   flush_space(fs_info, space_info, to_reclaim, flush_state);
+   flush_space(fs_info, space_info, to_reclaim,
+   priority_flush_states[flush_state]);
flush_state++;
spin_lock(_info->lock);
if (ticket->bytes == 0) {
@@ -5137,15 +5144,7 @@ static void priority_reclaim_metadata_space(struct 
btrfs_fs_info *fs_info,
return;
}
spin_unlock(_info->lock);
-
-   /*
-* Priority flushers can't wait on delalloc without
-* deadlocking.
-*/
-   if (flush_state == FLUSH_DELALLOC ||
-   flush_state == FLUSH_DELALLOC_WAIT)
-   flush_state = ALLOC_CHUNK;
-   } while (flush_state < COMMIT_TRANS);
+   } while (flush_state < ARRAY_SIZE(priority_flush_states));
 }
 
 static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
-- 
2.14.3



[PATCH 0/8] Enospc cleanups and fixes

2018-11-21 Thread Josef Bacik
The delayed refs rsv patches exposed a bunch of issues in our enospc
infrastructure that needed to be addressed.  These aren't really one coherent
group, but they are all around flushing and reservations.
may_commit_transaction() needed to be updated a little bit, and we needed to add
a new state to force chunk allocation if things got dicey.  Also because we can
end up needing to reserve a whole bunch of extra space for outstanding delayed
refs, we needed to add the ability to fail with ENOSPC only the tickets that
were too big to satisfy, instead of failing all of the tickets.  There's also a
fix in here for
one of the corner cases where we didn't quite have enough space reserved for the
delayed refs we were generating during evict().  Thanks,

Josef


[PATCH 6/8] btrfs: loop in inode_rsv_refill

2018-11-21 Thread Josef Bacik
With severe fragmentation we can end up with our inode rsv size being
huge during writeout, which would cause us to need to make very large
metadata reservations.  However we may not actually need that much once
writeout is complete.  So instead try to make our reservation, and if we
couldn't make it, re-calculate our new reservation size and try again.
If our reservation size doesn't change between tries then we know we are
actually out of space and can error out.
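
A rough stand-alone model of that retry loop (invented helpers such as
reserve_bytes() and refill_target(), not the btrfs functions; the real change
is in the diff below):

#include <stdio.h>

static unsigned long long rsv_size = 8 * 1024 * 1024; /* shrinks as writeout completes */
static unsigned long long rsv_reserved;

static unsigned long long refill_target(void)
{
	return rsv_size > rsv_reserved ? rsv_size - rsv_reserved : 0;
}

static int reserve_bytes(unsigned long long bytes)
{
	/* pretend only small reservations can succeed */
	if (bytes > 1024 * 1024)
		return -1;
	rsv_reserved += bytes;
	return 0;
}

int main(void)
{
	unsigned long long need = refill_target(), last = 0;
	int ret;

	do {
		ret = reserve_bytes(need);
		if (ret) {
			last = need;
			rsv_size /= 2;   /* stand-in for flushing shrinking the size */
			need = refill_target();
		}
	} while (ret && last != need);   /* give up only when the target stops changing */

	printf("%s, reserved %llu\n", ret ? "ENOSPC" : "ok", rsv_reserved);
	return 0;
}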

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 56 --
 1 file changed, 41 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 983d086fa768..0e9ba77e5316 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5776,6 +5776,21 @@ int btrfs_block_rsv_refill(struct btrfs_root *root,
return ret;
 }
 
+static inline void __get_refill_bytes(struct btrfs_block_rsv *block_rsv,
+ u64 *metadata_bytes, u64 *qgroup_bytes)
+{
+   *metadata_bytes = 0;
+   *qgroup_bytes = 0;
+
+   spin_lock(_rsv->lock);
+   if (block_rsv->reserved < block_rsv->size)
+   *metadata_bytes = block_rsv->size - block_rsv->reserved;
+   if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
+   *qgroup_bytes = block_rsv->qgroup_rsv_size -
+   block_rsv->qgroup_rsv_reserved;
+   spin_unlock(_rsv->lock);
+}
+
 /**
  * btrfs_inode_rsv_refill - refill the inode block rsv.
  * @inode - the inode we are refilling.
@@ -5791,25 +5806,37 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode 
*inode,
 {
struct btrfs_root *root = inode->root;
struct btrfs_block_rsv *block_rsv = >block_rsv;
-   u64 num_bytes = 0;
+   u64 num_bytes = 0, last = 0;
u64 qgroup_num_bytes = 0;
int ret = -ENOSPC;
 
-   spin_lock(_rsv->lock);
-   if (block_rsv->reserved < block_rsv->size)
-   num_bytes = block_rsv->size - block_rsv->reserved;
-   if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size)
-   qgroup_num_bytes = block_rsv->qgroup_rsv_size -
-  block_rsv->qgroup_rsv_reserved;
-   spin_unlock(_rsv->lock);
-
+   __get_refill_bytes(block_rsv, _bytes, _num_bytes);
if (num_bytes == 0)
return 0;
 
-   ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true);
-   if (ret)
-   return ret;
-   ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
+   do {
+   ret = btrfs_qgroup_reserve_meta_prealloc(root, 
qgroup_num_bytes, true);
+   if (ret)
+   return ret;
+   ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
+   if (ret) {
+   btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
+   last = num_bytes;
+   /*
+* If we are fragmented we can end up with a lot of
+* outstanding extents which will make our size be much
+* larger than our reserved amount.  If we happen to
+* try to do a reservation here that may result in us
+* trying to do a pretty hefty reservation, which we may
+* not need once delalloc flushing happens.  If this is
+* the case try and do the reserve again.
+*/
+   if (flush == BTRFS_RESERVE_FLUSH_ALL)
+   __get_refill_bytes(block_rsv, _bytes,
+  _num_bytes);
+   }
+   } while (ret && last != num_bytes);
+
if (!ret) {
block_rsv_add_bytes(block_rsv, num_bytes, false);
trace_btrfs_space_reservation(root->fs_info, "delalloc",
@@ -5819,8 +5846,7 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode 
*inode,
spin_lock(_rsv->lock);
block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
spin_unlock(_rsv->lock);
-   } else
-   btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
+   }
return ret;
 }
 
-- 
2.14.3



[PATCH 2/8] btrfs: dump block_rsv when dumping space info

2018-11-21 Thread Josef Bacik
For enospc_debug, having the block rsvs in the dump is super helpful to see
if we've done something wrong.

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
Reviewed-by: David Sterba 
---
 fs/btrfs/extent-tree.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0dca250dc02e..7a30fbc05e5e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8052,6 +8052,15 @@ static noinline int find_free_extent(struct 
btrfs_fs_info *fs_info,
return ret;
 }
 
+#define DUMP_BLOCK_RSV(fs_info, rsv_name)  \
+do {   \
+   struct btrfs_block_rsv *__rsv = &(fs_info)->rsv_name;   \
+   spin_lock(&__rsv->lock);\
+   btrfs_info(fs_info, #rsv_name ": size %llu reserved %llu",  \
+  __rsv->size, __rsv->reserved);   \
+   spin_unlock(&__rsv->lock);  \
+} while (0)
+
 static void dump_space_info(struct btrfs_fs_info *fs_info,
struct btrfs_space_info *info, u64 bytes,
int dump_block_groups)
@@ -8071,6 +8080,12 @@ static void dump_space_info(struct btrfs_fs_info 
*fs_info,
info->bytes_readonly);
spin_unlock(>lock);
 
+   DUMP_BLOCK_RSV(fs_info, global_block_rsv);
+   DUMP_BLOCK_RSV(fs_info, trans_block_rsv);
+   DUMP_BLOCK_RSV(fs_info, chunk_block_rsv);
+   DUMP_BLOCK_RSV(fs_info, delayed_block_rsv);
+   DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv);
+
if (!dump_block_groups)
return;
 
-- 
2.14.3



[PATCH 8/8] btrfs: reserve extra space during evict()

2018-11-21 Thread Josef Bacik
We could generate a lot of delayed refs in evict but never have any left
over space from our block rsv to make up for that fact.  So reserve some
extra space and give it to the transaction so it can be used to refill
the delayed refs rsv every loop through the truncate path.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index cae30f6c095f..3da9ac463344 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5258,13 +5258,15 @@ static struct btrfs_trans_handle 
*evict_refill_and_join(struct btrfs_root *root,
 {
struct btrfs_fs_info *fs_info = root->fs_info;
struct btrfs_block_rsv *global_rsv = _info->global_block_rsv;
+   u64 delayed_refs_extra = btrfs_calc_trans_metadata_size(fs_info, 1);
int failures = 0;
 
for (;;) {
struct btrfs_trans_handle *trans;
int ret;
 
-   ret = btrfs_block_rsv_refill(root, rsv, rsv->size,
+   ret = btrfs_block_rsv_refill(root, rsv,
+rsv->size + delayed_refs_extra,
 BTRFS_RESERVE_FLUSH_LIMIT);
 
if (ret && ++failures > 2) {
@@ -5273,9 +5275,28 @@ static struct btrfs_trans_handle 
*evict_refill_and_join(struct btrfs_root *root,
return ERR_PTR(-ENOSPC);
}
 
+   /*
+* Evict can generate a large amount of delayed refs without
+* having a way to add space back since we exhaust our temporary
+* block rsv.  We aren't allowed to do FLUSH_ALL in this case
+* because we could deadlock with so many things in the flushing
+* code, so we have to try and hold some extra space to
+* compensate for our delayed ref generation.  If we can't get
+* that space then we need to see if we can steal our minimum from
+* the global reserve.  We will be ratelimited by the amount of
+* space we have for the delayed refs rsv, so we'll end up
+* committing and trying again.
+*/
trans = btrfs_join_transaction(root);
-   if (IS_ERR(trans) || !ret)
+   if (IS_ERR(trans) || !ret) {
+   if (!IS_ERR(trans)) {
+   trans->block_rsv = _info->trans_block_rsv;
+   trans->bytes_reserved = delayed_refs_extra;
+   btrfs_block_rsv_migrate(rsv, trans->block_rsv,
+   delayed_refs_extra, 1);
+   }
return trans;
+   }
 
/*
 * Try to steal from the global reserve if there is space for
-- 
2.14.3



[PATCH 1/8] btrfs: check if free bgs for commit

2018-11-21 Thread Josef Bacik
may_commit_transaction will skip committing the transaction if we don't
have enough pinned space or if we're trying to find space for a SYSTEM
chunk.  However, if we have pending free block groups in this transaction
we still want to commit as we may be able to allocate a chunk to make
our reservation.  So instead of just returning ENOSPC, check if we have
free block groups pending, and if so commit the transaction to allow us
to use that free space.
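
A tiny sketch of that decision (invented names, plain C, not the kernel code;
the actual change follows in the diff):

#include <stdio.h>

/*
 * A commit is worth attempting either when enough pinned bytes would be
 * returned, or when the transaction has block groups pending removal that
 * a later pass could turn into a fresh chunk allocation.
 */
static int commit_might_help(unsigned long long pinned,
			     unsigned long long needed,
			     int pending_free_bgs)
{
	return pending_free_bgs || pinned >= needed;
}

int main(void)
{
	/* not enough pinned space, but a block group will be freed */
	printf("%d\n", commit_might_help(0, 1 << 20, 1)); /* 1: commit anyway */
	printf("%d\n", commit_might_help(0, 1 << 20, 0)); /* 0: would be ENOSPC */
	return 0;
}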

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
---
 fs/btrfs/extent-tree.c | 33 +++--
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8af68b13fa27..0dca250dc02e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4843,10 +4843,18 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
if (!bytes)
return 0;
 
-   /* See if there is enough pinned space to make this reservation */
-   if (__percpu_counter_compare(_info->total_bytes_pinned,
-  bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
+   trans = btrfs_join_transaction(fs_info->extent_root);
+   if (IS_ERR(trans))
+   return PTR_ERR(trans);
+
+   /*
+* See if there is enough pinned space to make this reservation, or if
+* we have bg's that are going to be freed, allowing us to possibly do a
+* chunk allocation the next loop through.
+*/
+   if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, >transaction->flags) ||
+   __percpu_counter_compare(_info->total_bytes_pinned, bytes,
+BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
goto commit;
 
/*
@@ -4854,7 +4862,7 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
 * this reservation.
 */
if (space_info != delayed_rsv->space_info)
-   return -ENOSPC;
+   goto enospc;
 
spin_lock(_rsv->lock);
reclaim_bytes += delayed_rsv->reserved;
@@ -4868,17 +4876,14 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
bytes -= reclaim_bytes;
 
if (__percpu_counter_compare(_info->total_bytes_pinned,
-  bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
-   return -ENOSPC;
-   }
-
+bytes,
+BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
+   goto enospc;
 commit:
-   trans = btrfs_join_transaction(fs_info->extent_root);
-   if (IS_ERR(trans))
-   return -ENOSPC;
-
return btrfs_commit_transaction(trans);
+enospc:
+   btrfs_end_transaction(trans);
+   return -ENOSPC;
 }
 
 /*
-- 
2.14.3



[PATCH 6/6] btrfs: fix truncate throttling

2018-11-21 Thread Josef Bacik
We have a bunch of magic to make sure we're throttling delayed refs when
truncating a file.  Now that we have a delayed refs rsv and a mechanism
for refilling that reserve, simply use that instead of all of this magic.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c | 79 
 1 file changed, 17 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8532a2eb56d1..cae30f6c095f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4437,31 +4437,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry 
*dentry)
return err;
 }
 
-static int truncate_space_check(struct btrfs_trans_handle *trans,
-   struct btrfs_root *root,
-   u64 bytes_deleted)
-{
-   struct btrfs_fs_info *fs_info = root->fs_info;
-   int ret;
-
-   /*
-* This is only used to apply pressure to the enospc system, we don't
-* intend to use this reservation at all.
-*/
-   bytes_deleted = btrfs_csum_bytes_to_leaves(fs_info, bytes_deleted);
-   bytes_deleted *= fs_info->nodesize;
-   ret = btrfs_block_rsv_add(root, _info->trans_block_rsv,
- bytes_deleted, BTRFS_RESERVE_NO_FLUSH);
-   if (!ret) {
-   trace_btrfs_space_reservation(fs_info, "transaction",
- trans->transid,
- bytes_deleted, 1);
-   trans->bytes_reserved += bytes_deleted;
-   }
-   return ret;
-
-}
-
 /*
  * Return this if we need to call truncate_block for the last bit of the
  * truncate.
@@ -4506,7 +4481,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
u64 bytes_deleted = 0;
bool be_nice = false;
bool should_throttle = false;
-   bool should_end = false;
 
BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY);
 
@@ -4719,15 +4693,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
btrfs_abort_transaction(trans, ret);
break;
}
-   if (btrfs_should_throttle_delayed_refs(trans))
-   btrfs_async_run_delayed_refs(fs_info,
-   trans->delayed_ref_updates * 2,
-   trans->transid, 0);
if (be_nice) {
-   if (truncate_space_check(trans, root,
-extent_num_bytes)) {
-   should_end = true;
-   }
if (btrfs_should_throttle_delayed_refs(trans))
should_throttle = true;
}
@@ -4738,7 +4704,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
 
if (path->slots[0] == 0 ||
path->slots[0] != pending_del_slot ||
-   should_throttle || should_end) {
+   should_throttle) {
if (pending_del_nr) {
ret = btrfs_del_items(trans, root, path,
pending_del_slot,
@@ -4750,23 +4716,24 @@ int btrfs_truncate_inode_items(struct 
btrfs_trans_handle *trans,
pending_del_nr = 0;
}
btrfs_release_path(path);
-   if (should_throttle) {
-   unsigned long updates = 
trans->delayed_ref_updates;
-   if (updates) {
-   trans->delayed_ref_updates = 0;
-   ret = btrfs_run_delayed_refs(trans,
-  updates * 2);
-   if (ret)
-   break;
-   }
-   }
+
/*
-* if we failed to refill our space rsv, bail out
-* and let the transaction restart
+* We can generate a lot of delayed refs, so we need to
+* throttle every once in a while and make sure we're
+* adding enough space to keep up with the work we are
+* generating.  Since we hold a transaction here we
+* can't flush, and we don't want to FLUSH_LIMIT because
+* we could have generated too many delayed refs to
+* actually allocate, so just bail if we're short and
+* let the normal reservation dance happen higher up.
 

[PATCH 5/6] btrfs: introduce delayed_refs_rsv

2018-11-21 Thread Josef Bacik
From: Josef Bacik 

Traditionally we've had voodoo in btrfs to account for the space that
delayed refs may take up by having a global_block_rsv.  This works most
of the time, except when it doesn't.  We've had issues reported and seen
in production where sometimes the global reserve is exhausted during
transaction commit before we can run all of our delayed refs, resulting
in an aborted transaction.  Because of this voodoo we have equally
dubious flushing semantics around throttling delayed refs which we often
get wrong.

So instead give them their own block_rsv.  This way we can always know
exactly how much outstanding space we need for delayed refs.  This
allows us to make sure we are constantly filling that reservation up
with space, and allows us to put more precise pressure on the enospc
system.  Instead of doing math to see if it's a good time to throttle,
the normal enospc code will be invoked if we have a lot of delayed refs
pending, and they will be run via the normal flushing mechanism.

For now the delayed_refs_rsv will hold the reservations for the delayed
refs, the block group updates, and deleting csums.  We could have a
separate rsv for the block group updates, but the csum deletion stuff is
still handled via the delayed_refs so that will stay there.
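
As a hedged, stand-alone model of the bookkeeping (the per-item cost and the
helper names here are made up; the real accounting lives in the diff below):
every ref head queued grows the reservation's target size, and finishing heads
gives the same amount back.

#include <stdio.h>

#define PER_ITEM_COST (16ULL * 1024) /* stand-in for the per-item metadata cost */

struct block_rsv {
	unsigned long long size;     /* how much we want set aside */
	unsigned long long reserved; /* how much we actually hold */
};

static struct block_rsv delayed_refs_rsv;

static void track_new_ref_heads(unsigned long long nr)
{
	delayed_refs_rsv.size += nr * PER_ITEM_COST;
}

static void ref_heads_finished(unsigned long long nr)
{
	delayed_refs_rsv.size -= nr * PER_ITEM_COST;
	/* anything held beyond the new size can go back to the space pool */
	if (delayed_refs_rsv.reserved > delayed_refs_rsv.size)
		delayed_refs_rsv.reserved = delayed_refs_rsv.size;
}

int main(void)
{
	track_new_ref_heads(10);
	delayed_refs_rsv.reserved = delayed_refs_rsv.size; /* pretend a refill succeeded */
	ref_heads_finished(4);
	printf("size=%llu reserved=%llu\n",
	       delayed_refs_rsv.size, delayed_refs_rsv.reserved);
	return 0;
}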

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h |  26 ++--
 fs/btrfs/delayed-ref.c   |  28 -
 fs/btrfs/disk-io.c   |   4 +
 fs/btrfs/extent-tree.c   | 279 +++
 fs/btrfs/inode.c |   4 +-
 fs/btrfs/transaction.c   |  77 ++--
 include/trace/events/btrfs.h |   2 +
 7 files changed, 313 insertions(+), 107 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8b41ec42f405..0c6d589c8ce4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -448,8 +448,9 @@ struct btrfs_space_info {
 #defineBTRFS_BLOCK_RSV_TRANS   3
 #defineBTRFS_BLOCK_RSV_CHUNK   4
 #defineBTRFS_BLOCK_RSV_DELOPS  5
-#defineBTRFS_BLOCK_RSV_EMPTY   6
-#defineBTRFS_BLOCK_RSV_TEMP7
+#define BTRFS_BLOCK_RSV_DELREFS6
+#defineBTRFS_BLOCK_RSV_EMPTY   7
+#defineBTRFS_BLOCK_RSV_TEMP8
 
 struct btrfs_block_rsv {
u64 size;
@@ -812,6 +813,8 @@ struct btrfs_fs_info {
struct btrfs_block_rsv chunk_block_rsv;
/* block reservation for delayed operations */
struct btrfs_block_rsv delayed_block_rsv;
+   /* block reservation for delayed refs */
+   struct btrfs_block_rsv delayed_refs_rsv;
 
struct btrfs_block_rsv empty_block_rsv;
 
@@ -2628,7 +2631,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct 
btrfs_fs_info *fs_info,
 }
 
 int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans);
-int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans);
+bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
 void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
 const u64 start);
 void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg);
@@ -2742,10 +2745,12 @@ enum btrfs_reserve_flush_enum {
 enum btrfs_flush_state {
FLUSH_DELAYED_ITEMS_NR  =   1,
FLUSH_DELAYED_ITEMS =   2,
-   FLUSH_DELALLOC  =   3,
-   FLUSH_DELALLOC_WAIT =   4,
-   ALLOC_CHUNK =   5,
-   COMMIT_TRANS=   6,
+   FLUSH_DELAYED_REFS_NR   =   3,
+   FLUSH_DELAYED_REFS  =   4,
+   FLUSH_DELALLOC  =   5,
+   FLUSH_DELALLOC_WAIT =   6,
+   ALLOC_CHUNK =   7,
+   COMMIT_TRANS=   8,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
@@ -2796,6 +2801,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info 
*fs_info,
 void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
 struct btrfs_block_rsv *block_rsv,
 u64 num_bytes);
+void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr);
+void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans);
+int btrfs_throttle_delayed_refs(struct btrfs_fs_info *fs_info,
+   enum btrfs_reserve_flush_enum flush);
+void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
+  struct btrfs_block_rsv *src,
+  u64 num_bytes);
 int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache);
 void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache);
 void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 48725fa757a3..96de92588f06 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -474,11 

[PATCH 0/6] Delayed refs rsv

2018-11-21 Thread Josef Bacik
This patchset changes how we do space reservations for delayed refs.  We were
hitting probably 20-40 enospc abort's per day in production while running
delayed refs at transaction commit time.  This means we ran out of space in the
global reserve and couldn't easily get more space in use_block_rsv().

The global reserve has grown to cover everything we don't reserve space
explicitly for, and we've grown a lot of weird ad-hoc heuristics to know if
we're running short on space and when it's time to force a commit.  A failure
rate of 20-40 file systems when we run hundreds of thousands of them isn't super
high, but cleaning up this code will make things less ugly and more predictable.

Thus the delayed refs rsv.  We always know how many delayed refs we have
outstanding, and although running them generates more we can use the global
reserve for that spill over, which fits better into its desired use than a full
blown reservation.  This first approach is to simply take how many times we're
reserving space for and multiply that by 2 in order to save enough space for the
delayed refs that could be generated.  This is a naive approach and will
probably evolve, but for now it works.

With this patchset we've gone down to 2-8 failures per week.  It's not perfect,
there are some corner cases that still need to be addressed, but is
significantly better than what we had.  Thanks,

Josef


[PATCH 4/6] btrfs: only track ref_heads in delayed_ref_updates

2018-11-21 Thread Josef Bacik
From: Josef Bacik 

We use this number to figure out how many delayed refs to run, but
__btrfs_run_delayed_refs really only checks it each time we need a new
delayed ref head, so we always run at least one ref head completely no
matter how many items are on it.  Fix the accounting to only be
adjusted when we add/remove a ref head.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/delayed-ref.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index b3e4c9fcb664..48725fa757a3 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct 
btrfs_trans_handle *trans,
ref->in_tree = 0;
btrfs_put_delayed_ref(ref);
atomic_dec(_refs->num_entries);
-   if (trans->delayed_ref_updates)
-   trans->delayed_ref_updates--;
 }
 
 static bool merge_ref(struct btrfs_trans_handle *trans,
@@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle 
*trans,
if (ref->action == BTRFS_ADD_DELAYED_REF)
list_add_tail(>add_list, >ref_add_list);
atomic_inc(>num_entries);
-   trans->delayed_ref_updates++;
spin_unlock(>lock);
return ret;
 }
-- 
2.14.3



[PATCH 2/6] btrfs: add cleanup_ref_head_accounting helper

2018-11-21 Thread Josef Bacik
From: Josef Bacik 

We were missing some quota cleanups in check_ref_cleanup, so break the
ref head accounting cleanup into a helper and call that from both
check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
we don't screw up accounting in the future for other things that we add.

Reviewed-by: Omar Sandoval 
Reviewed-by: Liu Bo 
Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 67 +-
 1 file changed, 39 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c36b3a42f2bb..e3ed3507018d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle 
*trans,
return ret ? ret : 1;
 }
 
+static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
+   struct btrfs_delayed_ref_head *head)
+{
+   struct btrfs_fs_info *fs_info = trans->fs_info;
+   struct btrfs_delayed_ref_root *delayed_refs =
+   >transaction->delayed_refs;
+
+   if (head->total_ref_mod < 0) {
+   struct btrfs_space_info *space_info;
+   u64 flags;
+
+   if (head->is_data)
+   flags = BTRFS_BLOCK_GROUP_DATA;
+   else if (head->is_system)
+   flags = BTRFS_BLOCK_GROUP_SYSTEM;
+   else
+   flags = BTRFS_BLOCK_GROUP_METADATA;
+   space_info = __find_space_info(fs_info, flags);
+   ASSERT(space_info);
+   percpu_counter_add_batch(_info->total_bytes_pinned,
+  -head->num_bytes,
+  BTRFS_TOTAL_BYTES_PINNED_BATCH);
+
+   if (head->is_data) {
+   spin_lock(_refs->lock);
+   delayed_refs->pending_csums -= head->num_bytes;
+   spin_unlock(_refs->lock);
+   }
+   }
+
+   /* Also free its reserved qgroup space */
+   btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
+ head->qgroup_reserved);
+}
+
 static int cleanup_ref_head(struct btrfs_trans_handle *trans,
struct btrfs_delayed_ref_head *head)
 {
@@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
spin_unlock(>lock);
spin_unlock(_refs->lock);
 
-   trace_run_delayed_ref_head(fs_info, head, 0);
-
-   if (head->total_ref_mod < 0) {
-   struct btrfs_space_info *space_info;
-   u64 flags;
-
-   if (head->is_data)
-   flags = BTRFS_BLOCK_GROUP_DATA;
-   else if (head->is_system)
-   flags = BTRFS_BLOCK_GROUP_SYSTEM;
-   else
-   flags = BTRFS_BLOCK_GROUP_METADATA;
-   space_info = __find_space_info(fs_info, flags);
-   ASSERT(space_info);
-   percpu_counter_add_batch(_info->total_bytes_pinned,
-  -head->num_bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH);
-
-   if (head->is_data) {
-   spin_lock(_refs->lock);
-   delayed_refs->pending_csums -= head->num_bytes;
-   spin_unlock(_refs->lock);
-   }
-   }
-
if (head->must_insert_reserved) {
btrfs_pin_extent(fs_info, head->bytenr,
 head->num_bytes, 1);
@@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
}
}
 
-   /* Also free its reserved qgroup space */
-   btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
- head->qgroup_reserved);
+   cleanup_ref_head_accounting(trans, head);
+
+   trace_run_delayed_ref_head(fs_info, head, 0);
btrfs_delayed_ref_unlock(head);
btrfs_put_delayed_ref_head(head);
return 0;
@@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (head->must_insert_reserved)
ret = 1;
 
+   cleanup_ref_head_accounting(trans, head);
mutex_unlock(>mutex);
btrfs_put_delayed_ref_head(head);
return ret;
-- 
2.14.3



[PATCH 3/6] btrfs: cleanup extent_op handling

2018-11-21 Thread Josef Bacik
From: Josef Bacik 

The cleanup_extent_op function would actually run the extent_op if it
needed running, which made the name sort of a misnomer.  Change it to
run_and_cleanup_extent_op, and move the actual cleanup work to
cleanup_extent_op so it can be used by check_ref_cleanup() in order to
unify the extent op handling.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 36 +++-
 1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e3ed3507018d..8a776dc9cb38 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2424,19 +2424,33 @@ static void unselect_delayed_ref_head(struct 
btrfs_delayed_ref_root *delayed_ref
btrfs_delayed_ref_unlock(head);
 }
 
-static int cleanup_extent_op(struct btrfs_trans_handle *trans,
-struct btrfs_delayed_ref_head *head)
+static struct btrfs_delayed_extent_op *
+cleanup_extent_op(struct btrfs_trans_handle *trans,
+ struct btrfs_delayed_ref_head *head)
 {
struct btrfs_delayed_extent_op *extent_op = head->extent_op;
-   int ret;
 
if (!extent_op)
-   return 0;
-   head->extent_op = NULL;
+   return NULL;
+
if (head->must_insert_reserved) {
+   head->extent_op = NULL;
btrfs_free_delayed_extent_op(extent_op);
-   return 0;
+   return NULL;
}
+   return extent_op;
+}
+
+static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans,
+struct btrfs_delayed_ref_head *head)
+{
+   struct btrfs_delayed_extent_op *extent_op =
+   cleanup_extent_op(trans, head);
+   int ret;
+
+   if (!extent_op)
+   return 0;
+   head->extent_op = NULL;
spin_unlock(>lock);
ret = run_delayed_extent_op(trans, head, extent_op);
btrfs_free_delayed_extent_op(extent_op);
@@ -2488,7 +2502,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
 
delayed_refs = >transaction->delayed_refs;
 
-   ret = cleanup_extent_op(trans, head);
+   ret = run_and_cleanup_extent_op(trans, head);
if (ret < 0) {
unselect_delayed_ref_head(delayed_refs, head);
btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
@@ -6977,12 +6991,8 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (!RB_EMPTY_ROOT(>ref_tree.rb_root))
goto out;
 
-   if (head->extent_op) {
-   if (!head->must_insert_reserved)
-   goto out;
-   btrfs_free_delayed_extent_op(head->extent_op);
-   head->extent_op = NULL;
-   }
+   if (cleanup_extent_op(trans, head) != NULL)
+   goto out;
 
/*
 * waiting for the lock here would deadlock.  If someone else has it
-- 
2.14.3



[PATCH 1/6] btrfs: add btrfs_delete_ref_head helper

2018-11-21 Thread Josef Bacik
From: Josef Bacik 

We do this dance in cleanup_ref_head and check_ref_cleanup, so unify it
into a helper and clean up the calling functions.

Signed-off-by: Josef Bacik 
Reviewed-by: Omar Sandoval 
---
 fs/btrfs/delayed-ref.c | 14 ++
 fs/btrfs/delayed-ref.h |  3 ++-
 fs/btrfs/extent-tree.c | 22 +++---
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 9301b3ad9217..b3e4c9fcb664 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
return head;
 }
 
+void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+  struct btrfs_delayed_ref_head *head)
+{
+   lockdep_assert_held(_refs->lock);
+   lockdep_assert_held(>lock);
+
+   rb_erase_cached(>href_node, _refs->href_root);
+   RB_CLEAR_NODE(>href_node);
+   atomic_dec(_refs->num_entries);
+   delayed_refs->num_heads--;
+   if (head->processing == 0)
+   delayed_refs->num_heads_ready--;
+}
+
 /*
  * Helper to insert the ref_node to the tail or merge with tail.
  *
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 8e20c5cb5404..d2af974f68a1 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
btrfs_delayed_ref_head *head)
 {
mutex_unlock(>mutex);
 }
-
+void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+  struct btrfs_delayed_ref_head *head);
 
 struct btrfs_delayed_ref_head *btrfs_select_ref_head(
struct btrfs_delayed_ref_root *delayed_refs);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d242a1174e50..c36b3a42f2bb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
spin_unlock(_refs->lock);
return 1;
}
-   delayed_refs->num_heads--;
-   rb_erase_cached(>href_node, _refs->href_root);
-   RB_CLEAR_NODE(>href_node);
+   btrfs_delete_ref_head(delayed_refs, head);
spin_unlock(>lock);
spin_unlock(_refs->lock);
-   atomic_dec(_refs->num_entries);
 
trace_run_delayed_ref_head(fs_info, head, 0);
 
@@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (!mutex_trylock(>mutex))
goto out;
 
-   /*
-* at this point we have a head with no other entries.  Go
-* ahead and process it.
-*/
-   rb_erase_cached(>href_node, _refs->href_root);
-   RB_CLEAR_NODE(>href_node);
-   atomic_dec(_refs->num_entries);
-
-   /*
-* we don't take a ref on the node because we're removing it from the
-* tree, so we just steal the ref the tree was holding.
-*/
-   delayed_refs->num_heads--;
-   if (head->processing == 0)
-   delayed_refs->num_heads_ready--;
+   btrfs_delete_ref_head(delayed_refs, head);
head->processing = 0;
+
spin_unlock(>lock);
spin_unlock(_refs->lock);
 
-- 
2.14.3



Re: [PATCH] Btrfs: fix access to available allocation bits when starting balance

2018-11-21 Thread David Sterba
On Mon, Nov 19, 2018 at 09:48:12AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> The available allocation bits members from struct btrfs_fs_info are
> protected by a sequence lock, and when starting balance we access them
> incorrectly in two different ways:
> 
> 1) In the read sequence lock loop at btrfs_balance() we use the values we
>read from fs_info->avail_*_alloc_bits and we can immediately do actions
>that have side effects and can not be undone (printing a message and
>jumping to a label). This is wrong because a retry might be needed, so
>our actions must not have side effects and must be repeatable as long
>as read_seqretry() returns a non-zero value. In other words, we were
>essentially ignoring the sequence lock;
> 
> 2) Right below the read sequence lock loop, we were reading the values
>from avail_metadata_alloc_bits and avail_data_alloc_bits without any
>protection from concurrent writers, that is, reading them outside of
>the read sequence lock critical section.
> 
> So fix this by making sure we only read the available allocation bits
> while in a read sequence lock critical section and that what we do in the
> critical section is repeatable (has nothing that can not be undone) so
> that any eventual retry that is needed is handled properly.
> 
> Fixes: de98ced9e743 ("Btrfs: use seqlock to protect fs_info->avail_{data, 
> metadata, system}_alloc_bits")
> Fixes: 14506127979a ("btrfs: fix a bogus warning when converting only data or 
> metadata")
> Signed-off-by: Filipe Manana 

Added to misc-next, thanks.
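
For reference, the read-side pattern the quoted explanation converges on (take
a snapshot inside the retry loop, and only act on it once the sequence is
stable) can be modelled in plain C like this; the sequence counter below is a
toy stand-in, not the kernel's read_seqbegin()/read_seqretry() API:

#include <stdio.h>

static unsigned int seq_counter;      /* bumped to an odd value by writers */
static unsigned long long avail_bits; /* value the counter protects */

static unsigned int read_seq_begin(void)
{
	return seq_counter;
}

static int read_seq_retry(unsigned int seq)
{
	/* retry if a writer was active or has run since we started */
	return (seq & 1) || seq != seq_counter;
}

int main(void)
{
	unsigned int seq;
	unsigned long long allowed;

	do {
		seq = read_seq_begin();
		allowed = avail_bits; /* only copy values, no side effects here */
	} while (read_seq_retry(seq));

	/* safe to print, branch, or bail out based on the snapshot now */
	printf("allowed alloc bits: 0x%llx\n", allowed);
	return 0;
}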


[PATCH RESEND 1/2] btrfs-progs: fix kernel version parsing on some versions past 3.0

2018-11-21 Thread Adam Borowski
The code fails if the third section is missing (like "4.18") or is followed
by anything but "." or "-".  This happens for example if we're not exactly
at a tag and CONFIG_LOCALVERSION_AUTO=n (which results in "4.18.5+").
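
A self-contained model of the relaxed parse (simplified and with made-up helper
names, but it shows why both "4.18" and "4.18.5+" should be accepted):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int is_numerical(const char *s)
{
	if (!*s)
		return 0;
	for (; *s; s++)
		if (!isdigit((unsigned char)*s))
			return 0;
	return 1;
}

/* simplified stand-in for get_running_kernel_version() after the fix */
static unsigned int parse_version(const char *in)
{
	char buf[64], *save = NULL, *tok;
	unsigned int version;

	strncpy(buf, in, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = 0;

	tok = strtok_r(buf, ".", &save);
	if (!tok || !is_numerical(tok))
		return (unsigned int)-1;
	version = atoi(tok) << 16;

	tok = strtok_r(NULL, ".", &save);
	if (!tok || !is_numerical(tok))
		return (unsigned int)-1;
	version |= atoi(tok) << 8;

	/* the third section is optional and may carry a local version suffix */
	tok = strtok_r(NULL, ".", &save);
	if (tok && is_numerical(tok))
		version |= atoi(tok);

	return version;
}

int main(void)
{
	printf("4.18    -> 0x%x\n", parse_version("4.18"));
	printf("4.18.5+ -> 0x%x\n", parse_version("4.18.5+"));
	printf("4.19.1  -> 0x%x\n", parse_version("4.19.1"));
	return 0;
}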

Signed-off-by: Adam Borowski 
---
 fsfeatures.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fsfeatures.c b/fsfeatures.c
index 7d85d60f..66111bf4 100644
--- a/fsfeatures.c
+++ b/fsfeatures.c
@@ -216,11 +216,8 @@ u32 get_running_kernel_version(void)
return (u32)-1;
version |= atoi(tmp) << 8;
tmp = strtok_r(NULL, ".", );
-   if (tmp) {
-   if (!string_is_numerical(tmp))
-   return (u32)-1;
+   if (tmp && string_is_numerical(tmp))
version |= atoi(tmp);
-   }
 
return version;
 }
-- 
2.19.1



[PATCH RESEND-v3 2/2] btrfs-progs: defrag: open files RO on new enough kernels

2018-11-21 Thread Adam Borowski
Defragging an executable conflicts both ways with it being run, resulting in
ETXTBSY.  This either makes defrag fail or prevents the program from being
executed.

Kernels 4.19-rc1 and later allow defragging files you could have possibly
opened rw, even if the passed descriptor is ro (commit 616d374efa23).

Signed-off-by: Adam Borowski 
---
v2: more eloquent description; root can't defrag RO on old kernels (unlike
dedupe)
v3: more eloquentier description; s/defrag_ro/defrag_open_mode/

 cmds-filesystem.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index d1af21ee..c67bf5da 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -39,12 +40,14 @@
 #include "list_sort.h"
 #include "disk-io.h"
 #include "help.h"
+#include "fsfeatures.h"
 
 /*
  * for btrfs fi show, we maintain a hash of fsids we've already printed.
  * This way we don't print dups if a given FS is mounted more than once.
  */
 static struct seen_fsid *seen_fsid_hash[SEEN_FSID_HASH_SIZE] = {NULL,};
+static mode_t defrag_open_mode = O_RDONLY;
 
 static const char * const filesystem_cmd_group_usage[] = {
"btrfs filesystem []  []",
@@ -878,7 +881,7 @@ static int defrag_callback(const char *fpath, const struct 
stat *sb,
if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
if (defrag_global_verbose)
printf("%s\n", fpath);
-   fd = open(fpath, O_RDWR);
+   fd = open(fpath, defrag_open_mode);
if (fd < 0) {
goto error;
}
@@ -915,6 +918,9 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int compress_type = BTRFS_COMPRESS_NONE;
DIR *dirstream;
 
+   if (get_running_kernel_version() < KERNEL_VERSION(4,19,0))
+   defrag_open_mode = O_RDWR;
+
/*
 * Kernel has a different default (256K) that is supposed to be safe,
 * but it does not defragment very well. The 32M will likely lead to
@@ -1015,7 +1021,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int defrag_err = 0;
 
dirstream = NULL;
-   fd = open_file_or_dir(argv[i], );
+   fd = open_file_or_dir3(argv[i], , defrag_open_mode);
if (fd < 0) {
error("cannot open %s: %m", argv[i]);
ret = -errno;
-- 
2.19.1



Re: [PATCH] Btrfs: fix race between enabling quotas and subvolume creation

2018-11-21 Thread David Sterba
On Mon, Nov 19, 2018 at 04:20:34PM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> We have a race between enabling quotas and subvolume creation that causes
> subvolume creation to fail with -EINVAL, and the following diagram shows
> how it happens:
> 
>           CPU 0                                        CPU 1
> 
>  btrfs_ioctl()
>   btrfs_ioctl_quota_ctl()
>    btrfs_quota_enable()
>     mutex_lock(fs_info->qgroup_ioctl_lock)
> 
>                                               btrfs_ioctl()
>                                                create_subvol()
>                                                 btrfs_qgroup_inherit()
>                                                  -> save fs_info->quota_root
>                                                     into quota_root
>                                                     -> stores a NULL value
>                                                  -> tries to lock the mutex
>                                                     qgroup_ioctl_lock
>                                                     -> blocks waiting for
>                                                        the task at CPU0
> 
>     -> sets BTRFS_FS_QUOTA_ENABLED in fs_info
>     -> sets quota_root in fs_info->quota_root
>        (non-NULL value)
> 
>     mutex_unlock(fs_info->qgroup_ioctl_lock)
> 
>                                                  -> checks quota enabled
>                                                     flag is set
>                                                  -> returns -EINVAL because
>                                                     fs_info->quota_root was
>                                                     NULL before it acquired
>                                                     the mutex
>                                                     qgroup_ioctl_lock
>                                                -> ioctl returns -EINVAL
> 
> Returning -EINVAL to user space will be confusing if all the arguments
> passed to the subvolume creation ioctl were valid.
> 
> Fix it by grabbing the value from fs_info->quota_root after acquiring
> the mutex.
> 
> Signed-off-by: Filipe Manana 

Reviewed-by: David Sterba 

Added to misc-next, thanks.


[PATCHv3] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

2018-11-21 Thread Nikolay Borisov
Running btrfs/124 in a loop hung up on me sporadically with the
following call trace:
btrfs   D0  5760   5324 0x
Call Trace:
 ? __schedule+0x243/0x800
 schedule+0x33/0x90
 btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
 ? wait_woken+0xa0/0xa0
 btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
 btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
 btrfs_relocate_chunk+0x49/0x100 [btrfs]
 btrfs_balance+0xbeb/0x1740 [btrfs]
 btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
 btrfs_ioctl+0x1691/0x3110 [btrfs]
 ? lockdep_hardirqs_on+0xed/0x180
 ? __handle_mm_fault+0x8e7/0xfb0
 ? _raw_spin_unlock+0x24/0x30
 ? __handle_mm_fault+0x8e7/0xfb0
 ? do_vfs_ioctl+0xa5/0x6e0
 ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
 do_vfs_ioctl+0xa5/0x6e0
 ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
 ksys_ioctl+0x3a/0x70
 __x64_sys_ioctl+0x16/0x20
 do_syscall_64+0x60/0x1b0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

This happens because during page writeback it's valid for
writepage_delalloc to instantiate a delalloc range which doesn't
belong to the page currently being written back.

The reason this case is valid is that find_lock_delalloc_range returns
any available range after the passed delalloc_start, ignoring whether
the page under writeback is within that range.
In turn ordered extents (OE) are always created for the returned range
from find_lock_delalloc_range. If, however, a failure occurs while OE
are being created then the clean up code in btrfs_cleanup_ordered_extents
will be called.

Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
the case of such a 'foreign' range being processed and instead always
assumes that the range the OE are created for belongs to the page. This
leads to the first page of such a foreign range not being cleaned up,
since it is deliberately skipped by the current cleanup code.

Fix this by correctly checking whether the current page belongs to the
range being instantiated and, if so, adjusting the range parameters
passed for cleanup. If it doesn't, then just clean the whole OE
range directly.
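
A minimal stand-alone sketch of that containment check (plain C with invented
names; the real change is in the diff below):

#include <stdio.h>

#define PAGE_SIZE 4096ULL

static void adjust_cleanup_range(unsigned long long page_start,
				 unsigned long long *offset,
				 unsigned long long *bytes)
{
	unsigned long long page_end = page_start + PAGE_SIZE - 1;

	if (page_start >= *offset && page_end <= *offset + *bytes - 1) {
		/* locked page is part of this range: skip its first page */
		*offset += PAGE_SIZE;
		*bytes -= PAGE_SIZE;
	}
	/* otherwise the range is "foreign" and is cleaned up in full */
}

int main(void)
{
	unsigned long long off = 0, len = 16 * PAGE_SIZE;

	adjust_cleanup_range(0, &off, &len);        /* page inside the range */
	printf("inside:  offset=%llu bytes=%llu\n", off, len);

	off = 32 * PAGE_SIZE;
	len = 16 * PAGE_SIZE;
	adjust_cleanup_range(0, &off, &len);        /* foreign range */
	printf("foreign: offset=%llu bytes=%llu\n", off, len);
	return 0;
}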

Signed-off-by: Nikolay Borisov 
Reviewed-by: Josef Bacik 
---
V3:
 * Re-worded the commit for easier comprehension
 * Added RB tag from Josef

V2:
 * Fix compilation failure due to missing parentheses
 * Fixed the "Fixes" tag.
 fs/btrfs/inode.c | 29 -
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a0fc564626df..9f65f131f244 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -110,17 +110,17 @@ static void __endio_write_update_ordered(struct inode 
*inode,
  * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING
  * and EXTENT_DELALLOC simultaneously, because that causes the reserved 
metadata
  * to be released, which we want to happen only when finishing the ordered
- * extent (btrfs_finish_ordered_io()). Also note that the caller of
- * btrfs_run_delalloc_range already does proper cleanup for the first page of
- * the range, that is, it invokes the callback writepage_end_io_hook() for the
- * range of the first page.
+ * extent (btrfs_finish_ordered_io()).
  */
 static inline void btrfs_cleanup_ordered_extents(struct inode *inode,
-const u64 offset,
-const u64 bytes)
+struct page *locked_page,
+u64 offset, u64 bytes)
 {
unsigned long index = offset >> PAGE_SHIFT;
unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT;
+   u64 page_start = page_offset(locked_page);
+   u64 page_end = page_start + PAGE_SIZE - 1;
+
struct page *page;
 
while (index <= end_index) {
@@ -131,8 +131,18 @@ static inline void btrfs_cleanup_ordered_extents(struct 
inode *inode,
ClearPagePrivate2(page);
put_page(page);
}
-   return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
-   bytes - PAGE_SIZE, false);
+
+   /*
+* In case this page belongs to the delalloc range being instantiated
+* then skip it, since the first page of a range is going to be
+* properly cleaned up by the caller of run_delalloc_range
+*/
+   if (page_start >= offset && page_end <= (offset + bytes - 1)) {
+   offset += PAGE_SIZE;
+   bytes -= PAGE_SIZE;
+   }
+
+   return __endio_write_update_ordered(inode, offset, bytes, false);
 }
 
 static int btrfs_dirty_inode(struct inode *inode);
@@ -1605,7 +1615,8 @@ int btrfs_run_delalloc_range(void *private_data, struct 
page *locked_page,