Re: [PATCH] btrfs: only run delayed refs if we're committing
On 21.11.18 г. 21:10 ч., Josef Bacik wrote: > I noticed in a giant dbench run that we spent a lot of time on lock > contention while running transaction commit. This is because dbench > results in a lot of fsync()'s that do a btrfs_transaction_commit(), and > they all run the delayed refs first thing, so they all contend with > each other. This leads to seconds of 0 throughput. Change this to only > run the delayed refs if we're the ones committing the transaction. This > makes the latency go away and we get no more lock contention. > > Reviewed-by: Omar Sandoval > Signed-off-by: Josef Bacik Wohooo, we no longer run delayed refs at arbitrary points during transaction commit. Reviewed-by: Nikolay Borisov > --- > fs/btrfs/transaction.c | 24 +--- > 1 file changed, 9 insertions(+), 15 deletions(-) > > diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c > index 3c1be9db897c..41cc96cc59a3 100644 > --- a/fs/btrfs/transaction.c > +++ b/fs/btrfs/transaction.c > @@ -1918,15 +1918,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle > *trans) > btrfs_trans_release_metadata(trans); > trans->block_rsv = NULL; > > - /* make a pass through all the delayed refs we have so far > - * any runnings procs may add more while we are here > - */ > - ret = btrfs_run_delayed_refs(trans, 0); > - if (ret) { > - btrfs_end_transaction(trans); > - return ret; > - } > - > cur_trans = trans->transaction; > > /* > @@ -1938,12 +1929,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle > *trans) > > btrfs_create_pending_block_groups(trans); > > - ret = btrfs_run_delayed_refs(trans, 0); > - if (ret) { > - btrfs_end_transaction(trans); > - return ret; > - } > - > if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, _trans->flags)) { > int run_it = 0; > > @@ -2014,6 +1999,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle > *trans) > spin_unlock(_info->trans_lock); > } > > + /* > + * We are now the only one in the commit area, we can run delayed refs > + * without hitting a bunch of lock contention from a lot of people > + * trying to commit the transaction at once. > + */ > + ret = btrfs_run_delayed_refs(trans, 0); > + if (ret) > + goto cleanup_transaction; > + > extwriter_counter_dec(cur_trans, trans->type); > > ret = btrfs_start_delalloc_flush(fs_info); >
[PATCH v12] Add cli and ioctl to forget scanned device(s)
v12: Fixed coding style - leave space between " : ". v11: btrfs-progs: Bring the code into the else part of if(forget). Use strerror to print the erorr instead of ret. v10: Make btrfs-progs changes more readable. With an effort to keep the known bug [1] as it is.. [1] The cli 'btrfs device scan --all /dev/sdb' which should have scanned only one device, ends up scanning all the devices and I am not trying to fix this bug in this patch because.. . -d|--all is marked as deprecated, I hope -d option would go away . For now some script might be using this bug as a feature, and fixing this bug might lead to mount failure. v9: Make forget as a btrfs device scan option. Use forget in the fstests, now you can run fstests with btrfs as rootfs which helps to exercise the uuid_mutex lock. v8: Change log update in the kernel patch. v7: Use struct btrfs_ioctl_vol_args (instead of struct btrfs_ioctl_vol_args_v2) as its inline with other ioctl btrfs-control The CLI usage/features remains same. However internally the ioctl flag is not required to delete all the unmounted devices. Instead leave btrfs_ioctl_vol_args::name NULL. v6: Use the changed fn name btrfs_free_stale_devices(). Change in title: Old v5: Cover-letter: [PATCH v5] Add cli and ioctl to ignore a scanned device Kernel: [PATCH v5] btrfs: introduce feature to ignore a btrfs device Progs: [PATCH v5] btrfs-progs: add 'btrfs device ignore' cli v5: Adds feature to delete all stale devices Reuses btrfs_free_stale_devices() fn and so depends on the patch-set [1] in the ML. Uses struct btrfs_ioctl_vol_args_v2 instead of struct btrfs_ioctl_vol_args as arg Does the device path matching instead of btrfs_device matching (we won't delete the mounted device as btrfs_free_stale_devices() checks for it) v4: No change. But as the ML thread may be confusing, so resend. v3: No change. Send to correct ML. v2: Accepts review from Nikolay, details are in the specific patch. Patch 1/2 is renamed from [PATCH 1/2] btrfs: refactor btrfs_free_stale_device() to get device list delete to [PATCH 1/2] btrfs: add function to device list delete Adds cli and ioctl to forget a scanned device or forget all stale devices in the kernel. Anand Jain (1): btrfs: introduce feature to forget a btrfs device fs/btrfs/super.c | 3 +++ fs/btrfs/volumes.c | 9 + fs/btrfs/volumes.h | 1 + include/uapi/linux/btrfs.h | 2 ++ 4 files changed, 15 insertions(+) Anand Jain (1): btrfs-progs: add cli to forget one or all scanned devices cmds-device.c | 63 ++- ioctl.h | 2 ++ 2 files changed, 56 insertions(+), 9 deletions(-) [1] Anand Jain (1): fstests: btrfs use forget if not reload common/btrfs| 20 tests/btrfs/124 | 6 +++--- tests/btrfs/125 | 6 +++--- tests/btrfs/154 | 6 +++--- tests/btrfs/164 | 4 ++-- 5 files changed, 31 insertions(+), 11 deletions(-) -- 1.8.3.1
[PATCH] fstests: btrfs use forget if not reload
[I will send this the xfstest ML after kernel and progs patch has been integrated]. btrfs reload was introduced to cleanup the device list inside the btrfs kernel module. The problem with the reload approach is that you can't run btrfs test cases 124,125, 154 and 164 on the system with btrfs as root fs. Now as we are introducing the btrfs forget feature as an btrfs device scan option [1], so here are its pertaining changes in the fstests. So these changes ensures we use the forget feature where available and if its not then falls back to the reload approach. [1] btrfs-progs: add cli to forget one or all scanned devices btrfs: introduce feature to forget a btrfs device Signed-off-by: Anand Jain --- common/btrfs| 20 tests/btrfs/124 | 6 +++--- tests/btrfs/125 | 6 +++--- tests/btrfs/154 | 6 +++--- tests/btrfs/164 | 4 ++-- 5 files changed, 31 insertions(+), 11 deletions(-) diff --git a/common/btrfs b/common/btrfs index 26dc0bb9600f..c7fbec11c8c1 100644 --- a/common/btrfs +++ b/common/btrfs @@ -374,3 +374,23 @@ _scratch_btrfs_sectorsize() $BTRFS_UTIL_PROG inspect-internal dump-super $SCRATCH_DEV |\ grep sectorsize | awk '{print $2}' } + +_btrfs_supports_forget() +{ + $BTRFS_UTIL_PROG device scan --help | grep -wq forget && \ + $BTRFS_UTIL_PROG device scan --forget > /dev/null 2>&1 +} + +_require_btrfs_forget_if_not_fs_loadable() +{ + _btrfs_supports_forget && return + + _require_loadable_fs_module "btrfs" +} + +_btrfs_forget_if_not_fs_reload() +{ + _btrfs_supports_forget && return + + _reload_fs_module "btrfs" +} diff --git a/tests/btrfs/124 b/tests/btrfs/124 index ce3ad6aa3a58..edbeff1443f5 100755 --- a/tests/btrfs/124 +++ b/tests/btrfs/124 @@ -51,7 +51,7 @@ _supported_fs btrfs _supported_os Linux _require_scratch_dev_pool 2 _test_unmount -_require_loadable_fs_module "btrfs" +_require_btrfs_forget_if_not_fs_loadable _scratch_dev_pool_get 2 @@ -86,7 +86,7 @@ echo "clean btrfs ko" >> $seqres.full _scratch_unmount # un-scan the btrfs devices -_reload_fs_module "btrfs" +_btrfs_forget_if_not_fs_reload echo >> $seqres.full echo "-Write degraded mount fill upto $max_fs_sz bytes-" >> $seqres.full @@ -125,7 +125,7 @@ echo echo "Mount degraded with the other dev" _scratch_unmount # un-scan the btrfs devices -_reload_fs_module "btrfs" +_btrfs_forget_if_not_fs_reload _mount -o degraded $dev2 $SCRATCH_MNT >>$seqres.full 2>&1 _run_btrfs_util_prog filesystem show checkpoint3=`md5sum $SCRATCH_MNT/tf2` diff --git a/tests/btrfs/125 b/tests/btrfs/125 index e38de264b28e..9161a2ddeaad 100755 --- a/tests/btrfs/125 +++ b/tests/btrfs/125 @@ -50,7 +50,7 @@ _supported_fs btrfs _supported_os Linux _require_scratch_dev_pool 3 _test_unmount -_require_loadable_fs_module "btrfs" +_require_btrfs_forget_if_not_fs_loadable _scratch_dev_pool_get 3 @@ -102,7 +102,7 @@ echo "unmount" >> $seqres.full _scratch_unmount echo "clean btrfs ko" >> $seqres.full # un-scan the btrfs devices -_reload_fs_module "btrfs" +_btrfs_forget_if_not_fs_reload _mount -o degraded,device=$dev2 $dev1 $SCRATCH_MNT >>$seqres.full 2>&1 dd if=/dev/zero of="$SCRATCH_MNT"/tf2 bs=$bs count=$count \ >>$seqres.full 2>&1 @@ -138,7 +138,7 @@ echo "Mount degraded but with other dev" _scratch_unmount # un-scan the btrfs devices -_reload_fs_module "btrfs" +_btrfs_forget_if_not_fs_reload _mount -o degraded,device=${dev2} $dev3 $SCRATCH_MNT >>$seqres.full 2>&1 diff --git a/tests/btrfs/154 b/tests/btrfs/154 index 99ea232aba4c..01745d20e9fc 100755 --- a/tests/btrfs/154 +++ b/tests/btrfs/154 @@ -36,7 +36,7 @@ rm -f $seqres.full _supported_fs btrfs _supported_os Linux _require_scratch_dev_pool 2 -_require_loadable_fs_module "btrfs" +_require_btrfs_forget_if_not_fs_loadable _scratch_dev_pool_get 2 @@ -90,7 +90,7 @@ degrade_mount_write() echo "clean btrfs ko" >> $seqres.full # un-scan the btrfs devices - _reload_fs_module "btrfs" + _btrfs_forget_if_not_fs_reload _mount -o degraded $DEV1 $SCRATCH_MNT >>$seqres.full 2>&1 cnt=$(( $COUNT/10 )) dd if=/dev/urandom of="$SCRATCH_MNT"/tf1 bs=$bs count=$cnt \ @@ -142,7 +142,7 @@ verify() echo "unmount" >> $seqres.full _scratch_unmount - _reload_fs_module "btrfs" + _btrfs_forget_if_not_fs_reload _mount -o degraded $DEV2 $SCRATCH_MNT >>$seqres.full 2>&1 verify_checkpoint1=`md5sum $SCRATCH_MNT/tf1` verify_checkpoint2=`md5sum $SCRATCH_MNT/tf2` diff --git a/tests/btrfs/164 b/tests/btrfs/164 index 097191a0e493..55042c4035e0 100755 --- a/tests/btrfs/164 +++ b/tests/btrfs/164 @@ -36,7 +36,7 @@ rm -f $seqres.full # Modify as appropriate. _supported_fs btrfs _supported_os Linux -_require_loadable_fs_module "btrfs" +_require_btrfs_forget_if_not_fs_loadable _require_scratch_dev_pool 2 _scratch_dev_pool_get 2 @@ -69,7 +69,7 @@ delete_seed()
[PATCH v12] btrfs: introduce feature to forget a btrfs device
Support for a new command 'btrfs dev forget [dev]' is proposed here to undo the effects of 'btrfs dev scan [dev]'. For this purpose this patch proposes to use ioctl #5 as it was empty. IOW(BTRFS_IOCTL_MAGIC, 5, ..) This patch adds new ioctl BTRFS_IOC_FORGET_DEV which can be sent from the /dev/btrfs-control to forget one or all devices, (devices which are not mounted) from the btrfs kernel. The argument it takes is struct btrfs_ioctl_vol_args, and ::name can be set to specify the device path. And all unmounted devices can be removed from the kernel if no device path is provided. Again, the devices are removed only if the relevant fsid aren't mounted. This new cli can provide.. . Release of unwanted btrfs_fs_devices and btrfs_devices memory if the device is not going to be mounted. . Ability to mount the device in degraded mode when one of the other device is corrupted like in split brain raid1. . Running test cases which requires btrfs.ko-reload if the rootfs is btrfs. Signed-off-by: Anand Jain Reviewed-by: Nikolay Borisov --- v11->v12: fix coding style add spacing before after ":". v1->v11: Pls ref to the cover-letter. (sorry about that). fs/btrfs/super.c | 3 +++ fs/btrfs/volumes.c | 9 + fs/btrfs/volumes.h | 1 + include/uapi/linux/btrfs.h | 2 ++ 4 files changed, 15 insertions(+) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index d3c6bbc0aa3a..eba2966913ae 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -2247,6 +2247,9 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd, ret = PTR_ERR_OR_ZERO(device); mutex_unlock(_mutex); break; + case BTRFS_IOC_FORGET_DEV: + ret = btrfs_forget_devices(vol->name); + break; case BTRFS_IOC_DEVICES_READY: mutex_lock(_mutex); device = btrfs_scan_one_device(vol->name, FMODE_READ, diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 8e36cbb355df..48415a9edd46 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1397,6 +1397,15 @@ static int btrfs_read_disk_super(struct block_device *bdev, u64 bytenr, return 0; } +int btrfs_forget_devices(const char *path) +{ + mutex_lock(_mutex); + btrfs_free_stale_devices(strlen(path) ? path : NULL, NULL); + mutex_unlock(_mutex); + + return 0; +} + /* * Look for a btrfs signature on a device. This may be called out of the mount path * and we are not allowed to call set_blocksize during the scan. The superblock diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 1d936ce282c3..a4a89d5050f5 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -414,6 +414,7 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, fmode_t flags, void *holder); struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags, void *holder); +int btrfs_forget_devices(const char *path); int btrfs_close_devices(struct btrfs_fs_devices *fs_devices); void btrfs_free_extra_devids(struct btrfs_fs_devices *fs_devices, int step); void btrfs_assign_next_active_device(struct btrfs_device *device, diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h index e0763bc4158e..c195896d478f 100644 --- a/include/uapi/linux/btrfs.h +++ b/include/uapi/linux/btrfs.h @@ -837,6 +837,8 @@ enum btrfs_err_code { struct btrfs_ioctl_vol_args) #define BTRFS_IOC_SCAN_DEV _IOW(BTRFS_IOCTL_MAGIC, 4, \ struct btrfs_ioctl_vol_args) +#define BTRFS_IOC_FORGET_DEV _IOW(BTRFS_IOCTL_MAGIC, 5, \ + struct btrfs_ioctl_vol_args) /* trans start and trans end are dangerous, and only for * use by applications that know how to avoid the * resulting deadlocks -- 1.8.3.1
[PATCH v12] btrfs-progs: add cli to forget one or all scanned devices
This patch adds cli btrfs device forget [dev] to remove the given device structure in the kernel if the device is unmounted. If no argument is given it shall remove all stale (device which are not mounted) from the kernel. Signed-off-by: Anand Jain Reviewed-by: Nikolay Borisov --- v11->v12: none. v1->v12: pls ref to the cover-letter. cmds-device.c | 72 --- ioctl.h | 2 ++ 2 files changed, 61 insertions(+), 13 deletions(-) diff --git a/cmds-device.c b/cmds-device.c index 2a05f70a76a9..280d6f555377 100644 --- a/cmds-device.c +++ b/cmds-device.c @@ -254,10 +254,32 @@ static int cmd_device_delete(int argc, char **argv) return _cmd_device_remove(argc, argv, cmd_device_delete_usage); } +static int btrfs_forget_devices(char *path) +{ + struct btrfs_ioctl_vol_args args; + int ret; + int fd; + + fd = open("/dev/btrfs-control", O_RDWR); + if (fd < 0) + return -errno; + + memset(, 0, sizeof(args)); + if (path) + strncpy_null(args.name, path); + ret = ioctl(fd, BTRFS_IOC_FORGET_DEV, ); + if (ret) + ret = -errno; + close(fd); + return ret; +} + static const char * const cmd_device_scan_usage[] = { - "btrfs device scan [(-d|--all-devices)| [...]]", - "Scan devices for a btrfs filesystem", + "btrfs device scan [(-d|--all-devices)|(-u|--forget)| "\ + "[...]]", + "Scan or forget (deregister) devices for a btrfs filesystem", " -d|--all-devices (deprecated)", + " -u|--forget [ ..]", NULL }; @@ -267,37 +289,53 @@ static int cmd_device_scan(int argc, char **argv) int devstart; int all = 0; int ret = 0; + int forget = 0; optind = 0; while (1) { int c; static const struct option long_options[] = { { "all-devices", no_argument, NULL, 'd'}, + { "forget", no_argument, NULL, 'u'}, { NULL, 0, NULL, 0} }; - c = getopt_long(argc, argv, "d", long_options, NULL); + c = getopt_long(argc, argv, "du", long_options, NULL); if (c < 0) break; switch (c) { case 'd': all = 1; break; + case 'u': + forget = 1; + break; default: usage(cmd_device_scan_usage); } } devstart = optind; + if (all && forget) + usage(cmd_device_scan_usage); + if (all && check_argc_max(argc - optind, 1)) usage(cmd_device_scan_usage); if (all || argc - optind == 0) { - printf("Scanning for Btrfs filesystems\n"); - ret = btrfs_scan_devices(); - error_on(ret, "error %d while scanning", ret); - ret = btrfs_register_all_devices(); - error_on(ret, "there are %d errors while registering devices", ret); + if (forget) { + ret = btrfs_forget_devices(NULL); + error_on(ret, "'%s', forget failed", +strerror(-ret)); + } else { + printf("Scanning for Btrfs filesystems\n"); + ret = btrfs_scan_devices(); + error_on(ret, "error %d while scanning", ret); + ret = btrfs_register_all_devices(); + error_on(ret, + "there are %d errors while registering devices", + ret); + } goto out; } @@ -315,11 +353,19 @@ static int cmd_device_scan(int argc, char **argv) ret = 1; goto out; } - printf("Scanning for Btrfs filesystems in '%s'\n", path); - if (btrfs_register_one_device(path) != 0) { - ret = 1; - free(path); - goto out; + if (forget) { + ret = btrfs_forget_devices(path); + if (ret) + error("Can't forget '%s': %s", + path, strerror(-ret)); + } else { + printf("Scanning for Btrfs filesystems in '%s'\n", + path); + if (btrfs_register_one_device(path) != 0) { + ret = 1; + free(path); + goto out; + }
Re: [PATCH 2/6] btrfs: add cleanup_ref_head_accounting helper
On 2018/11/22 上午2:59, Josef Bacik wrote: > From: Josef Bacik > > We were missing some quota cleanups in check_ref_cleanup, so break the > ref head accounting cleanup into a helper and call that from both > check_ref_cleanup and cleanup_ref_head. This will hopefully ensure that > we don't screw up accounting in the future for other things that we add. > > Reviewed-by: Omar Sandoval > Reviewed-by: Liu Bo > Signed-off-by: Josef Bacik > --- > fs/btrfs/extent-tree.c | 67 > +- > 1 file changed, 39 insertions(+), 28 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index c36b3a42f2bb..e3ed3507018d 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle > *trans, > return ret ? ret : 1; > } > > +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, > + struct btrfs_delayed_ref_head *head) > +{ > + struct btrfs_fs_info *fs_info = trans->fs_info; > + struct btrfs_delayed_ref_root *delayed_refs = > + >transaction->delayed_refs; > + > + if (head->total_ref_mod < 0) { > + struct btrfs_space_info *space_info; > + u64 flags; > + > + if (head->is_data) > + flags = BTRFS_BLOCK_GROUP_DATA; > + else if (head->is_system) > + flags = BTRFS_BLOCK_GROUP_SYSTEM; > + else > + flags = BTRFS_BLOCK_GROUP_METADATA; > + space_info = __find_space_info(fs_info, flags); > + ASSERT(space_info); > + percpu_counter_add_batch(_info->total_bytes_pinned, > +-head->num_bytes, > +BTRFS_TOTAL_BYTES_PINNED_BATCH); > + > + if (head->is_data) { > + spin_lock(_refs->lock); > + delayed_refs->pending_csums -= head->num_bytes; > + spin_unlock(_refs->lock); > + } > + } > + > + /* Also free its reserved qgroup space */ > + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, > + head->qgroup_reserved); This part will be removed in patch "[PATCH] btrfs: qgroup: Move reserved data account from btrfs_delayed_ref_head to btrfs_qgroup_extent_record". So there is one less thing to worry about in delayed ref head. Thanks, Qu > +} > + > static int cleanup_ref_head(struct btrfs_trans_handle *trans, > struct btrfs_delayed_ref_head *head) > { > @@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle > *trans, > spin_unlock(>lock); > spin_unlock(_refs->lock); > > - trace_run_delayed_ref_head(fs_info, head, 0); > - > - if (head->total_ref_mod < 0) { > - struct btrfs_space_info *space_info; > - u64 flags; > - > - if (head->is_data) > - flags = BTRFS_BLOCK_GROUP_DATA; > - else if (head->is_system) > - flags = BTRFS_BLOCK_GROUP_SYSTEM; > - else > - flags = BTRFS_BLOCK_GROUP_METADATA; > - space_info = __find_space_info(fs_info, flags); > - ASSERT(space_info); > - percpu_counter_add_batch(_info->total_bytes_pinned, > --head->num_bytes, > -BTRFS_TOTAL_BYTES_PINNED_BATCH); > - > - if (head->is_data) { > - spin_lock(_refs->lock); > - delayed_refs->pending_csums -= head->num_bytes; > - spin_unlock(_refs->lock); > - } > - } > - > if (head->must_insert_reserved) { > btrfs_pin_extent(fs_info, head->bytenr, >head->num_bytes, 1); > @@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle > *trans, > } > } > > - /* Also free its reserved qgroup space */ > - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, > - head->qgroup_reserved); > + cleanup_ref_head_accounting(trans, head); > + > + trace_run_delayed_ref_head(fs_info, head, 0); > btrfs_delayed_ref_unlock(head); > btrfs_put_delayed_ref_head(head); > return 0; > @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct > btrfs_trans_handle *trans, > if (head->must_insert_reserved) > ret = 1; > > + cleanup_ref_head_accounting(trans, head); > mutex_unlock(>mutex); > btrfs_put_delayed_ref_head(head); > return ret; > signature.asc Description: OpenPGP digital signature
[RFC][PATCH v4 09/09] btrfs: use common file type conversion
Deduplicate the btrfs file type conversion implementation - file systems that use the same file types as defined by POSIX do not need to define their own versions and can use the common helper functions decared in fs_types.h and implemented in fs_types.c Acked-by: David Sterba Signed-off-by: Amir Goldstein Signed-off-by: Phillip Potter --- fs/btrfs/btrfs_inode.h | 2 -- fs/btrfs/delayed-inode.c| 2 +- fs/btrfs/inode.c| 32 +++- include/uapi/linux/btrfs_tree.h | 2 ++ 4 files changed, 18 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 97d91e55b70a..bb01c804485f 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -196,8 +196,6 @@ struct btrfs_inode { struct inode vfs_inode; }; -extern unsigned char btrfs_filetype_table[]; - static inline struct btrfs_inode *BTRFS_I(const struct inode *inode) { return container_of(inode, struct btrfs_inode, vfs_inode); diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index c669f250d4a0..e61947f5eb76 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -1692,7 +1692,7 @@ int btrfs_readdir_delayed_dir_index(struct dir_context *ctx, name = (char *)(di + 1); name_len = btrfs_stack_dir_name_len(di); - d_type = btrfs_filetype_table[di->type]; + d_type = fs_ftype_to_dtype(di->type); btrfs_disk_key_to_cpu(, >location); over = !dir_emit(ctx, name, name_len, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 9ea4c6f0352f..8b7b1b29e2ad 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -72,17 +72,6 @@ struct kmem_cache *btrfs_trans_handle_cachep; struct kmem_cache *btrfs_path_cachep; struct kmem_cache *btrfs_free_space_cachep; -#define S_SHIFT 12 -static const unsigned char btrfs_type_by_mode[S_IFMT >> S_SHIFT] = { - [S_IFREG >> S_SHIFT]= BTRFS_FT_REG_FILE, - [S_IFDIR >> S_SHIFT]= BTRFS_FT_DIR, - [S_IFCHR >> S_SHIFT]= BTRFS_FT_CHRDEV, - [S_IFBLK >> S_SHIFT]= BTRFS_FT_BLKDEV, - [S_IFIFO >> S_SHIFT]= BTRFS_FT_FIFO, - [S_IFSOCK >> S_SHIFT] = BTRFS_FT_SOCK, - [S_IFLNK >> S_SHIFT]= BTRFS_FT_SYMLINK, -}; - static int btrfs_setsize(struct inode *inode, struct iattr *attr); static int btrfs_truncate(struct inode *inode, bool skip_writeback); static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent); @@ -5793,10 +5782,6 @@ static struct dentry *btrfs_lookup(struct inode *dir, struct dentry *dentry, return d_splice_alias(inode, dentry); } -unsigned char btrfs_filetype_table[] = { - DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK -}; - /* * All this infrastructure exists because dir_emit can fault, and we are holding * the tree lock when doing readdir. For now just allocate a buffer and copy @@ -5935,7 +5920,7 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx) name_ptr = (char *)(entry + 1); read_extent_buffer(leaf, name_ptr, (unsigned long)(di + 1), name_len); - put_unaligned(btrfs_filetype_table[btrfs_dir_type(leaf, di)], + put_unaligned(fs_ftype_to_dtype(btrfs_dir_type(leaf, di)), >type); btrfs_dir_item_key_to_cpu(leaf, di, ); put_unaligned(location.objectid, >ino); @@ -6340,7 +6325,20 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans, static inline u8 btrfs_inode_type(struct inode *inode) { - return btrfs_type_by_mode[(inode->i_mode & S_IFMT) >> S_SHIFT]; + /* +* compile-time asserts that generic FT_x types still match +* BTRFS_FT_x types +*/ + BUILD_BUG_ON(BTRFS_FT_UNKNOWN != FT_UNKNOWN); + BUILD_BUG_ON(BTRFS_FT_REG_FILE != FT_REG_FILE); + BUILD_BUG_ON(BTRFS_FT_DIR != FT_DIR); + BUILD_BUG_ON(BTRFS_FT_CHRDEV != FT_CHRDEV); + BUILD_BUG_ON(BTRFS_FT_BLKDEV != FT_BLKDEV); + BUILD_BUG_ON(BTRFS_FT_FIFO != FT_FIFO); + BUILD_BUG_ON(BTRFS_FT_SOCK != FT_SOCK); + BUILD_BUG_ON(BTRFS_FT_SYMLINK != FT_SYMLINK); + + return fs_umode_to_ftype(inode->i_mode); } /* diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h index aff1356c2bb8..4126ed7ee89a 100644 --- a/include/uapi/linux/btrfs_tree.h +++ b/include/uapi/linux/btrfs_tree.h @@ -307,6 +307,8 @@ * * Used by: * struct btrfs_dir_item.type + * + * Values 0..7 must match common file type values in fs_types.h. */ #define BTRFS_FT_UNKNOWN 0 #define BTRFS_FT_REG_FILE 1 -- 2.19.1
[PATCH 1/3] btrfs: run delayed iputs before committing
Delayed iputs means we can have final iputs of deleted inodes in the queue, which could potentially generate a lot of pinned space that could be free'd. So before we decide to commit the transaction for ENOPSC reasons, run the delayed iputs so that any potential space is free'd up. If there is and we freed enough we can then commit the transaction and potentially be able to make our reservation. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval --- fs/btrfs/extent-tree.c | 9 + 1 file changed, 9 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 90423b6749b7..3a90dc1d6b31 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4833,6 +4833,15 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, if (!bytes) return 0; + /* +* If we have pending delayed iputs then we could free up a bunch of +* pinned space, so make sure we run the iputs before we do our pinned +* bytes check below. +*/ + mutex_lock(_info->cleaner_delayed_iput_mutex); + btrfs_run_delayed_iputs(fs_info); + mutex_unlock(_info->cleaner_delayed_iput_mutex); + trans = btrfs_join_transaction(fs_info->extent_root); if (IS_ERR(trans)) return PTR_ERR(trans); -- 2.14.3
[PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput
The cleaner thread usually takes care of delayed iputs, with the exception of the btrfs_end_transaction_throttle path. The cleaner thread only gets woken up every 30 seconds, so instead wake it up to do it's work so that we can free up that space as quickly as possible. Reviewed-by: Filipe Manana Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3da9ac463344..3c42d8887183 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3264,6 +3264,7 @@ void btrfs_add_delayed_iput(struct inode *inode) ASSERT(list_empty(>delayed_iput)); list_add_tail(>delayed_iput, _info->delayed_iputs); spin_unlock(_info->delayed_iput_lock); + wake_up_process(fs_info->cleaner_kthread); } void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info) -- 2.14.3
[PATCH] btrfs: only run delayed refs if we're committing
I noticed in a giant dbench run that we spent a lot of time on lock contention while running transaction commit. This is because dbench results in a lot of fsync()'s that do a btrfs_transaction_commit(), and they all run the delayed refs first thing, so they all contend with each other. This leads to seconds of 0 throughput. Change this to only run the delayed refs if we're the ones committing the transaction. This makes the latency go away and we get no more lock contention. Reviewed-by: Omar Sandoval Signed-off-by: Josef Bacik --- fs/btrfs/transaction.c | 24 +--- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 3c1be9db897c..41cc96cc59a3 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1918,15 +1918,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) btrfs_trans_release_metadata(trans); trans->block_rsv = NULL; - /* make a pass through all the delayed refs we have so far -* any runnings procs may add more while we are here -*/ - ret = btrfs_run_delayed_refs(trans, 0); - if (ret) { - btrfs_end_transaction(trans); - return ret; - } - cur_trans = trans->transaction; /* @@ -1938,12 +1929,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) btrfs_create_pending_block_groups(trans); - ret = btrfs_run_delayed_refs(trans, 0); - if (ret) { - btrfs_end_transaction(trans); - return ret; - } - if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, _trans->flags)) { int run_it = 0; @@ -2014,6 +1999,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) spin_unlock(_info->trans_lock); } + /* +* We are now the only one in the commit area, we can run delayed refs +* without hitting a bunch of lock contention from a lot of people +* trying to commit the transaction at once. +*/ + ret = btrfs_run_delayed_refs(trans, 0); + if (ret) + goto cleanup_transaction; + extwriter_counter_dec(cur_trans, trans->type); ret = btrfs_start_delalloc_flush(fs_info); -- 2.14.3
[PATCH 0/3] Delayed iput fixes
Here are some delayed iput fixes. Delayed iputs can hold reservations for a while and there's no real good way to make sure they were gone for good, which means we could early enospc when in reality if we had just waited for the iput we would have had plenty of space. So fix this up by making us wait for delayed iputs when deciding if we need to commit for enospc flushing, and then cleanup and rework how we run delayed iputs to make it more straightforward to wait on them and make sure we're all done using them. Thanks, Josef
[PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
The throttle path doesn't take cleaner_delayed_iput_mutex, which means we could think we're done flushing iputs in the data space reservation path when we could have a throttler doing an iput. There's no real reason to serialize the delayed iput flushing, so instead of taking the cleaner_delayed_iput_mutex whenever we flush the delayed iputs just replace it with an atomic counter and a waitqueue. This removes the short (or long depending on how big the inode is) window where we think there are no more pending iputs when there really are some. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 4 +++- fs/btrfs/disk-io.c | 5 ++--- fs/btrfs/extent-tree.c | 9 + fs/btrfs/inode.c | 21 + 4 files changed, 31 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 709de7471d86..a835fe7076eb 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -912,7 +912,8 @@ struct btrfs_fs_info { spinlock_t delayed_iput_lock; struct list_head delayed_iputs; - struct mutex cleaner_delayed_iput_mutex; + atomic_t nr_delayed_iputs; + wait_queue_head_t delayed_iputs_wait; /* this protects tree_mod_seq_list */ spinlock_t tree_mod_seq_lock; @@ -3237,6 +3238,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root); int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size); void btrfs_add_delayed_iput(struct inode *inode); void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info); +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info); int btrfs_prealloc_file_range(struct inode *inode, int mode, u64 start, u64 num_bytes, u64 min_size, loff_t actual_len, u64 *alloc_hint); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index c5918ff8241b..3f81dfaefa32 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1692,9 +1692,7 @@ static int cleaner_kthread(void *arg) goto sleep; } - mutex_lock(_info->cleaner_delayed_iput_mutex); btrfs_run_delayed_iputs(fs_info); - mutex_unlock(_info->cleaner_delayed_iput_mutex); again = btrfs_clean_one_deleted_snapshot(root); mutex_unlock(_info->cleaner_mutex); @@ -2651,7 +2649,6 @@ int open_ctree(struct super_block *sb, mutex_init(_info->delete_unused_bgs_mutex); mutex_init(_info->reloc_mutex); mutex_init(_info->delalloc_root_mutex); - mutex_init(_info->cleaner_delayed_iput_mutex); seqlock_init(_info->profiles_lock); INIT_LIST_HEAD(_info->dirty_cowonly_roots); @@ -2673,6 +2670,7 @@ int open_ctree(struct super_block *sb, atomic_set(_info->defrag_running, 0); atomic_set(_info->qgroup_op_seq, 0); atomic_set(_info->reada_works_cnt, 0); + atomic_set(_info->nr_delayed_iputs, 0); atomic64_set(_info->tree_mod_seq, 0); fs_info->sb = sb; fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE; @@ -2750,6 +2748,7 @@ int open_ctree(struct super_block *sb, init_waitqueue_head(_info->transaction_wait); init_waitqueue_head(_info->transaction_blocked_wait); init_waitqueue_head(_info->async_submit_wait); + init_waitqueue_head(_info->delayed_iputs_wait); INIT_LIST_HEAD(_info->pinned_chunks); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3a90dc1d6b31..36f43876be22 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4272,8 +4272,9 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes) * operations. Wait for it to finish so that * more space is released. */ - mutex_lock(_info->cleaner_delayed_iput_mutex); - mutex_unlock(_info->cleaner_delayed_iput_mutex); + ret = btrfs_wait_on_delayed_iputs(fs_info); + if (ret) + return ret; goto again; } else { btrfs_end_transaction(trans); @@ -4838,9 +4839,9 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, * pinned space, so make sure we run the iputs before we do our pinned * bytes check below. */ - mutex_lock(_info->cleaner_delayed_iput_mutex); btrfs_run_delayed_iputs(fs_info); - mutex_unlock(_info->cleaner_delayed_iput_mutex); + wait_event(fs_info->delayed_iputs_wait, + atomic_read(_info->nr_delayed_iputs) == 0); trans = btrfs_join_transaction(fs_info->extent_root); if (IS_ERR(trans)) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3c42d8887183..57bf514a90eb 100644 --- a/fs/btrfs/inode.c +++
[PATCH 4/7] btrfs: call btrfs_create_pending_block_groups unconditionally
The first thing we do is loop through the list, this if (!list_empty()) btrfs_create_pending_block_groups(); thing is just wasted space. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 3 +-- fs/btrfs/transaction.c | 6 ++ 2 files changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index a0f8880ee5dd..b9b829c8825c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3004,8 +3004,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, } if (run_all) { - if (!list_empty(>new_bgs)) - btrfs_create_pending_block_groups(trans); + btrfs_create_pending_block_groups(trans); spin_lock(_refs->lock); node = rb_first_cached(_refs->href_root); diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index a4682a696fb6..826a15a07fce 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -845,8 +845,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans, btrfs_trans_release_metadata(trans); trans->block_rsv = NULL; - if (!list_empty(>new_bgs)) - btrfs_create_pending_block_groups(trans); + btrfs_create_pending_block_groups(trans); btrfs_trans_release_chunk_metadata(trans); @@ -1937,8 +1936,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) cur_trans->delayed_refs.flushing = 1; smp_wmb(); - if (!list_empty(>new_bgs)) - btrfs_create_pending_block_groups(trans); + btrfs_create_pending_block_groups(trans); ret = btrfs_run_delayed_refs(trans, 0); if (ret) { -- 2.14.3
[PATCH 6/7] btrfs: cleanup pending bgs on transaction abort
We may abort the transaction during a commit and not have a chance to run the pending bgs stuff, which will leave block groups on our list and cause us accounting issues and leaked memory. Fix this by running the pending bgs when we cleanup a transaction. Reviewed-by: Omar Sandoval Signed-off-by: Josef Bacik --- fs/btrfs/transaction.c | 4 1 file changed, 4 insertions(+) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 826a15a07fce..3c1be9db897c 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -2269,6 +2269,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) btrfs_scrub_continue(fs_info); cleanup_transaction: btrfs_trans_release_metadata(trans); + /* This cleans up the pending block groups list properly. */ + if (!trans->aborted) + trans->aborted = ret; + btrfs_create_pending_block_groups(trans); btrfs_trans_release_chunk_metadata(trans); trans->block_rsv = NULL; btrfs_warn(fs_info, "Skipping commit of aborted transaction."); -- 2.14.3
[PATCH 5/7] btrfs: just delete pending bgs if we are aborted
We still need to do all of the accounting cleanup for pending block groups if we abort. So set the ret to trans->aborted so if we aborted the cleanup happens and everybody is happy. Reviewed-by: Omar Sandoval Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index b9b829c8825c..90423b6749b7 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -10500,11 +10500,17 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans) struct btrfs_root *extent_root = fs_info->extent_root; struct btrfs_block_group_item item; struct btrfs_key key; - int ret = 0; + int ret; if (!trans->can_flush_pending_bgs) return; + /* +* If we aborted the transaction with pending bg's we need to just +* cleanup the list and carry on. +*/ + ret = trans->aborted; + while (!list_empty(>new_bgs)) { block_group = list_first_entry(>new_bgs, struct btrfs_block_group_cache, -- 2.14.3
[PATCH 7/7] btrfs: wait on ordered extents on abort cleanup
If we flip read-only before we initiate writeback on all dirty pages for ordered extents we've created then we'll have ordered extents left over on umount, which results in all sorts of bad things happening. Fix this by making sure we wait on ordered extents if we have to do the aborted transaction cleanup stuff. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 8 1 file changed, 8 insertions(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 8e7926c91e35..c5918ff8241b 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4171,6 +4171,14 @@ static void btrfs_destroy_all_ordered_extents(struct btrfs_fs_info *fs_info) spin_lock(_info->ordered_root_lock); } spin_unlock(_info->ordered_root_lock); + + /* +* We need this here because if we've been flipped read-only we won't +* get sync() from the umount, so we need to make sure any ordered +* extents that haven't had their dirty pages IO start writeout yet +* actually get run and error out properly. +*/ + btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1); } static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, -- 2.14.3
[PATCH 0/7] Abort cleanup fixes
A new xfstests that really hammers on transaction aborts (generic/495 I think?) uncovered a lot of random issues. Some of these were introduced with the new delayed refs rsv patches, others were just exposed by them, such as the pending bg stuff. With these patches in place I stopped getting all the random leftovers and WARN_ON()'s when running whichever xfstest that was and things are much smoother now. Thanks, Josef
[PATCH 3/7] btrfs: handle delayed ref head accounting cleanup in abort
We weren't doing any of the accounting cleanup when we aborted transactions. Fix this by making cleanup_ref_head_accounting global and calling it from the abort code, this fixes the issue where our accounting was all wrong after the fs aborts. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 5 + fs/btrfs/disk-io.c | 1 + fs/btrfs/extent-tree.c | 13 ++--- 3 files changed, 12 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 8ccc5019172b..709de7471d86 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -35,6 +35,7 @@ struct btrfs_trans_handle; struct btrfs_transaction; struct btrfs_pending_snapshot; +struct btrfs_delayed_ref_root; extern struct kmem_cache *btrfs_trans_handle_cachep; extern struct kmem_cache *btrfs_bit_radix_cachep; extern struct kmem_cache *btrfs_path_cachep; @@ -2643,6 +2644,10 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, unsigned long count); int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info, unsigned long count, u64 transid, int wait); +void +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info, + struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head); int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 len); int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info, u64 bytenr, diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 7d02748cf3f6..8e7926c91e35 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4223,6 +4223,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, if (pin_bytes) btrfs_pin_extent(fs_info, head->bytenr, head->num_bytes, 1); + btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head); btrfs_put_delayed_ref_head(head); cond_resched(); spin_lock(_refs->lock); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index e31980d451c2..a0f8880ee5dd 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2457,12 +2457,11 @@ static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans, return ret ? ret : 1; } -static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, - struct btrfs_delayed_ref_head *head) +void +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info, + struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head) { - struct btrfs_fs_info *fs_info = trans->fs_info; - struct btrfs_delayed_ref_root *delayed_refs = - >transaction->delayed_refs; int nr_items = 1; if (head->total_ref_mod < 0) { @@ -2540,7 +2539,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, } } - cleanup_ref_head_accounting(trans, head); + btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head); trace_run_delayed_ref_head(fs_info, head, 0); btrfs_delayed_ref_unlock(head); @@ -7218,7 +7217,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (head->must_insert_reserved) ret = 1; - cleanup_ref_head_accounting(trans, head); + btrfs_cleanup_ref_head_accounting(trans->fs_info, delayed_refs, head); mutex_unlock(>mutex); btrfs_put_delayed_ref_head(head); return ret; -- 2.14.3
[PATCH 2/7] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head
Instead of open coding this stuff use the helper instead. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index f062fb0487cd..7d02748cf3f6 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4215,12 +4215,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, if (head->must_insert_reserved) pin_bytes = true; btrfs_free_delayed_extent_op(head->extent_op); - delayed_refs->num_heads--; - if (head->processing == 0) - delayed_refs->num_heads_ready--; - atomic_dec(_refs->num_entries); - rb_erase_cached(>href_node, _refs->href_root); - RB_CLEAR_NODE(>href_node); + btrfs_delete_ref_head(delayed_refs, head); spin_unlock(>lock); spin_unlock(_refs->lock); mutex_unlock(>mutex); -- 2.14.3
[PATCH 1/7] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock
We have this open coded in btrfs_destroy_delayed_refs, use the helper instead. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 11 ++- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index beaa58e742b5..f062fb0487cd 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4197,16 +4197,9 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, head = rb_entry(node, struct btrfs_delayed_ref_head, href_node); - if (!mutex_trylock(>mutex)) { - refcount_inc(>refs); - spin_unlock(_refs->lock); - - mutex_lock(>mutex); - mutex_unlock(>mutex); - btrfs_put_delayed_ref_head(head); - spin_lock(_refs->lock); + if (btrfs_delayed_ref_lock(delayed_refs, head)) continue; - } + spin_lock(>lock); while ((n = rb_first_cached(>ref_tree)) != NULL) { ref = rb_entry(n, struct btrfs_delayed_ref_node, -- 2.14.3
[PATCH 3/8] btrfs: don't use global rsv for chunk allocation
We've done this forever because of the voodoo around knowing how much space we have. However we have better ways of doing this now, and on normal file systems we'll easily have a global reserve of 512MiB, and since metadata chunks are usually 1GiB that means we'll allocate metadata chunks more readily. Instead use the actual used amount when determining if we need to allocate a chunk or not. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 9 - 1 file changed, 9 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 7a30fbc05e5e..a91b3183dcae 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4388,21 +4388,12 @@ static inline u64 calc_global_rsv_need_space(struct btrfs_block_rsv *global) static int should_alloc_chunk(struct btrfs_fs_info *fs_info, struct btrfs_space_info *sinfo, int force) { - struct btrfs_block_rsv *global_rsv = _info->global_block_rsv; u64 bytes_used = btrfs_space_info_used(sinfo, false); u64 thresh; if (force == CHUNK_ALLOC_FORCE) return 1; - /* -* We need to take into account the global rsv because for all intents -* and purposes it's used space. Don't worry about locking the -* global_rsv, it doesn't change except when the transaction commits. -*/ - if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA) - bytes_used += calc_global_rsv_need_space(global_rsv); - /* * in limited mode, we want to have some free space up to * about 1% of the FS size. -- 2.14.3
[PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
With my change to no longer take into account the global reserve for metadata allocation chunks we have this side-effect for mixed block group fs'es where we are no longer allocating enough chunks for the data/metadata requirements. To deal with this add a ALLOC_CHUNK_FORCE step to the flushing state machine. This will only get used if we've already made a full loop through the flushing machinery and tried committing the transaction. If we have then we can try and force a chunk allocation since we likely need it to make progress. This resolves the issues I was seeing with the mixed bg tests in xfstests with my previous patch. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 3 ++- fs/btrfs/extent-tree.c | 18 +- include/trace/events/btrfs.h | 1 + 3 files changed, 20 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 0c6d589c8ce4..8ccc5019172b 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2750,7 +2750,8 @@ enum btrfs_flush_state { FLUSH_DELALLOC = 5, FLUSH_DELALLOC_WAIT = 6, ALLOC_CHUNK = 7, - COMMIT_TRANS= 8, + ALLOC_CHUNK_FORCE = 8, + COMMIT_TRANS= 9, }; int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index a91b3183dcae..e6bb6ce23c84 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4927,6 +4927,7 @@ static void flush_space(struct btrfs_fs_info *fs_info, btrfs_end_transaction(trans); break; case ALLOC_CHUNK: + case ALLOC_CHUNK_FORCE: trans = btrfs_join_transaction(root); if (IS_ERR(trans)) { ret = PTR_ERR(trans); @@ -4934,7 +4935,9 @@ static void flush_space(struct btrfs_fs_info *fs_info, } ret = do_chunk_alloc(trans, btrfs_metadata_alloc_profile(fs_info), -CHUNK_ALLOC_NO_FORCE); +(state == ALLOC_CHUNK) ? +CHUNK_ALLOC_NO_FORCE : +CHUNK_ALLOC_FORCE); btrfs_end_transaction(trans); if (ret > 0 || ret == -ENOSPC) ret = 0; @@ -5070,6 +5073,19 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work) commit_cycles--; } + /* +* We don't want to force a chunk allocation until we've tried +* pretty hard to reclaim space. Think of the case where we +* free'd up a bunch of space and so have a lot of pinned space +* to reclaim. We would rather use that than possibly create a +* underutilized metadata chunk. So if this is our first run +* through the flushing state machine skip ALLOC_CHUNK_FORCE and +* commit the transaction. If nothing has changed the next go +* around then we can force a chunk allocation. +*/ + if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles) + flush_state++; + if (flush_state > COMMIT_TRANS) { commit_cycles++; if (commit_cycles > 2) { diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index 63d1f9d8b8c7..dd0e6f8d6b6e 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush, { FLUSH_DELAYED_REFS_NR,"FLUSH_DELAYED_REFS_NR"}, \ { FLUSH_DELAYED_REFS, "FLUSH_ELAYED_REFS"}, \ { ALLOC_CHUNK, "ALLOC_CHUNK"}, \ + { ALLOC_CHUNK_FORCE,"ALLOC_CHUNK_FORCE"}, \ { COMMIT_TRANS, "COMMIT_TRANS"}) TRACE_EVENT(btrfs_flush_space, -- 2.14.3
[PATCH 5/8] btrfs: don't enospc all tickets on flush failure
With the introduction of the per-inode block_rsv it became possible to have really really large reservation requests made because of data fragmentation. Since the ticket stuff assumed that we'd always have relatively small reservation requests it just killed all tickets if we were unable to satisfy the current request. However this is generally not the case anymore. So fix this logic to instead see if we had a ticket that we were able to give some reservation to, and if we were continue the flushing loop again. Likewise we make the tickets use the space_info_add_old_bytes() method of returning what reservation they did receive in hopes that it could satisfy reservations down the line. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 45 + 1 file changed, 25 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index e6bb6ce23c84..983d086fa768 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4791,6 +4791,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim, } struct reserve_ticket { + u64 orig_bytes; u64 bytes; int error; struct list_head list; @@ -5012,7 +5013,7 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info, !test_bit(BTRFS_FS_STATE_REMOUNTING, _info->fs_state)); } -static void wake_all_tickets(struct list_head *head) +static bool wake_all_tickets(struct list_head *head) { struct reserve_ticket *ticket; @@ -5021,7 +5022,10 @@ static void wake_all_tickets(struct list_head *head) list_del_init(>list); ticket->error = -ENOSPC; wake_up(>wait); + if (ticket->bytes != ticket->orig_bytes) + return true; } + return false; } /* @@ -5089,8 +5093,12 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work) if (flush_state > COMMIT_TRANS) { commit_cycles++; if (commit_cycles > 2) { - wake_all_tickets(_info->tickets); - space_info->flush = 0; + if (wake_all_tickets(_info->tickets)) { + flush_state = FLUSH_DELAYED_ITEMS_NR; + commit_cycles--; + } else { + space_info->flush = 0; + } } else { flush_state = FLUSH_DELAYED_ITEMS_NR; } @@ -5142,10 +5150,11 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, struct btrfs_space_info *space_info, - struct reserve_ticket *ticket, u64 orig_bytes) + struct reserve_ticket *ticket) { DEFINE_WAIT(wait); + u64 reclaim_bytes = 0; int ret = 0; spin_lock(_info->lock); @@ -5166,14 +5175,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, ret = ticket->error; if (!list_empty(>list)) list_del_init(>list); - if (ticket->bytes && ticket->bytes < orig_bytes) { - u64 num_bytes = orig_bytes - ticket->bytes; - update_bytes_may_use(space_info, -num_bytes); - trace_btrfs_space_reservation(fs_info, "space_info", - space_info->flags, num_bytes, 0); - } + if (ticket->bytes && ticket->bytes < ticket->orig_bytes) + reclaim_bytes = ticket->orig_bytes - ticket->bytes; spin_unlock(_info->lock); + if (reclaim_bytes) + space_info_add_old_bytes(fs_info, space_info, reclaim_bytes); return ret; } @@ -5199,6 +5206,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, { struct reserve_ticket ticket; u64 used; + u64 reclaim_bytes = 0; int ret = 0; ASSERT(orig_bytes); @@ -5234,6 +5242,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, * the list and we will do our own flushing further down. */ if (ret && flush != BTRFS_RESERVE_NO_FLUSH) { + ticket.orig_bytes = orig_bytes; ticket.bytes = orig_bytes; ticket.error = 0; init_waitqueue_head(); @@ -5274,25 +5283,21 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, return ret; if (flush == BTRFS_RESERVE_FLUSH_ALL) - return wait_reserve_ticket(fs_info, space_info, , - orig_bytes); + return wait_reserve_ticket(fs_info,
[PATCH 7/8] btrfs: be more explicit about allowed flush states
For FLUSH_LIMIT flushers we really can only allocate chunks and flush delayed inode items, everything else is problematic. I added a bunch of new states and it lead to weirdness in the FLUSH_LIMIT case because I forgot about how it worked. So instead explicitly declare the states that are ok for flushing with FLUSH_LIMIT and use that for our state machine. Then as we add new things that are safe we can just add them to this list. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 21 ++--- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 0e9ba77e5316..e31980d451c2 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5112,12 +5112,18 @@ void btrfs_init_async_reclaim_work(struct work_struct *work) INIT_WORK(work, btrfs_async_reclaim_metadata_space); } +static const enum btrfs_flush_state priority_flush_states[] = { + FLUSH_DELAYED_ITEMS_NR, + FLUSH_DELAYED_ITEMS, + ALLOC_CHUNK, +}; + static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, struct btrfs_space_info *space_info, struct reserve_ticket *ticket) { u64 to_reclaim; - int flush_state = FLUSH_DELAYED_ITEMS_NR; + int flush_state = 0; spin_lock(_info->lock); to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info, @@ -5129,7 +5135,8 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, spin_unlock(_info->lock); do { - flush_space(fs_info, space_info, to_reclaim, flush_state); + flush_space(fs_info, space_info, to_reclaim, + priority_flush_states[flush_state]); flush_state++; spin_lock(_info->lock); if (ticket->bytes == 0) { @@ -5137,15 +5144,7 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, return; } spin_unlock(_info->lock); - - /* -* Priority flushers can't wait on delalloc without -* deadlocking. -*/ - if (flush_state == FLUSH_DELALLOC || - flush_state == FLUSH_DELALLOC_WAIT) - flush_state = ALLOC_CHUNK; - } while (flush_state < COMMIT_TRANS); + } while (flush_state < ARRAY_SIZE(priority_flush_states)); } static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, -- 2.14.3
[PATCH 0/8] Enospc cleanups and fixes
The delayed refs rsv patches exposed a bunch of issues in our enospc infrastructure that needed to be addressed. These aren't really one coherent group, but they are all around flushing and reservations. may_commit_transaction() needed to be updated a little bit, and we needed to add a new state to force chunk allocation if things got dicey. Also because we can end up needed to reserve a whole bunch of extra space for outstanding delayed refs we needed to add the ability to only ENOSPC tickets that were too big to satisfy, instead of failing all of the tickets. There's also a fix in here for one of the corner cases where we didn't quite have enough space reserved for the delayed refs we were generating during evict(). Thanks, Josef
[PATCH 6/8] btrfs: loop in inode_rsv_refill
With severe fragmentation we can end up with our inode rsv size being huge during writeout, which would cause us to need to make very large metadata reservations. However we may not actually need that much once writeout is complete. So instead try to make our reservation, and if we couldn't make it re-calculate our new reservation size and try again. If our reservation size doesn't change between tries then we know we are actually out of space and can error out. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 56 -- 1 file changed, 41 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 983d086fa768..0e9ba77e5316 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5776,6 +5776,21 @@ int btrfs_block_rsv_refill(struct btrfs_root *root, return ret; } +static inline void __get_refill_bytes(struct btrfs_block_rsv *block_rsv, + u64 *metadata_bytes, u64 *qgroup_bytes) +{ + *metadata_bytes = 0; + *qgroup_bytes = 0; + + spin_lock(_rsv->lock); + if (block_rsv->reserved < block_rsv->size) + *metadata_bytes = block_rsv->size - block_rsv->reserved; + if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size) + *qgroup_bytes = block_rsv->qgroup_rsv_size - + block_rsv->qgroup_rsv_reserved; + spin_unlock(_rsv->lock); +} + /** * btrfs_inode_rsv_refill - refill the inode block rsv. * @inode - the inode we are refilling. @@ -5791,25 +5806,37 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode, { struct btrfs_root *root = inode->root; struct btrfs_block_rsv *block_rsv = >block_rsv; - u64 num_bytes = 0; + u64 num_bytes = 0, last = 0; u64 qgroup_num_bytes = 0; int ret = -ENOSPC; - spin_lock(_rsv->lock); - if (block_rsv->reserved < block_rsv->size) - num_bytes = block_rsv->size - block_rsv->reserved; - if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size) - qgroup_num_bytes = block_rsv->qgroup_rsv_size - - block_rsv->qgroup_rsv_reserved; - spin_unlock(_rsv->lock); - + __get_refill_bytes(block_rsv, _bytes, _num_bytes); if (num_bytes == 0) return 0; - ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true); - if (ret) - return ret; - ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); + do { + ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true); + if (ret) + return ret; + ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); + if (ret) { + btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes); + last = num_bytes; + /* +* If we are fragmented we can end up with a lot of +* outstanding extents which will make our size be much +* larger than our reserved amount. If we happen to +* try to do a reservation here that may result in us +* trying to do a pretty hefty reservation, which we may +* not need once delalloc flushing happens. If this is +* the case try and do the reserve again. +*/ + if (flush == BTRFS_RESERVE_FLUSH_ALL) + __get_refill_bytes(block_rsv, _bytes, + _num_bytes); + } + } while (ret && last != num_bytes); + if (!ret) { block_rsv_add_bytes(block_rsv, num_bytes, false); trace_btrfs_space_reservation(root->fs_info, "delalloc", @@ -5819,8 +5846,7 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode, spin_lock(_rsv->lock); block_rsv->qgroup_rsv_reserved += qgroup_num_bytes; spin_unlock(_rsv->lock); - } else - btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes); + } return ret; } -- 2.14.3
[PATCH 2/8] btrfs: dump block_rsv whe dumping space info
For enospc_debug having the block rsvs is super helpful to see if we've done something wrong. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval Reviewed-by: David Sterba --- fs/btrfs/extent-tree.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 0dca250dc02e..7a30fbc05e5e 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -8052,6 +8052,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, return ret; } +#define DUMP_BLOCK_RSV(fs_info, rsv_name) \ +do { \ + struct btrfs_block_rsv *__rsv = &(fs_info)->rsv_name; \ + spin_lock(&__rsv->lock);\ + btrfs_info(fs_info, #rsv_name ": size %llu reserved %llu", \ + __rsv->size, __rsv->reserved); \ + spin_unlock(&__rsv->lock); \ +} while (0) + static void dump_space_info(struct btrfs_fs_info *fs_info, struct btrfs_space_info *info, u64 bytes, int dump_block_groups) @@ -8071,6 +8080,12 @@ static void dump_space_info(struct btrfs_fs_info *fs_info, info->bytes_readonly); spin_unlock(>lock); + DUMP_BLOCK_RSV(fs_info, global_block_rsv); + DUMP_BLOCK_RSV(fs_info, trans_block_rsv); + DUMP_BLOCK_RSV(fs_info, chunk_block_rsv); + DUMP_BLOCK_RSV(fs_info, delayed_block_rsv); + DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv); + if (!dump_block_groups) return; -- 2.14.3
[PATCH 8/8] btrfs: reserve extra space during evict()
We could generate a lot of delayed refs in evict but never have any left over space from our block rsv to make up for that fact. So reserve some extra space and give it to the transaction so it can be used to refill the delayed refs rsv every loop through the truncate path. Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index cae30f6c095f..3da9ac463344 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5258,13 +5258,15 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, { struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_block_rsv *global_rsv = _info->global_block_rsv; + u64 delayed_refs_extra = btrfs_calc_trans_metadata_size(fs_info, 1); int failures = 0; for (;;) { struct btrfs_trans_handle *trans; int ret; - ret = btrfs_block_rsv_refill(root, rsv, rsv->size, + ret = btrfs_block_rsv_refill(root, rsv, +rsv->size + delayed_refs_extra, BTRFS_RESERVE_FLUSH_LIMIT); if (ret && ++failures > 2) { @@ -5273,9 +5275,28 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, return ERR_PTR(-ENOSPC); } + /* +* Evict can generate a large amount of delayed refs without +* having a way to add space back since we exhaust our temporary +* block rsv. We aren't allowed to do FLUSH_ALL in this case +* because we could deadlock with so many things in the flushing +* code, so we have to try and hold some extra space to +* compensate for our delayed ref generation. If we can't get +* that space then we need see if we can steal our minimum from +* the global reserve. We will be ratelimited by the amount of +* space we have for the delayed refs rsv, so we'll end up +* committing and trying again. +*/ trans = btrfs_join_transaction(root); - if (IS_ERR(trans) || !ret) + if (IS_ERR(trans) || !ret) { + if (!IS_ERR(trans)) { + trans->block_rsv = _info->trans_block_rsv; + trans->bytes_reserved = delayed_refs_extra; + btrfs_block_rsv_migrate(rsv, trans->block_rsv, + delayed_refs_extra, 1); + } return trans; + } /* * Try to steal from the global reserve if there is space for -- 2.14.3
[PATCH 1/8] btrfs: check if free bgs for commit
may_commit_transaction will skip committing the transaction if we don't have enough pinned space or if we're trying to find space for a SYSTEM chunk. However if we have pending free block groups in this transaction we still want to commit as we may be able to allocate a chunk to make our reservation. So instead of just returning ENOSPC, check if we have free block groups pending, and if so commit the transaction to allow us to use that free space. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval --- fs/btrfs/extent-tree.c | 33 +++-- 1 file changed, 19 insertions(+), 14 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 8af68b13fa27..0dca250dc02e 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4843,10 +4843,18 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, if (!bytes) return 0; - /* See if there is enough pinned space to make this reservation */ - if (__percpu_counter_compare(_info->total_bytes_pinned, - bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0) + trans = btrfs_join_transaction(fs_info->extent_root); + if (IS_ERR(trans)) + return PTR_ERR(trans); + + /* +* See if there is enough pinned space to make this reservation, or if +* we have bg's that are going to be freed, allowing us to possibly do a +* chunk allocation the next loop through. +*/ + if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, >transaction->flags) || + __percpu_counter_compare(_info->total_bytes_pinned, bytes, +BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0) goto commit; /* @@ -4854,7 +4862,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, * this reservation. */ if (space_info != delayed_rsv->space_info) - return -ENOSPC; + goto enospc; spin_lock(_rsv->lock); reclaim_bytes += delayed_rsv->reserved; @@ -4868,17 +4876,14 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, bytes -= reclaim_bytes; if (__percpu_counter_compare(_info->total_bytes_pinned, - bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) { - return -ENOSPC; - } - +bytes, +BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) + goto enospc; commit: - trans = btrfs_join_transaction(fs_info->extent_root); - if (IS_ERR(trans)) - return -ENOSPC; - return btrfs_commit_transaction(trans); +enospc: + btrfs_end_transaction(trans); + return -ENOSPC; } /* -- 2.14.3
[PATCH 6/6] btrfs: fix truncate throttling
We have a bunch of magic to make sure we're throttling delayed refs when truncating a file. Now that we have a delayed refs rsv and a mechanism for refilling that reserve simply use that instead of all of this magic. Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 79 1 file changed, 17 insertions(+), 62 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 8532a2eb56d1..cae30f6c095f 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4437,31 +4437,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry) return err; } -static int truncate_space_check(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - u64 bytes_deleted) -{ - struct btrfs_fs_info *fs_info = root->fs_info; - int ret; - - /* -* This is only used to apply pressure to the enospc system, we don't -* intend to use this reservation at all. -*/ - bytes_deleted = btrfs_csum_bytes_to_leaves(fs_info, bytes_deleted); - bytes_deleted *= fs_info->nodesize; - ret = btrfs_block_rsv_add(root, _info->trans_block_rsv, - bytes_deleted, BTRFS_RESERVE_NO_FLUSH); - if (!ret) { - trace_btrfs_space_reservation(fs_info, "transaction", - trans->transid, - bytes_deleted, 1); - trans->bytes_reserved += bytes_deleted; - } - return ret; - -} - /* * Return this if we need to call truncate_block for the last bit of the * truncate. @@ -4506,7 +4481,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, u64 bytes_deleted = 0; bool be_nice = false; bool should_throttle = false; - bool should_end = false; BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY); @@ -4719,15 +4693,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, btrfs_abort_transaction(trans, ret); break; } - if (btrfs_should_throttle_delayed_refs(trans)) - btrfs_async_run_delayed_refs(fs_info, - trans->delayed_ref_updates * 2, - trans->transid, 0); if (be_nice) { - if (truncate_space_check(trans, root, -extent_num_bytes)) { - should_end = true; - } if (btrfs_should_throttle_delayed_refs(trans)) should_throttle = true; } @@ -4738,7 +4704,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, if (path->slots[0] == 0 || path->slots[0] != pending_del_slot || - should_throttle || should_end) { + should_throttle) { if (pending_del_nr) { ret = btrfs_del_items(trans, root, path, pending_del_slot, @@ -4750,23 +4716,24 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, pending_del_nr = 0; } btrfs_release_path(path); - if (should_throttle) { - unsigned long updates = trans->delayed_ref_updates; - if (updates) { - trans->delayed_ref_updates = 0; - ret = btrfs_run_delayed_refs(trans, - updates * 2); - if (ret) - break; - } - } + /* -* if we failed to refill our space rsv, bail out -* and let the transaction restart +* We can generate a lot of delayed refs, so we need to +* throttle every once and a while and make sure we're +* adding enough space to keep up with the work we are +* generating. Since we hold a transaction here we +* can't flush, and we don't want to FLUSH_LIMIT because +* we could have generated too many delayed refs to +* actually allocate, so just bail if we're short and +* let the normal reservation dance happen higher up.
[PATCH 5/6] btrfs: introduce delayed_refs_rsv
From: Josef Bacik Traditionally we've had voodoo in btrfs to account for the space that delayed refs may take up by having a global_block_rsv. This works most of the time, except when it doesn't. We've had issues reported and seen in production where sometimes the global reserve is exhausted during transaction commit before we can run all of our delayed refs, resulting in an aborted transaction. Because of this voodoo we have equally dubious flushing semantics around throttling delayed refs which we often get wrong. So instead give them their own block_rsv. This way we can always know exactly how much outstanding space we need for delayed refs. This allows us to make sure we are constantly filling that reservation up with space, and allows us to put more precise pressure on the enospc system. Instead of doing math to see if its a good time to throttle, the normal enospc code will be invoked if we have a lot of delayed refs pending, and they will be run via the normal flushing mechanism. For now the delayed_refs_rsv will hold the reservations for the delayed refs, the block group updates, and deleting csums. We could have a separate rsv for the block group updates, but the csum deletion stuff is still handled via the delayed_refs so that will stay there. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 26 ++-- fs/btrfs/delayed-ref.c | 28 - fs/btrfs/disk-io.c | 4 + fs/btrfs/extent-tree.c | 279 +++ fs/btrfs/inode.c | 4 +- fs/btrfs/transaction.c | 77 ++-- include/trace/events/btrfs.h | 2 + 7 files changed, 313 insertions(+), 107 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 8b41ec42f405..0c6d589c8ce4 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -448,8 +448,9 @@ struct btrfs_space_info { #defineBTRFS_BLOCK_RSV_TRANS 3 #defineBTRFS_BLOCK_RSV_CHUNK 4 #defineBTRFS_BLOCK_RSV_DELOPS 5 -#defineBTRFS_BLOCK_RSV_EMPTY 6 -#defineBTRFS_BLOCK_RSV_TEMP7 +#define BTRFS_BLOCK_RSV_DELREFS6 +#defineBTRFS_BLOCK_RSV_EMPTY 7 +#defineBTRFS_BLOCK_RSV_TEMP8 struct btrfs_block_rsv { u64 size; @@ -812,6 +813,8 @@ struct btrfs_fs_info { struct btrfs_block_rsv chunk_block_rsv; /* block reservation for delayed operations */ struct btrfs_block_rsv delayed_block_rsv; + /* block reservation for delayed refs */ + struct btrfs_block_rsv delayed_refs_rsv; struct btrfs_block_rsv empty_block_rsv; @@ -2628,7 +2631,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct btrfs_fs_info *fs_info, } int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans); -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans); +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info); void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info, const u64 start); void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg); @@ -2742,10 +2745,12 @@ enum btrfs_reserve_flush_enum { enum btrfs_flush_state { FLUSH_DELAYED_ITEMS_NR = 1, FLUSH_DELAYED_ITEMS = 2, - FLUSH_DELALLOC = 3, - FLUSH_DELALLOC_WAIT = 4, - ALLOC_CHUNK = 5, - COMMIT_TRANS= 6, + FLUSH_DELAYED_REFS_NR = 3, + FLUSH_DELAYED_REFS = 4, + FLUSH_DELALLOC = 5, + FLUSH_DELALLOC_WAIT = 6, + ALLOC_CHUNK = 7, + COMMIT_TRANS= 8, }; int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); @@ -2796,6 +2801,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info, void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *block_rsv, u64 num_bytes); +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr); +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans); +int btrfs_throttle_delayed_refs(struct btrfs_fs_info *fs_info, + enum btrfs_reserve_flush_enum flush); +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info, + struct btrfs_block_rsv *src, + u64 num_bytes); int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache); void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache); void btrfs_put_block_group_cache(struct btrfs_fs_info *info); diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 48725fa757a3..96de92588f06 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -474,11
[PATCH 0/6] Delayed refs rsv
This patchset changes how we do space reservations for delayed refs. We were hitting probably 20-40 enospc abort's per day in production while running delayed refs at transaction commit time. This means we ran out of space in the global reserve and couldn't easily get more space in use_block_rsv(). The global reserve has grown to cover everything we don't reserve space explicitly for, and we've grown a lot of weird ad-hoc hueristics to know if we're running short on space and when it's time to force a commit. A failure rate of 20-40 file systems when we run hundreds of thousands of them isn't super high, but cleaning up this code will make things less ugly and more predictible. Thus the delayed refs rsv. We always know how many delayed refs we have outstanding, and although running them generates more we can use the global reserve for that spill over, which fits better into it's desired use than a full blown reservation. This first approach is to simply take how many times we're reserving space for and multiply that by 2 in order to save enough space for the delayed refs that could be generated. This is a niave approach and will probably evolve, but for now it works. With this patchset we've gone down to 2-8 failures per week. It's not perfect, there are some corner cases that still need to be addressed, but is significantly better than what we had. Thanks, Josef
[PATCH 4/6] btrfs: only track ref_heads in delayed_ref_updates
From: Josef Bacik We use this number to figure out how many delayed refs to run, but __btrfs_run_delayed_refs really only checks every time we need a new delayed ref head, so we always run at least one ref head completely no matter what the number of items on it. Fix the accounting to only be adjusted when we add/remove a ref head. Signed-off-by: Josef Bacik --- fs/btrfs/delayed-ref.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index b3e4c9fcb664..48725fa757a3 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans, ref->in_tree = 0; btrfs_put_delayed_ref(ref); atomic_dec(_refs->num_entries); - if (trans->delayed_ref_updates) - trans->delayed_ref_updates--; } static bool merge_ref(struct btrfs_trans_handle *trans, @@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans, if (ref->action == BTRFS_ADD_DELAYED_REF) list_add_tail(>add_list, >ref_add_list); atomic_inc(>num_entries); - trans->delayed_ref_updates++; spin_unlock(>lock); return ret; } -- 2.14.3
[PATCH 2/6] btrfs: add cleanup_ref_head_accounting helper
From: Josef Bacik We were missing some quota cleanups in check_ref_cleanup, so break the ref head accounting cleanup into a helper and call that from both check_ref_cleanup and cleanup_ref_head. This will hopefully ensure that we don't screw up accounting in the future for other things that we add. Reviewed-by: Omar Sandoval Reviewed-by: Liu Bo Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 67 +- 1 file changed, 39 insertions(+), 28 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index c36b3a42f2bb..e3ed3507018d 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans, return ret ? ret : 1; } +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, + struct btrfs_delayed_ref_head *head) +{ + struct btrfs_fs_info *fs_info = trans->fs_info; + struct btrfs_delayed_ref_root *delayed_refs = + >transaction->delayed_refs; + + if (head->total_ref_mod < 0) { + struct btrfs_space_info *space_info; + u64 flags; + + if (head->is_data) + flags = BTRFS_BLOCK_GROUP_DATA; + else if (head->is_system) + flags = BTRFS_BLOCK_GROUP_SYSTEM; + else + flags = BTRFS_BLOCK_GROUP_METADATA; + space_info = __find_space_info(fs_info, flags); + ASSERT(space_info); + percpu_counter_add_batch(_info->total_bytes_pinned, + -head->num_bytes, + BTRFS_TOTAL_BYTES_PINNED_BATCH); + + if (head->is_data) { + spin_lock(_refs->lock); + delayed_refs->pending_csums -= head->num_bytes; + spin_unlock(_refs->lock); + } + } + + /* Also free its reserved qgroup space */ + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, + head->qgroup_reserved); +} + static int cleanup_ref_head(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head *head) { @@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, spin_unlock(>lock); spin_unlock(_refs->lock); - trace_run_delayed_ref_head(fs_info, head, 0); - - if (head->total_ref_mod < 0) { - struct btrfs_space_info *space_info; - u64 flags; - - if (head->is_data) - flags = BTRFS_BLOCK_GROUP_DATA; - else if (head->is_system) - flags = BTRFS_BLOCK_GROUP_SYSTEM; - else - flags = BTRFS_BLOCK_GROUP_METADATA; - space_info = __find_space_info(fs_info, flags); - ASSERT(space_info); - percpu_counter_add_batch(_info->total_bytes_pinned, - -head->num_bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH); - - if (head->is_data) { - spin_lock(_refs->lock); - delayed_refs->pending_csums -= head->num_bytes; - spin_unlock(_refs->lock); - } - } - if (head->must_insert_reserved) { btrfs_pin_extent(fs_info, head->bytenr, head->num_bytes, 1); @@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, } } - /* Also free its reserved qgroup space */ - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, - head->qgroup_reserved); + cleanup_ref_head_accounting(trans, head); + + trace_run_delayed_ref_head(fs_info, head, 0); btrfs_delayed_ref_unlock(head); btrfs_put_delayed_ref_head(head); return 0; @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (head->must_insert_reserved) ret = 1; + cleanup_ref_head_accounting(trans, head); mutex_unlock(>mutex); btrfs_put_delayed_ref_head(head); return ret; -- 2.14.3
[PATCH 3/6] btrfs: cleanup extent_op handling
From: Josef Bacik The cleanup_extent_op function actually would run the extent_op if it needed running, which made the name sort of a misnomer. Change it to run_and_cleanup_extent_op, and move the actual cleanup work to cleanup_extent_op so it can be used by check_ref_cleanup() in order to unify the extent op handling. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 36 +++- 1 file changed, 23 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index e3ed3507018d..8a776dc9cb38 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2424,19 +2424,33 @@ static void unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_ref btrfs_delayed_ref_unlock(head); } -static int cleanup_extent_op(struct btrfs_trans_handle *trans, -struct btrfs_delayed_ref_head *head) +static struct btrfs_delayed_extent_op * +cleanup_extent_op(struct btrfs_trans_handle *trans, + struct btrfs_delayed_ref_head *head) { struct btrfs_delayed_extent_op *extent_op = head->extent_op; - int ret; if (!extent_op) - return 0; - head->extent_op = NULL; + return NULL; + if (head->must_insert_reserved) { + head->extent_op = NULL; btrfs_free_delayed_extent_op(extent_op); - return 0; + return NULL; } + return extent_op; +} + +static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans, +struct btrfs_delayed_ref_head *head) +{ + struct btrfs_delayed_extent_op *extent_op = + cleanup_extent_op(trans, head); + int ret; + + if (!extent_op) + return 0; + head->extent_op = NULL; spin_unlock(>lock); ret = run_delayed_extent_op(trans, head, extent_op); btrfs_free_delayed_extent_op(extent_op); @@ -2488,7 +2502,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, delayed_refs = >transaction->delayed_refs; - ret = cleanup_extent_op(trans, head); + ret = run_and_cleanup_extent_op(trans, head); if (ret < 0) { unselect_delayed_ref_head(delayed_refs, head); btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret); @@ -6977,12 +6991,8 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (!RB_EMPTY_ROOT(>ref_tree.rb_root)) goto out; - if (head->extent_op) { - if (!head->must_insert_reserved) - goto out; - btrfs_free_delayed_extent_op(head->extent_op); - head->extent_op = NULL; - } + if (cleanup_extent_op(trans, head) != NULL) + goto out; /* * waiting for the lock here would deadlock. If someone else has it -- 2.14.3
[PATCH 1/6] btrfs: add btrfs_delete_ref_head helper
From: Josef Bacik We do this dance in cleanup_ref_head and check_ref_cleanup, unify it into a helper and cleanup the calling functions. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval --- fs/btrfs/delayed-ref.c | 14 ++ fs/btrfs/delayed-ref.h | 3 ++- fs/btrfs/extent-tree.c | 22 +++--- 3 files changed, 19 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 9301b3ad9217..b3e4c9fcb664 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head( return head; } +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head) +{ + lockdep_assert_held(_refs->lock); + lockdep_assert_held(>lock); + + rb_erase_cached(>href_node, _refs->href_root); + RB_CLEAR_NODE(>href_node); + atomic_dec(_refs->num_entries); + delayed_refs->num_heads--; + if (head->processing == 0) + delayed_refs->num_heads_ready--; +} + /* * Helper to insert the ref_node to the tail or merge with tail. * diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index 8e20c5cb5404..d2af974f68a1 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head) { mutex_unlock(>mutex); } - +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head); struct btrfs_delayed_ref_head *btrfs_select_ref_head( struct btrfs_delayed_ref_root *delayed_refs); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index d242a1174e50..c36b3a42f2bb 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, spin_unlock(_refs->lock); return 1; } - delayed_refs->num_heads--; - rb_erase_cached(>href_node, _refs->href_root); - RB_CLEAR_NODE(>href_node); + btrfs_delete_ref_head(delayed_refs, head); spin_unlock(>lock); spin_unlock(_refs->lock); - atomic_dec(_refs->num_entries); trace_run_delayed_ref_head(fs_info, head, 0); @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (!mutex_trylock(>mutex)) goto out; - /* -* at this point we have a head with no other entries. Go -* ahead and process it. -*/ - rb_erase_cached(>href_node, _refs->href_root); - RB_CLEAR_NODE(>href_node); - atomic_dec(_refs->num_entries); - - /* -* we don't take a ref on the node because we're removing it from the -* tree, so we just steal the ref the tree was holding. -*/ - delayed_refs->num_heads--; - if (head->processing == 0) - delayed_refs->num_heads_ready--; + btrfs_delete_ref_head(delayed_refs, head); head->processing = 0; + spin_unlock(>lock); spin_unlock(_refs->lock); -- 2.14.3
Re: [PATCH] Btrfs: fix access to available allocation bits when starting balance
On Mon, Nov 19, 2018 at 09:48:12AM +, fdman...@kernel.org wrote: > From: Filipe Manana > > The available allocation bits members from struct btrfs_fs_info are > protected by a sequence lock, and when starting balance we access them > incorrectly in two different ways: > > 1) In the read sequence lock loop at btrfs_balance() we use the values we >read from fs_info->avail_*_alloc_bits and we can immediately do actions >that have side effects and can not be undone (printing a message and >jumping to a label). This is wrong because a retry might be needed, so >our actions must not have side effects and must be repeatable as long >as read_seqretry() returns a non-zero value. In other words, we were >essentially ignoring the sequence lock; > > 2) Right below the read sequence lock loop, we were reading the values >from avail_metadata_alloc_bits and avail_data_alloc_bits without any >protection from concurrent writers, that is, reading them outside of >the read sequence lock critical section. > > So fix this by making sure we only read the available allocation bits > while in a read sequence lock critical section and that what we do in the > critical section is repeatable (has nothing that can not be undone) so > that any eventual retry that is needed is handled properly. > > Fixes: de98ced9e743 ("Btrfs: use seqlock to protect fs_info->avail_{data, > metadata, system}_alloc_bits") > Fixes: 14506127979a ("btrfs: fix a bogus warning when converting only data or > metadata") > Signed-off-by: Filipe Manana Added to misc-next, thanks.
[PATCH RESEND 1/2] btrfs-progs: fix kernel version parsing on some versions past 3.0
The code fails if the third section is missing (like "4.18") or is followed by anything but "." or "-". This happens for example if we're not exactly at a tag and CONFIG_LOCALVERSION_AUTO=n (which results in "4.18.5+"). Signed-off-by: Adam Borowski --- fsfeatures.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/fsfeatures.c b/fsfeatures.c index 7d85d60f..66111bf4 100644 --- a/fsfeatures.c +++ b/fsfeatures.c @@ -216,11 +216,8 @@ u32 get_running_kernel_version(void) return (u32)-1; version |= atoi(tmp) << 8; tmp = strtok_r(NULL, ".", ); - if (tmp) { - if (!string_is_numerical(tmp)) - return (u32)-1; + if (tmp && string_is_numerical(tmp)) version |= atoi(tmp); - } return version; } -- 2.19.1
[PATCH RESEND-v3 2/2] btrfs-progs: defrag: open files RO on new enough kernels
Defragging an executable conflicts both way with it being run, resulting in ETXTBSY. This either makes defrag fail or prevents the program from being executed. Kernels 4.19-rc1 and later allow defragging files you could have possibly opened rw, even if the passed descriptor is ro (commit 616d374efa23). Signed-off-by: Adam Borowski --- v2: more eloquent description; root can't defrag RO on old kernels (unlike dedupe) v3: more eloquentier description; s/defrag_ro/defrag_open_mode/ cmds-filesystem.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/cmds-filesystem.c b/cmds-filesystem.c index d1af21ee..c67bf5da 100644 --- a/cmds-filesystem.c +++ b/cmds-filesystem.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #include @@ -39,12 +40,14 @@ #include "list_sort.h" #include "disk-io.h" #include "help.h" +#include "fsfeatures.h" /* * for btrfs fi show, we maintain a hash of fsids we've already printed. * This way we don't print dups if a given FS is mounted more than once. */ static struct seen_fsid *seen_fsid_hash[SEEN_FSID_HASH_SIZE] = {NULL,}; +static mode_t defrag_open_mode = O_RDONLY; static const char * const filesystem_cmd_group_usage[] = { "btrfs filesystem [] []", @@ -878,7 +881,7 @@ static int defrag_callback(const char *fpath, const struct stat *sb, if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) { if (defrag_global_verbose) printf("%s\n", fpath); - fd = open(fpath, O_RDWR); + fd = open(fpath, defrag_open_mode); if (fd < 0) { goto error; } @@ -915,6 +918,9 @@ static int cmd_filesystem_defrag(int argc, char **argv) int compress_type = BTRFS_COMPRESS_NONE; DIR *dirstream; + if (get_running_kernel_version() < KERNEL_VERSION(4,19,0)) + defrag_open_mode = O_RDWR; + /* * Kernel has a different default (256K) that is supposed to be safe, * but it does not defragment very well. The 32M will likely lead to @@ -1015,7 +1021,7 @@ static int cmd_filesystem_defrag(int argc, char **argv) int defrag_err = 0; dirstream = NULL; - fd = open_file_or_dir(argv[i], ); + fd = open_file_or_dir3(argv[i], , defrag_open_mode); if (fd < 0) { error("cannot open %s: %m", argv[i]); ret = -errno; -- 2.19.1
Re: [PATCH] Btrfs: fix race between enabling quotas and subvolume creation
On Mon, Nov 19, 2018 at 04:20:34PM +, fdman...@kernel.org wrote: > From: Filipe Manana > > We have a race between enabling quotas end subvolume creation that cause > subvolume creation to fail with -EINVAL, and the following diagram shows > how it happens: > > CPU 0 CPU 1 > > btrfs_ioctl() > btrfs_ioctl_quota_ctl() >btrfs_quota_enable() > mutex_lock(fs_info->qgroup_ioctl_lock) > > btrfs_ioctl() >create_subvol() > btrfs_qgroup_inherit() > -> save > fs_info->quota_root > into quota_root > -> stores a NULL value > -> tries to lock the > mutex > qgroup_ioctl_lock > -> blocks waiting for >the task at CPU0 > >-> sets BTRFS_FS_QUOTA_ENABLED in fs_info >-> sets quota_root in fs_info->quota_root > (non-NULL value) > >mutex_unlock(fs_info->qgroup_ioctl_lock) > > -> checks quota enabled > flag is set > -> returns -EINVAL > because > fs_info->quota_root > was > NULL before it > acquired > the mutex > qgroup_ioctl_lock >-> ioctl returns -EINVAL > > Returning -EINVAL to user space will be confusing if all the arguments > passed to the subvolume creation ioctl were valid. > > Fix it by grabbing the value from fs_info->quota_root after acquiring > the mutex. > > Signed-off-by: Filipe Manana Reviewed-by: David Sterba Added to misc-next, thanks.
[PATCHv3] btrfs: Fix error handling in btrfs_cleanup_ordered_extents
Running btrfs/124 in a loop hung up on me sporadically with the following call trace: btrfs D0 5760 5324 0x Call Trace: ? __schedule+0x243/0x800 schedule+0x33/0x90 btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs] ? wait_woken+0xa0/0xa0 btrfs_wait_ordered_range+0xbb/0x100 [btrfs] btrfs_relocate_block_group+0x1ff/0x230 [btrfs] btrfs_relocate_chunk+0x49/0x100 [btrfs] btrfs_balance+0xbeb/0x1740 [btrfs] btrfs_ioctl_balance+0x2ee/0x380 [btrfs] btrfs_ioctl+0x1691/0x3110 [btrfs] ? lockdep_hardirqs_on+0xed/0x180 ? __handle_mm_fault+0x8e7/0xfb0 ? _raw_spin_unlock+0x24/0x30 ? __handle_mm_fault+0x8e7/0xfb0 ? do_vfs_ioctl+0xa5/0x6e0 ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] do_vfs_ioctl+0xa5/0x6e0 ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe ksys_ioctl+0x3a/0x70 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x60/0x1b0 entry_SYSCALL_64_after_hwframe+0x49/0xbe This happens because during page writeback it's valid for writepage_delalloc to instantiate a delalloc range which doesn't belong to the page currently being written back. The reason this case is valid is due to find_lock_delalloc_range returning any available range after the passed delalloc_start and ignorting whether the page under writeback is within that range. In turn ordered extents (OE) are always created for the returned range from find_lock_delalloc_range. If, however, a failure occurs while OE are being created then the clean up code in btrfs_cleanup_ordered_extents will be called. Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider the case of such 'foreign' range being processed and instead it always assumes that the range OE are created for belongs to the page. This leads to the first page of such foregin range to not be cleaned up since it's deliberately missed skipped by the current cleaning up code. Fix this by correctly checking whether the current page belongs to the range being instantiated and if so adjsut the range parameters passed for cleaning up. If it doesn't, then just clean the whole OE range directly. Signed-off-by: Nikolay Borisov Reviewed-by: Josef Bacik --- V3: * Re-worded the commit for easier comprehension * Added RB tag from Josef V2: * Fix compilation failure due to missing parentheses * Fixed the "Fixes" tag. fs/btrfs/inode.c | 29 - 1 file changed, 20 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index a0fc564626df..9f65f131f244 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -110,17 +110,17 @@ static void __endio_write_update_ordered(struct inode *inode, * extent_clear_unlock_delalloc() to clear both the bits EXTENT_DO_ACCOUNTING * and EXTENT_DELALLOC simultaneously, because that causes the reserved metadata * to be released, which we want to happen only when finishing the ordered - * extent (btrfs_finish_ordered_io()). Also note that the caller of - * btrfs_run_delalloc_range already does proper cleanup for the first page of - * the range, that is, it invokes the callback writepage_end_io_hook() for the - * range of the first page. + * extent (btrfs_finish_ordered_io()). */ static inline void btrfs_cleanup_ordered_extents(struct inode *inode, -const u64 offset, -const u64 bytes) +struct page *locked_page, +u64 offset, u64 bytes) { unsigned long index = offset >> PAGE_SHIFT; unsigned long end_index = (offset + bytes - 1) >> PAGE_SHIFT; + u64 page_start = page_offset(locked_page); + u64 page_end = page_start + PAGE_SIZE - 1; + struct page *page; while (index <= end_index) { @@ -131,8 +131,18 @@ static inline void btrfs_cleanup_ordered_extents(struct inode *inode, ClearPagePrivate2(page); put_page(page); } - return __endio_write_update_ordered(inode, offset + PAGE_SIZE, - bytes - PAGE_SIZE, false); + + /* +* In case this page belongs to the delalloc range being instantiated +* then skip it, since the first page of a range is going to be +* properly cleaned up by the caller of run_delalloc_range +*/ + if (page_start >= offset && page_end <= (offset + bytes - 1)) { + offset += PAGE_SIZE; + bytes -= PAGE_SIZE; + } + + return __endio_write_update_ordered(inode, offset, bytes, false); } static int btrfs_dirty_inode(struct inode *inode); @@ -1605,7 +1615,8 @@ int btrfs_run_delalloc_range(void *private_data, struct page *locked_page,