Re: Ran into "invalid block group size" bug, unclear how to proceed.
Apologies for the dupe Chris, I neglected to hit Reply-All. Comments below.

On Mon, Dec 3, 2018 at 9:56 PM Chris Murphy wrote:
>
> On Mon, Dec 3, 2018 at 8:32 PM Mike Javorski wrote:
> >
> > Need a bit of advice here ladies / gents. I am running into an issue
> > which Qu Wenruo seems to have posted a patch for several weeks ago
> > (see https://patchwork.kernel.org/patch/10694997/).
> >
> > Here is the relevant dmesg output which led me to Qu's patch.
> >
> > [   10.032475] BTRFS critical (device sdb): corrupt leaf: root=2
> > block=24655027060736 slot=20 bg_start=13188988928 bg_len=10804527104,
> > invalid block group size, have 10804527104 expect (0, 10737418240]
> > [   10.032493] BTRFS error (device sdb): failed to read block groups: -5
> > [   10.053365] BTRFS error (device sdb): open_ctree failed
> >
> > This server has a 16 disk btrfs filesystem (RAID6) which I boot
> > periodically to btrfs-send snapshots to. This machine is running
> > ArchLinux and I had just updated to their latest 4.19.4 kernel
> > package (from 4.18.10 which was working fine). I've tried updating to
> > the 4.19.6 kernel that is in testing, but that doesn't seem to resolve
> > the issue. From what I can see on kernel.org, the patch above is not
> > pushed to stable or to Linus' tree.
> >
> > At this point the question is what to do. Is my FS toast? Could I
> > revert to the 4.18.10 kernel and boot safely? I don't know if the 4.19
> > boot process may have flipped some bits which would make reverting
> > problematic.
>
> That patch is not yet merged in linux-next, so to use it you'd need to
> apply it yourself and compile a kernel. I can't tell for sure if it'd
> help.
>
> But the less you change the file system, the better the chance of
> saving it. I have no idea why there'd be a corrupt leaf just due to a
> kernel version change, though.
>
> Needless to say, raid56 just seems fragile once it runs into any kind
> of trouble. I personally wouldn't boot off it at all. I would only
> mount it from another system, ideally an installed system, but a live
> system with the kernel versions you need would also work. That way you
> can get more information without changes; booting will almost
> immediately mount rw, if mount succeeds at all, and will write a bunch
> of changes to the file system.

If the boot could corrupt the disk, that ship has already sailed: I have
previously attempted to mount the volume with the 4.19.4 and 4.19.6
kernels, and both failed, reporting the log lines in my original message.

I am hoping that Qu notices this thread at some point, as they are the
author of the original patch which introduced the check that is now
failing, as well as the unmerged patch linked earlier that adjusts the
check condition.

What I don't know is whether the checks up to that mount failure were
read-only (and thus I can revert to the older kernel), or whether
something was written to disk before the mount call failed. I don't want
to risk the 23 TiB of snapshot data stored there if there's an easy
workaround :-). I realize there are risks with the RAID56 code, but I've
done my best to mitigate them with server-grade hardware, ECC memory, a
UPS, and a redundant copy of the data via btrfs-send to this machine.
Losing this snapshot volume is not the end of the world, but I am about
to upgrade the primary server (which is currently running 4.19.4 without
issue, btw) and want to have a best-effort snapshot/backup in place
before I do so.

> Whether it's a case of 4.18.10 not detecting corruption that 4.19
> sees, or if 4.19 already caused it, the best chance is to not mount it
> rw, and not run check --repair, until you get some feedback from a
> developer.

+1

> The thing I'd like to see is
>
> # btrfs rescue super -v /anydevice/
> # btrfs insp dump-s -f /anydevice/
>
> First command will tell us if all the supers are the same and valid
> across all devices. And the second one, hopefully it's pointed to a
> device with a valid super, will tell us if there's a log root value
> other than 0. Both of those are read-only commands.

It was my understanding that rescue writes to the disk (the man page
indicates it recovers superblocks, which would imply writes). If you are
sure they both leave the on-disk data structures completely intact, I
would be willing to run them.

- mike
Re: [PATCH 2/5] btrfs: Refactor btrfs_can_relocate
On 3.12.18 19:25, David Sterba wrote:
> On Sat, Nov 17, 2018 at 09:29:27AM +0800, Anand Jain wrote:
>>> -	ret = find_free_dev_extent(trans, device, min_free,
>>> -				   &dev_offset, NULL);
>>> -	if (!ret)
>>> +	if (!find_free_dev_extent(trans, device, min_free,
>>> +				  &dev_offset, NULL))
>>
>> This can return -ENOMEM;
>>
>>> @@ -2856,8 +2856,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>>>  	 */
>>>  	lockdep_assert_held(&fs_info->delete_unused_bgs_mutex);
>>>
>>> -	ret = btrfs_can_relocate(fs_info, chunk_offset);
>>> -	if (ret)
>>> +	if (!btrfs_can_relocate(fs_info, chunk_offset))
>>>  		return -ENOSPC;
>>
>> And ends up converting -ENOMEM to -ENOSPC.
>>
>> It's better to propagate the accurate error.
>
> Right, converting to bool is obscuring the reason why the functions
> fail. Making the code simpler at this cost does not look like a good
> idea to me. I'll remove the patch from misc-next for now.

The patch itself does not make the code more obscure than it currently
is, because even if ENOMEM is returned it's still converted to ENOSPC in
btrfs_relocate_chunk.
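Anand's objection is easiest to see with a toy model of the call chain. The Python sketch below stands in for the kernel C purely for illustration (function names mirror the kernel ones for readability; errno values are the standard Linux ones): once the helper's return value is collapsed to a bool, the caller can no longer distinguish an allocation failure from "no space to relocate into".

```python
import errno

def find_free_dev_extent(oom: bool) -> int:
    """Returns 0 on success or a negative errno, like the kernel helper."""
    return -errno.ENOMEM if oom else 0

def btrfs_can_relocate(oom: bool) -> bool:
    # The refactor under review: the -ENOMEM detail is dropped right here.
    return find_free_dev_extent(oom) == 0

def btrfs_relocate_chunk(oom: bool) -> int:
    if not btrfs_can_relocate(oom):
        return -errno.ENOSPC    # every failure now surfaces as "no space"
    return 0

# An OOM inside the helper is reported to the caller as -ENOSPC.
print(btrfs_relocate_chunk(oom=True) == -errno.ENOSPC)  # -> True
```

This is exactly the conversion Nikolay notes already happens in btrfs_relocate_chunk today, which is why he argues the patch does not make things worse.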
Re: experiences running btrfs on external USB disks?
On 2018-12-04 14:59, Chris Murphy wrote:
>> Running 4.19.6 right now, but was experiencing the issue also with
>> 4.18 kernels.
>>
>> # btrfs device stats /data
>> [/dev/sda1].write_io_errs    0
>> [/dev/sda1].read_io_errs     0
>> [/dev/sda1].flush_io_errs    0
>> [/dev/sda1].corruption_errs  0
>> [/dev/sda1].generation_errs  0
>
> Hard to say without a complete dmesg; but errno=-5 IO failure is
> pretty much some kind of hardware problem in my experience. I haven't
> seen it be a bug.

It is a complete dmesg - in the sense of:

# grep -i btrfs -A5 -B5 /var/log/syslog

Dec  4 05:06:56 step snapd[747]: udevmon.go:184: udev monitor observed remove event for unknown device "/sys/skbuff_head_cache(1481:anacron.service)"
Dec  4 05:06:56 step snapd[747]: udevmon.go:184: udev monitor observed remove event for unknown device "/sys/buffer_head(1481:anacron.service)"
Dec  4 05:06:56 step snapd[747]: udevmon.go:184: udev monitor observed remove event for unknown device "/sys/ext4_inode_cache(1481:anacron.service)"
Dec  4 05:15:01 step CRON[9352]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec  4 05:17:01 step CRON[9358]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec  4 05:23:13 step kernel: [77760.444607] BTRFS error (device sdb1): bad tree block start, want 378372096 have 0
Dec  4 05:23:13 step kernel: [77760.550933] BTRFS error (device sdb1): bad tree block start, want 378372096 have 0
Dec  4 05:23:13 step kernel: [77760.550972] BTRFS: error (device sdb1) in __btrfs_free_extent:6804: errno=-5 IO failure
Dec  4 05:23:13 step kernel: [77760.550979] BTRFS info (device sdb1): forced readonly
Dec  4 05:23:13 step kernel: [77760.551003] BTRFS: error (device sdb1) in btrfs_run_delayed_refs:2935: errno=-5 IO failure
Dec  4 05:23:13 step kernel: [77760.553223] BTRFS error (device sdb1): pending csums is 4096
Dec  4 05:23:14 step postfix/pickup[8993]: 13BBE460F86: uid=0 from=
Dec  4 05:23:14 step postfix/cleanup[9398]: 13BBE460F86: message-id=<20181204052314.13BBE460F86@step>
Dec  4 05:23:14 step postfix/qmgr[2745]: 13BBE460F86: from=, size=404, nrcpt=1 (queue active)
Dec  4 05:23:14 step postfix/pickup[8993]: 40A964603EC: uid=0 from=

[...some emails follow, usual CRON messages etc., but nothing at all
generated by the kernel, no hardware issue reported...]

Tomasz Chmielewski
Re: experiences running btrfs on external USB disks?
On Mon, Dec 3, 2018 at 10:44 PM Tomasz Chmielewski wrote: > > I'm trying to use btrfs on an external USB drive, without much success. > > When the drive is connected for 2-3+ days, the filesystem gets remounted > readonly, with BTRFS saying "IO failure": > > [77760.444607] BTRFS error (device sdb1): bad tree block start, want > 378372096 have 0 > [77760.550933] BTRFS error (device sdb1): bad tree block start, want > 378372096 have 0 > [77760.550972] BTRFS: error (device sdb1) in __btrfs_free_extent:6804: > errno=-5 IO failure > [77760.550979] BTRFS info (device sdb1): forced readonly > [77760.551003] BTRFS: error (device sdb1) in > btrfs_run_delayed_refs:2935: errno=-5 IO failure > [77760.553223] BTRFS error (device sdb1): pending csums is 4096 > > > Note that there are no other kernel messages (i.e. that would indicate a > problem with disk, cable disconnection etc.). > > The load on the drive itself can be quite heavy at times (i.e. 100% IO > for 1-2 h and more) - can it contribute to the problem (i.e. btrfs > thinks there is some timeout somewhere)? > > Running 4.19.6 right now, but was experiencing the issue also with 4.18 > kernels. > > > > # btrfs device stats /data > [/dev/sda1].write_io_errs0 > [/dev/sda1].read_io_errs 0 > [/dev/sda1].flush_io_errs0 > [/dev/sda1].corruption_errs 0 > [/dev/sda1].generation_errs 0 Hard to say without a complete dmesg; but errno=-5 IO failure is pretty much some kind of hardware problem in my experience. I haven't seen it be a bug. -- Chris Murphy
Re: Ran into "invalid block group size" bug, unclear how to proceed.
On Mon, Dec 3, 2018 at 8:32 PM Mike Javorski wrote:
>
> Need a bit of advice here ladies / gents. I am running into an issue
> which Qu Wenruo seems to have posted a patch for several weeks ago
> (see https://patchwork.kernel.org/patch/10694997/).
>
> Here is the relevant dmesg output which led me to Qu's patch.
>
> [   10.032475] BTRFS critical (device sdb): corrupt leaf: root=2
> block=24655027060736 slot=20 bg_start=13188988928 bg_len=10804527104,
> invalid block group size, have 10804527104 expect (0, 10737418240]
> [   10.032493] BTRFS error (device sdb): failed to read block groups: -5
> [   10.053365] BTRFS error (device sdb): open_ctree failed
>
> This server has a 16 disk btrfs filesystem (RAID6) which I boot
> periodically to btrfs-send snapshots to. This machine is running
> ArchLinux and I had just updated to their latest 4.19.4 kernel
> package (from 4.18.10 which was working fine). I've tried updating to
> the 4.19.6 kernel that is in testing, but that doesn't seem to resolve
> the issue. From what I can see on kernel.org, the patch above is not
> pushed to stable or to Linus' tree.
>
> At this point the question is what to do. Is my FS toast? Could I
> revert to the 4.18.10 kernel and boot safely? I don't know if the 4.19
> boot process may have flipped some bits which would make reverting
> problematic.

That patch is not yet merged in linux-next, so to use it you'd need to
apply it yourself and compile a kernel. I can't tell for sure if it'd
help.

But the less you change the file system, the better the chance of saving
it. I have no idea why there'd be a corrupt leaf just due to a kernel
version change, though.

Needless to say, raid56 just seems fragile once it runs into any kind of
trouble. I personally wouldn't boot off it at all. I would only mount it
from another system, ideally an installed system, but a live system with
the kernel versions you need would also work. That way you can get more
information without changes; booting will almost immediately mount rw,
if mount succeeds at all, and will write a bunch of changes to the file
system.

Whether it's a case of 4.18.10 not detecting corruption that 4.19 sees,
or if 4.19 already caused it, the best chance is to not mount it rw, and
not run check --repair, until you get some feedback from a developer.

The thing I'd like to see is

# btrfs rescue super -v /anydevice/
# btrfs insp dump-s -f /anydevice/

The first command will tell us if all the supers are the same and valid
across all devices. And the second one, hopefully pointed at a device
with a valid super, will tell us if there's a log root value other
than 0. Both of those are read-only commands.

-- Chris Murphy
experiences running btrfs on external USB disks?
I'm trying to use btrfs on an external USB drive, without much success.

When the drive is connected for 2-3+ days, the filesystem gets remounted
readonly, with BTRFS saying "IO failure":

[77760.444607] BTRFS error (device sdb1): bad tree block start, want 378372096 have 0
[77760.550933] BTRFS error (device sdb1): bad tree block start, want 378372096 have 0
[77760.550972] BTRFS: error (device sdb1) in __btrfs_free_extent:6804: errno=-5 IO failure
[77760.550979] BTRFS info (device sdb1): forced readonly
[77760.551003] BTRFS: error (device sdb1) in btrfs_run_delayed_refs:2935: errno=-5 IO failure
[77760.553223] BTRFS error (device sdb1): pending csums is 4096

Note that there are no other kernel messages (i.e. that would indicate a
problem with disk, cable disconnection etc.).

The load on the drive itself can be quite heavy at times (i.e. 100% IO
for 1-2 h and more) - can it contribute to the problem (i.e. btrfs
thinks there is some timeout somewhere)?

Running 4.19.6 right now, but was experiencing the issue also with 4.18
kernels.

# btrfs device stats /data
[/dev/sda1].write_io_errs    0
[/dev/sda1].read_io_errs     0
[/dev/sda1].flush_io_errs    0
[/dev/sda1].corruption_errs  0
[/dev/sda1].generation_errs  0

Tomasz Chmielewski
Re: Ran into "invalid block group size" bug, unclear how to proceed.
Apologies for not scouring the mailing list completely (just subscribed,
in fact); it appears that Patrick Dijkgraaf also ran into this issue. He
went ahead with a volume rebuild, whereas I am hoping I can recover,
having run nothing more than a "btrfs scan" and "btrfs device ready" on
this machine since the kernel upgrade; the FS was intact up to that
point.

- mike

On Mon, Dec 3, 2018 at 7:32 PM Mike Javorski wrote:
>
> Need a bit of advice here ladies / gents. I am running into an issue
> which Qu Wenruo seems to have posted a patch for several weeks ago
> (see https://patchwork.kernel.org/patch/10694997/).
>
> Here is the relevant dmesg output which led me to Qu's patch.
>
> [   10.032475] BTRFS critical (device sdb): corrupt leaf: root=2
> block=24655027060736 slot=20 bg_start=13188988928 bg_len=10804527104,
> invalid block group size, have 10804527104 expect (0, 10737418240]
> [   10.032493] BTRFS error (device sdb): failed to read block groups: -5
> [   10.053365] BTRFS error (device sdb): open_ctree failed
>
> This server has a 16 disk btrfs filesystem (RAID6) which I boot
> periodically to btrfs-send snapshots to. This machine is running
> ArchLinux and I had just updated to their latest 4.19.4 kernel
> package (from 4.18.10 which was working fine). I've tried updating to
> the 4.19.6 kernel that is in testing, but that doesn't seem to resolve
> the issue. From what I can see on kernel.org, the patch above is not
> pushed to stable or to Linus' tree.
>
> At this point the question is what to do. Is my FS toast? Could I
> revert to the 4.18.10 kernel and boot safely? I don't know if the 4.19
> boot process may have flipped some bits which would make reverting
> problematic.
>
> Thanks much,
>
> - mike
Ran into "invalid block group size" bug, unclear how to proceed.
Need a bit of advice here ladies / gents. I am running into an issue
which Qu Wenruo seems to have posted a patch for several weeks ago (see
https://patchwork.kernel.org/patch/10694997/).

Here is the relevant dmesg output which led me to Qu's patch.

[   10.032475] BTRFS critical (device sdb): corrupt leaf: root=2 block=24655027060736 slot=20 bg_start=13188988928 bg_len=10804527104, invalid block group size, have 10804527104 expect (0, 10737418240]
[   10.032493] BTRFS error (device sdb): failed to read block groups: -5
[   10.053365] BTRFS error (device sdb): open_ctree failed

This server has a 16 disk btrfs filesystem (RAID6) which I boot
periodically to btrfs-send snapshots to. This machine is running
ArchLinux and I had just updated to their latest 4.19.4 kernel package
(from 4.18.10 which was working fine). I've tried updating to the 4.19.6
kernel that is in testing, but that doesn't seem to resolve the issue.
From what I can see on kernel.org, the patch above is not pushed to
stable or to Linus' tree.

At this point the question is what to do. Is my FS toast? Could I revert
to the 4.18.10 kernel and boot safely? I don't know if the 4.19 boot
process may have flipped some bits which would make reverting
problematic.

Thanks much,

- mike
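The arithmetic in that dmesg line is worth checking directly. A short Python sketch (illustrative only; the constant is simply the "expect (0, 10737418240]" bound the kernel printed, i.e. 10 GiB) mirroring the tree-checker condition that rejects the leaf:

```python
# Mirror of the size check implied by the dmesg line above:
# bg_len must lie in (0, 10 GiB].
MAX_BG_SIZE = 10 * 1024**3          # 10737418240, the bound in the log

def block_group_size_valid(bg_len: int) -> bool:
    """True if bg_len passes the 4.19 tree-checker bound from the log."""
    return 0 < bg_len <= MAX_BG_SIZE

bg_len = 10804527104                # value reported for the corrupt leaf
print(block_group_size_valid(bg_len))   # -> False, hence open_ctree fails
print(bg_len - MAX_BG_SIZE)             # -> 67108864: exactly 64 MiB over
```

Notably, the block group is only 64 MiB over the checker's cap, which fits the thread's premise that the check condition (the one the unmerged patch adjusts) is too strict, rather than the leaf being random garbage.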
Re: Need help with potential ~45TB dataloss
Also useful information for autopsy, perhaps not for fixing, is whether
the SCT ERC value for every drive is less than the kernel's SCSI driver
block device command timeout value. It's super important that the drive
reports an explicit read failure before the read command is considered
failed by the kernel. If the drive is still trying to do a read and the
kernel command timer times out, the kernel will just reset the whole
link, and we lose the outcome of the hanging command. Only upon an
explicit read error can Btrfs, or md RAID, know which device and
physical sector has a problem, and therefore how to reconstruct the
block and fix the bad sector with a write of known good data.

smartctl -l scterc /device/

and

cat /sys/block/sda/device/timeout

Only if SCT ERC is enabled with a value below 30, or if the kernel
command timer is changed to be well above 30 (like 180, which is
absolutely crazy but a separate conversation), can we be sure that there
haven't just been resets going on for a while; such resets prevent bad
sectors from being fixed up and can contribute to the problem. This
comes up on the linux-raid (mainly md driver) list all the time, and it
contributes to lost RAID all the time. Arguably it also leads to
unnecessary data loss in the single device desktop/laptop use case.

Chris Murphy
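The comparison Chris describes can be sketched in a few lines. smartctl reports SCT ERC in tenths of a second, while the sysfs timeout is in seconds; the example values below are illustrative, and the 120 s internal retry limit assumed for a drive with ERC disabled is a stand-in, not a quoted spec:

```python
def drive_reports_error_first(scterc_deciseconds: int,
                              kernel_timeout_seconds: int,
                              disabled_retry_seconds: int = 120) -> bool:
    """True if the drive gives up on a bad sector (explicit read error)
    before the kernel command timer fires and resets the link.
    scterc_deciseconds == 0 means ERC is disabled; the drive's internal
    retry limit is then unknown, so an assumed ceiling is used."""
    if scterc_deciseconds == 0:
        drive_limit = disabled_retry_seconds
    else:
        drive_limit = scterc_deciseconds / 10
    return drive_limit < kernel_timeout_seconds

print(drive_reports_error_first(70, 30))   # 7 s ERC vs 30 s timer -> True
print(drive_reports_error_first(0, 30))    # ERC disabled -> False, resets likely
print(drive_reports_error_first(0, 180))   # raised timer, as Chris suggests -> True
```

This matches Chris's two remedies: either enable ERC below 30 s, or raise the kernel command timer far above the drive's worst-case recovery time.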
Re: BTRFS Mount Delay Time Graph
On Mon, Dec 3, 2018 at 1:04 PM Lionel Bouton wrote:
>
> Le 03/12/2018 à 20:56, Lionel Bouton a écrit :
> > [...]
> > Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> > tuning of the io queue (switching between classic io-schedulers and
> > blk-mq ones in the virtual machines) and BTRFS mount options
> > (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> > in mount time (I managed to reduce the mount of IO requests
>
> Sent too quickly: I meant to write "managed to reduce by half the number
> of IO write requests for the same amount of data written"
>
> > by half on
> > one server in production though although more tests are needed to
> > isolate the cause).

Interesting. I wonder whether it's ssd_spread or space_cache=v2 that
reduces the writes by half, and by how much each contributes. That's a
major reduction in writes, and it suggests further optimization may be
possible to help mitigate the wandering trees impact.

-- Chris Murphy
Re: BTRFS Mount Delay Time Graph
On 2018/12/4 上午2:20, Wilson, Ellis wrote: > Hi all, > > Many months ago I promised to graph how long it took to mount a BTRFS > filesystem as it grows. I finally had (made) time for this, and the > attached is the result of my testing. The image is a fairly > self-explanatory graph, and the raw data is also attached in > comma-delimited format for the more curious. The columns are: > Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s). > > Experimental setup: > - System: > Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 > 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux > - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives. > - 3 unmount/mount cycles performed in between adding another 250GB of data > - 250GB of data added each time in the form of 25x10GB files in their > own directory. Files generated in parallel each epoch (25 at the same > time, with a 1MB record size). > - 240 repetitions of this performed (to collect timings in increments of > 250GB between a 0GB and 60TB filesystem) > - Normal "time" command used to measure time to mount. "Real" time used > of the timings reported from time. > - Mount: > /dev/md0 on /btrfs type btrfs > (rw,relatime,space_cache=v2,subvolid=5,subvol=/) > > At 60TB, we take 30s to mount the filesystem, which is actually not as > bad as I originally thought it would be (perhaps as a result of using > RAID0 via mdraid rather than native RAID0 in BTRFS). However, I am open > to comment if folks more intimately familiar with BTRFS think this is > due to the very large files I've used. I can redo the test with much > more realistic data if people have legitimate reason to think it will > drastically change the result. > > With 14TB drives available today, it doesn't take more than a handful of > drives to result in a filesystem that takes around a minute to mount. 
> As a result of this, I suspect this will become an increasing problem
> for serious users of BTRFS as time goes on. I'm not complaining as I'm
> not a contributor so I have no room to do so -- just shedding some
> light on a problem that may deserve attention as filesystem sizes
> continue to grow.

This problem is somewhat known.

If you dig further, it's btrfs_read_block_groups() which tries to read
*ALL* block group items. And, to no one's surprise, the larger the fs
grows, the more block group items need to be read from disk.

We need some way to delay such reads to improve this case.

Thanks,
Qu

> Best,
>
> ellis
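Qu's point about btrfs_read_block_groups() scanning every block group item also makes the measured 30 s at 60 TB plausible on a back-of-the-envelope basis. The sketch below assumes roughly 1 GiB per data block group and 0.5 ms per cold random read; both numbers are my assumptions for illustration, not measurements from the thread:

```python
# One block group item ~= one cold random read at mount time.
GiB = 1024**3
fs_size = 60 * 1024 * GiB      # the ~60 TB end point of Ellis's graph
bg_size = GiB                  # assumed size of one data block group
random_read_s = 0.0005         # assumed 0.5 ms per cold random read

block_groups = fs_size // bg_size
mount_estimate_s = block_groups * random_read_s
print(block_groups)            # -> 61440
print(mount_estimate_s)        # ~30 s, the same order as measured
```

Tens of thousands of scattered reads at sub-millisecond latency lands right in the observed range, which is consistent with mount time being bounded by cold random read IOPS rather than raw bandwidth.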
Re: BTRFS Mount Delay Time Graph
Hi, On 12/3/18 8:56 PM, Lionel Bouton wrote: > > Le 03/12/2018 à 19:20, Wilson, Ellis a écrit : >> >> Many months ago I promised to graph how long it took to mount a BTRFS >> filesystem as it grows. I finally had (made) time for this, and the >> attached is the result of my testing. The image is a fairly >> self-explanatory graph, and the raw data is also attached in >> comma-delimited format for the more curious. The columns are: >> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s). >> >> Experimental setup: >> - System: >> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 >> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux >> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives. >> - 3 unmount/mount cycles performed in between adding another 250GB of data >> - 250GB of data added each time in the form of 25x10GB files in their >> own directory. Files generated in parallel each epoch (25 at the same >> time, with a 1MB record size). >> - 240 repetitions of this performed (to collect timings in increments of >> 250GB between a 0GB and 60TB filesystem) >> - Normal "time" command used to measure time to mount. "Real" time used >> of the timings reported from time. >> - Mount: >> /dev/md0 on /btrfs type btrfs >> (rw,relatime,space_cache=v2,subvolid=5,subvol=/) >> >> At 60TB, we take 30s to mount the filesystem, which is actually not as >> bad as I originally thought it would be (perhaps as a result of using >> RAID0 via mdraid rather than native RAID0 in BTRFS). However, I am open >> to comment if folks more intimately familiar with BTRFS think this is >> due to the very large files I've used. Probably yes. The thing that is happening is that all block group items are read from the extent tree. And, instead of being nicely grouped together, they are scattered all over the place, at their virtual address, in between all normal extent items. 
So, mount time depends on cold random read iops your storage can do, and the size of the extent tree and amount of block groups. And, your extent tree has more items in it if you have more extents. So, yes, writing a lot of 4kiB files should have a similar effect I think as a lot of 128MiB files that are still stored in 1 extent per file. > I can redo the test with much >> more realistic data if people have legitimate reason to think it will >> drastically change the result. > > We are hosting some large BTRFS filesystems on Ceph (RBD used by > QEMU/KVM). I believe the delay is heavily linked to the number of files > (I didn't check if snapshots matter and I suspect it does but not as > much as the number of "original" files at least if you don't heavily > modify existing files but mostly create new ones as we do). > As an example, we have a filesystem with 20TB used space with 4 > subvolumes hosting multi millions files/directories (probably 10-20 > millions total I didn't check the exact number recently as simply > counting files is a very long process) and 40 snapshots for each volume. > Mount takes about 15 minutes. > We have virtual machines that we don't reboot as often as we would like > because of these slow mount times. > > If you want to study this, you could : > - graph the delay for various individual file sizes (instead of 25x10GB, > create 2 500 x 100MB and 250 000 x 1MB files between each run and > compare to the original result) > - graph the delay vs the number of snapshots (probably starting with a > large number of files in the initial subvolume to start with a non > trivial mount delay) > You may want to study the impact of the differences between snapshots by > comparing snapshoting without modifications and snapshots made at > various stages of your suvolume growth. 
> > Note : recently I tried upgrading from 4.9 to 4.14 kernels, various > tuning of the io queue (switching between classic io-schedulers and > blk-mq ones in the virtual machines) and BTRFS mount options > (space_cache=v2,ssd_spread) but there wasn't any measurable improvement > in mount time (I managed to reduce the mount of IO requests by half on > one server in production though although more tests are needed to > isolate the cause). > I didn't expect much for the mount times, it seems to me that mount is > mostly constrained by the BTRFS on disk structures needed at mount time > and how the filesystem reads them (for example it doesn't benefit at all > from large IO queue depths which probably means that each read depends > on previous ones which prevents io-schedulers from optimizing anything). Yes, I think that's true. See btrfs_read_block_groups in extent-tree.c: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c#n9982 What the code is doing here is starting at the beginning of the extent tree, searching forward until it sees the first BLOCK_GROUP_ITEM (which is not that far away), and then based on the information in it, computes w
Re: [PATCH] btrfs-progs: trivial fix about line breaker in repair_inode_nbytes_lowmem()
On Sun, Dec 02, 2018 at 03:08:36PM +, damenly...@gmail.com wrote: > From: Su Yue > > Move "\n" at end of the sentence to print. > > Fixes: 281eec7a9ddf ("btrfs-progs: check: repair inode nbytes in lowmem mode") > Signed-off-by: Su Yue Applied, thanks.
Re: BTRFS Mount Delay Time Graph
Le 03/12/2018 à 20:56, Lionel Bouton a écrit :
> [...]
> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> tuning of the io queue (switching between classic io-schedulers and
> blk-mq ones in the virtual machines) and BTRFS mount options
> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> in mount time (I managed to reduce the mount of IO requests

Sent too quickly: I meant to write "managed to reduce by half the number
of IO write requests for the same amount of data written".

> by half on
> one server in production though although more tests are needed to
> isolate the cause).
Re: BTRFS Mount Delay Time Graph
Hi, Le 03/12/2018 à 19:20, Wilson, Ellis a écrit : > Hi all, > > Many months ago I promised to graph how long it took to mount a BTRFS > filesystem as it grows. I finally had (made) time for this, and the > attached is the result of my testing. The image is a fairly > self-explanatory graph, and the raw data is also attached in > comma-delimited format for the more curious. The columns are: > Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s). > > Experimental setup: > - System: > Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 > 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux > - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives. > - 3 unmount/mount cycles performed in between adding another 250GB of data > - 250GB of data added each time in the form of 25x10GB files in their > own directory. Files generated in parallel each epoch (25 at the same > time, with a 1MB record size). > - 240 repetitions of this performed (to collect timings in increments of > 250GB between a 0GB and 60TB filesystem) > - Normal "time" command used to measure time to mount. "Real" time used > of the timings reported from time. > - Mount: > /dev/md0 on /btrfs type btrfs > (rw,relatime,space_cache=v2,subvolid=5,subvol=/) > > At 60TB, we take 30s to mount the filesystem, which is actually not as > bad as I originally thought it would be (perhaps as a result of using > RAID0 via mdraid rather than native RAID0 in BTRFS). However, I am open > to comment if folks more intimately familiar with BTRFS think this is > due to the very large files I've used. I can redo the test with much > more realistic data if people have legitimate reason to think it will > drastically change the result. We are hosting some large BTRFS filesystems on Ceph (RBD used by QEMU/KVM). 
I believe the delay is heavily linked to the number of files (I didn't
check if snapshots matter; I suspect they do, but not as much as the
number of "original" files, at least if you don't heavily modify
existing files but mostly create new ones, as we do).

As an example, we have a filesystem with 20TB used space, with 4
subvolumes hosting multiple millions of files/directories (probably
10-20 millions total; I didn't check the exact number recently, as
simply counting files is a very long process) and 40 snapshots for each
volume. Mount takes about 15 minutes. We have virtual machines that we
don't reboot as often as we would like because of these slow mount
times.

If you want to study this, you could:
- graph the delay for various individual file sizes (instead of 25x10GB,
  create 2500 x 100MB and 250000 x 1MB files between each run and
  compare to the original result)
- graph the delay vs the number of snapshots (probably starting with a
  large number of files in the initial subvolume, to start with a
  non-trivial mount delay)

You may want to study the impact of the differences between snapshots by
comparing snapshotting without modifications and snapshots made at
various stages of your subvolume growth.

Note: recently I tried upgrading from 4.9 to 4.14 kernels, various
tuning of the io queue (switching between classic io-schedulers and
blk-mq ones in the virtual machines) and BTRFS mount options
(space_cache=v2,ssd_spread), but there wasn't any measurable improvement
in mount time (I managed to reduce the amount of IO requests by half on
one server in production, though more tests are needed to isolate the
cause).

I didn't expect much for the mount times; it seems to me that mount is
mostly constrained by the BTRFS on-disk structures needed at mount time
and how the filesystem reads them (for example, it doesn't benefit at
all from large IO queue depths, which probably means that each read
depends on previous ones, preventing io-schedulers from optimizing
anything).

Best regards,

Lionel
Re: [PATCH] btrfs-progs: fsck-tests: Move reloc tree images to 020-extent-ref-cases
On Mon, Dec 03, 2018 at 12:39:57PM +0800, Qu Wenruo wrote: > For reloc tree, despite of its short lifespan, it's still the backref, > where reloc tree root backref points back to itself, makes it special. > > So it's more approriate to put them into 020-extent-ref-cases. > > Signed-off-by: Qu Wenruo Applied, thanks.
BTRFS Mount Delay Time Graph
Hi all,

Many months ago I promised to graph how long it took to mount a BTRFS
filesystem as it grows. I finally had (made) time for this, and the
attached is the result of my testing. The image is a fairly
self-explanatory graph, and the raw data is also attached in
comma-delimited format for the more curious. The columns are: Filesystem
Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).

Experimental setup:
- System: Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT
  Mon Nov 26 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
- 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
- 3 unmount/mount cycles performed in between adding another 250GB of data
- 250GB of data added each time in the form of 25x10GB files in their
  own directory. Files generated in parallel each epoch (25 at the same
  time, with a 1MB record size).
- 240 repetitions of this performed (to collect timings in increments of
  250GB between a 0GB and 60TB filesystem)
- Normal "time" command used to measure time to mount. "Real" time used
  of the timings reported from time.
- Mount: /dev/md0 on /btrfs type btrfs
  (rw,relatime,space_cache=v2,subvolid=5,subvol=/)

At 60TB, we take 30s to mount the filesystem, which is actually not as
bad as I originally thought it would be (perhaps as a result of using
RAID0 via mdraid rather than native RAID0 in BTRFS). However, I am open
to comment if folks more intimately familiar with BTRFS think this is
due to the very large files I've used. I can redo the test with much
more realistic data if people have legitimate reason to think it will
drastically change the result.

With 14TB drives available today, it doesn't take more than a handful of
drives to result in a filesystem that takes around a minute to mount. As
a result of this, I suspect this will become an increasing problem for
serious users of BTRFS as time goes on.
I'm not complaining as I'm not a contributor so I have no room to do so -- just shedding some light on a problem that may deserve attention as filesystem sizes continue to grow. Best, ellis 0,0.018,0.037,0.016 250,0.245,0.098,0.066 500,0.417,0.119,0.138 750,0.284,0.073,0.066 1000,0.506,0.109,0.126 1250,0.824,0.134,0.204 1500,0.779,0.098,0.147 1750,0.805,0.107,0.215 2000,0.87,0.137,0.223 2250,1.009,0.168,0.226 2500,1.094,0.147,0.174 2750,0.908,0.137,0.246 3000,1.144,0.182,0.313 3250,1.232,0.209,0.312 3500,1.287,0.259,0.292 3750,1.29,0.166,0.298 4000,1.521,0.249,0.418 4250,1.448,0.341,0.395 4500,1.441,0.383,0.362 4750,1.555,0.35,0.371 5000,1.825,0.482,0.638 5250,1.731,0.69,0.928 5500,1.8,0.353,0.348 5750,1.979,0.295,1.194 6000,2.115,0.915,1.241 6250,2.238,0.614,1.735 6500,2.025,0.523,0.536 6750,2.15,0.458,0.727 7000,2.415,2.158,1.925 7250,2.589,1.059,2.24 7500,2.371,1.796,2.102 7750,2.737,1.579,1.659 8000,2.768,1.786,2.579 8250,2.979,2.544,2.654 8500,2.994,2.529,2.847 8750,3.042,2.283,2.947 9000,3.209,2.509,3.077 9250,3.124,2.7,3.096 9500,3.13,3.048,3.105 9750,3.444,2.702,3.33 10000,3.671,3.354,3.297 10250,3.639,3.468,3.681 10500,3.693,3.651,3.711 10750,3.729,3.135,3.303 11000,3.846,3.862,3.917 11250,4.006,3.668,3.861 11500,4.113,3.919,3.875 11750,3.968,3.774,3.985 12000,4.205,3.882,4.218 12250,4.454,4.354,4.444 12500,4.528,4.441,4.616 12750,4.688,4.206,4.252 13000,4.551,4.507,4.444 13250,4.806,5.059,4.81 13500,5.041,4.662,4.997 13750,5.057,4.394,4.713 14000,5.029,5.03,4.927 14250,5.173,5.259,5.101 14500,5.104,5.3,5.416 14750,4.809,4.62,4.698 15000,5.045,5.066,4.806 15250,5.101,5.159,5.174 15500,5.074,5.245,5.65 15750,5.123,5.031,5.056 16000,5.518,5.097,5.595 16250,5.318,5.463,5.353 16500,5.63,5.689,5.768 16750,5.375,5.24,5.165 17000,5.578,5.846,5.628 17250,5.73,5.774,5.726 17500,6.108,6.202,6.226 17750,5.645,5.668,5.936 18000,6.308,5.925,6.317 18250,6.19,6.171,6.169 18500,6.442,6.601,6.403 18750,6.558,6.44,6.803 19000,6.664,7.176,6.742 19250,7.37,7.414,6.807
19500,7.021,7.143,7.253 19750,7.051,6.691,7.063 20000,6.942,6.858,7.225 20250,7.617,7.39,7.202 20500,7.239,7.525,7.381 20750,7.638,7.332,7.549 21000,7.697,8.081,7.807 21250,7.867,7.929,7.826 21500,7.98,8.208,8.059 21750,7.79,7.614,7.726 22000,8.144,8.611,8.361 22250,8.19,8.558,8.459 22500,8.685,8.785,8.617 22750,8.702,8.454,8.727 23000,8.653,8.699,8.89 23250,8.897,9.328,9.101 23500,9.245,9.456,9.464 23750,9.242,9.072,9.363 24000,9.367,8.934,9.541 24250,9.2,9.754,9.708 24500,9.622,9.472,9.484 24750,9.756,9.672,10.091 25000,10.207,10.304,9.981 25250,10.135,10.166,9.991 25500,9.969,10.234,10.266 25750,10.098,10.515,10.98 26000,10.811,10.6,11.3 26250,11.211,10.761,10.825 26500,10.799,11.075,10.973 26750,10.72,11.12,11.39 27000,11.463,11.106,11.679 27250,11.644,11.363,11.316 27500,11.541,11.748,11.657 27750,11.292,11.794,11.616 28000,11.888,11.697,12.169 28250,12.298,12.183,12.002 28500,12.124,12.48,12.352 28750,11.347,11.815,12.201 29000,12.009,11.72,12.734 29250,11.918,12.02,12.583 29500,12.445,12.439,12.466 29750,12.071,11.863,12.078 30000,12.287,12.18
Re: [PATCH 2/5] btrfs: Refactor btrfs_can_relocate
On Sat, Nov 17, 2018 at 09:29:27AM +0800, Anand Jain wrote: > > - ret = find_free_dev_extent(trans, device, min_free, > > - &dev_offset, NULL); > > - if (!ret) > > + if (!find_free_dev_extent(trans, device, min_free, > > + &dev_offset, NULL)) > > This can return -ENOMEM; > > > @@ -2856,8 +2856,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info > > *fs_info, u64 chunk_offset) > > */ > > lockdep_assert_held(&fs_info->delete_unused_bgs_mutex); > > > > - ret = btrfs_can_relocate(fs_info, chunk_offset); > > - if (ret) > > + if (!btrfs_can_relocate(fs_info, chunk_offset)) > > return -ENOSPC; > > And ends up converting -ENOMEM to -ENOSPC. > > Its better to propagate the accurate error. Right, converting to bool is obscuring the reason why the functions fail. Making the code simpler at this cost does not look like a good idea to me. I'll remove the patch from misc-next for now.
Re: [PATCH 0/4] Replace custom device-replace locking with rwsem
On Tue, Nov 20, 2018 at 01:50:54PM +0100, David Sterba wrote: > The first cleanup part went to 4.19, the actual switch from the custom > locking to rwsem was postponed as I found performance degradation. This > turned out to be related to VM cache settings, so I'm resending the > series again. > > The custom locking is based on rwlock protected reader/writer counters, > waitqueues, which essentially is what the readwrite semaphore does. > > Previous patchset: > https://lore.kernel.org/linux-btrfs/cover.1536331604.git.dste...@suse.com/ > > Patches correspond to 8/11-11/11 and there's no change besides > refreshing on top of current misc-next. > > David Sterba (4): > btrfs: reada: reorder dev-replace locks before radix tree preload > btrfs: dev-replace: swich locking to rw semaphore > btrfs: dev-replace: remove custom read/write blocking scheme > btrfs: dev-replace: open code trivial locking helpers This has been sitting in for-next for some time, no problems reported. If anybody wants to add a review tag, please let me know. I'm going to add the patchset to misc-next soon.
Btrfs progs pre-release 4.19.1-rc1
Hi, this is a pre-release of btrfs-progs, 4.19.1-rc1. There are build fixes, a minor update to libbtrfsutil, and documentation updates. The 4.19.1 release is scheduled for this Wednesday, +2 days (2018-12-05). Changelog: * build fixes * big-endian builds fail due to bswap helpers clash * 'swap' macro is too generic, renamed to prevent build failures * libbtrfs * minor version update to 1.1.0 * fix default search to top=0 as documented * rename 'async' to avoid future python binding problems * add support for unprivileged subvolume listing ioctls * added tests, API docs * other * lot of typos fixed * warning cleanups * doc formatting updates * CI tests against zstd 1.3.7 Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/ Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git Shortlog: David Sterba (7): btrfs-progs: kerncompat: rename swap to __swap btrfs-progs: README: add link to INSTALL btrfs-progs: docs: fix rendering of exponents in manual pages btrfs-progs: link to libbtrfsutil/README from the main README btrfs-progs: tests: pull zstd version 1.3.7 to the travis CI btrfs-progs: update CHANGES for v4.19.1 Btrfs progs v4.19.1-rc1 Josh Soref (11): btrfs-progs: docs: fix typos in Documentation btrfs-progs: docs: fix typos in READMEs, INSTALL and CHANGES btrfs-progs: fix typos in Makefile btrfs-progs: tests: fix typos in test comments btrfs-progs: tests: fsck/025, fix typo in helpre name btrfs-progs: fix typos in comments btrfs-progs: fix typos in user-visible strings btrfs-progs: check: fix typo in device_extent_record::chunk_objectid btrfs-progs: utils: fix typo in a variable btrfs-progs: mkfs: fix typo in "multipler" variables btrfs-progs: fix typo in btrfs-list function export Omar Sandoval (10): libbtrfsutil: use top=0 as default for SubvolumeIterator() libbtrfsutil: change async parameters to async_ in Python bindings libbtrfsutil: document qgroup_inherit parameter in Python bindings libbtrfsutil: use SubvolumeIterator as
context manager in tests libbtrfsutil: add test helpers for dropping privileges libbtrfsutil: allow tests to create multiple Btrfs instances libbtrfsutil: relax the privileges of subvolume_info() libbtrfsutil: relax the privileges of subvolume iterator libbtrfsutil: bump version to 1.1.0 libbtrfsutil: document API in README Rosen Penev (3): btrfs-progs: kernel-lib: bitops: Fix big endian compilation btrfs-progs: task-utils: Fix comparison between pointer and integer btrfs-progs: treewide: Fix missing declarations
Re: Filesystem Corruption
On Mon, Dec 3, 2018, at 4:31 AM, Stefan Malte Schumacher wrote: > I have noticed an unusual amount of crc-errors in downloaded rars, > beginning about a week ago. But lets start with the preliminaries. I > am using Debian Stretch. > Kernel: Linux mars 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u4 > (2018-08-21) x86_64 GNU/Linux > > [5390748.884929] Buffer I/O error on dev dm-0, logical block > 976701312, async page read Excuse me for butting in when there are *many* more qualified people on this list. But assuming the rar crc errors are related to your unexplained buffer I/O errors (and not some weird coincidence of simply bad downloads), I would start, immediately, by testing the memory. RAM corruption can wreak havoc with btrfs (or any filesystem, but I think BTRFS has special challenges in this regard), and this looks like a memory error to me.
[PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput
The cleaner thread usually takes care of delayed iputs, with the exception of the btrfs_end_transaction_throttle path. The cleaner thread only gets woken up every 30 seconds, so instead wake it up to do its work so that we can free up that space as quickly as possible. Reviewed-by: Filipe Manana Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 3 +++ fs/btrfs/disk-io.c | 3 +++ fs/btrfs/inode.c | 2 ++ 3 files changed, 8 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index c8ddbacb6748..dc56a4d940c3 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -769,6 +769,9 @@ bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr); */ #define BTRFS_FS_BALANCE_RUNNING 18 +/* Indicate that the cleaner thread is awake and doing something. */ +#define BTRFS_FS_CLEANER_RUNNING 19 + struct btrfs_fs_info { u8 fsid[BTRFS_FSID_SIZE]; u8 chunk_tree_uuid[BTRFS_UUID_SIZE]; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index c5918ff8241b..f40f6fdc1019 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1669,6 +1669,8 @@ static int cleaner_kthread(void *arg) while (1) { again = 0; + set_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags); + /* Make the cleaner go to sleep early.
*/ if (btrfs_need_cleaner_sleep(fs_info)) goto sleep; @@ -1715,6 +1717,7 @@ static int cleaner_kthread(void *arg) */ btrfs_delete_unused_bgs(fs_info); sleep: + clear_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags); if (kthread_should_park()) kthread_parkme(); if (kthread_should_stop()) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 8ac7abe2ae9b..0b9f3e482cea 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3264,6 +3264,8 @@ void btrfs_add_delayed_iput(struct inode *inode) ASSERT(list_empty(&binode->delayed_iput)); list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs); spin_unlock(&fs_info->delayed_iput_lock); + if (!test_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags)) + wake_up_process(fs_info->cleaner_kthread); } void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info) -- 2.14.3
[PATCH 1/3] btrfs: run delayed iputs before committing
Delayed iputs means we can have final iputs of deleted inodes in the queue, which could potentially generate a lot of pinned space that could be free'd. So before we decide to commit the transaction for ENOSPC reasons, run the delayed iputs so that any potential space is free'd up. If there are any and we freed enough, we can then commit the transaction and potentially be able to make our reservation. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval --- fs/btrfs/extent-tree.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 8dfddfd3f315..0127d272cd2a 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4953,6 +4953,15 @@ static void flush_space(struct btrfs_fs_info *fs_info, ret = 0; break; case COMMIT_TRANS: + /* +* If we have pending delayed iputs then we could free up a +* bunch of pinned space, so make sure we run the iputs before +* we do our pinned bytes check below. +*/ + mutex_lock(&fs_info->cleaner_delayed_iput_mutex); + btrfs_run_delayed_iputs(fs_info); + mutex_unlock(&fs_info->cleaner_delayed_iput_mutex); + ret = may_commit_transaction(fs_info, space_info); break; default: -- 2.14.3
[PATCH 0/3][V2] Delayed iput fixes
v1->v2: - only wake up if the cleaner isn't currently doing work. - re-arranged some stuff for running delayed iputs during flushing. - removed the open code wakeup in the waitqueue patch. -- Original message -- Here are some delayed iput fixes. Delayed iputs can hold reservations for a while and there's no real good way to make sure they were gone for good, which means we could early enospc when in reality if we had just waited for the iput we would have had plenty of space. So fix this up by making us wait for delayed iputs when deciding if we need to commit for enospc flushing, and then cleanup and rework how we run delayed iputs to make it more straightforward to wait on them and make sure we're all done using them. Thanks, Josef
[PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
The throttle path doesn't take cleaner_delayed_iput_mutex, which means we could think we're done flushing iputs in the data space reservation path when we could have a throttler doing an iput. There's no real reason to serialize the delayed iput flushing, so instead of taking the cleaner_delayed_iput_mutex whenever we flush the delayed iputs just replace it with an atomic counter and a waitqueue. This removes the short (or long depending on how big the inode is) window where we think there are no more pending iputs when there really are some. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 4 +++- fs/btrfs/disk-io.c | 5 ++--- fs/btrfs/extent-tree.c | 13 - fs/btrfs/inode.c | 21 + 4 files changed, 34 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index dc56a4d940c3..20af5d6d81f1 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -915,7 +915,8 @@ struct btrfs_fs_info { spinlock_t delayed_iput_lock; struct list_head delayed_iputs; - struct mutex cleaner_delayed_iput_mutex; + atomic_t nr_delayed_iputs; + wait_queue_head_t delayed_iputs_wait; /* this protects tree_mod_seq_list */ spinlock_t tree_mod_seq_lock; @@ -3240,6 +3241,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root); int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size); void btrfs_add_delayed_iput(struct inode *inode); void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info); +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info); int btrfs_prealloc_file_range(struct inode *inode, int mode, u64 start, u64 num_bytes, u64 min_size, loff_t actual_len, u64 *alloc_hint); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index f40f6fdc1019..238e0113f2d3 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1694,9 +1694,7 @@ static int cleaner_kthread(void *arg) goto sleep; } - mutex_lock(&fs_info->cleaner_delayed_iput_mutex); btrfs_run_delayed_iputs(fs_info); - mutex_unlock(&fs_info->cleaner_delayed_iput_mutex); again = 
btrfs_clean_one_deleted_snapshot(root); mutex_unlock(&fs_info->cleaner_mutex); @@ -2654,7 +2652,6 @@ int open_ctree(struct super_block *sb, mutex_init(&fs_info->delete_unused_bgs_mutex); mutex_init(&fs_info->reloc_mutex); mutex_init(&fs_info->delalloc_root_mutex); - mutex_init(&fs_info->cleaner_delayed_iput_mutex); seqlock_init(&fs_info->profiles_lock); INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots); @@ -2676,6 +2673,7 @@ int open_ctree(struct super_block *sb, atomic_set(&fs_info->defrag_running, 0); atomic_set(&fs_info->qgroup_op_seq, 0); atomic_set(&fs_info->reada_works_cnt, 0); + atomic_set(&fs_info->nr_delayed_iputs, 0); atomic64_set(&fs_info->tree_mod_seq, 0); fs_info->sb = sb; fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE; @@ -2753,6 +2751,7 @@ int open_ctree(struct super_block *sb, init_waitqueue_head(&fs_info->transaction_wait); init_waitqueue_head(&fs_info->transaction_blocked_wait); init_waitqueue_head(&fs_info->async_submit_wait); + init_waitqueue_head(&fs_info->delayed_iputs_wait); INIT_LIST_HEAD(&fs_info->pinned_chunks); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 0127d272cd2a..5b6c9fc227ff 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4280,10 +4280,14 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes) /* * The cleaner kthread might still be doing iput * operations. Wait for it to finish so that -* more space is released. +* more space is released. We don't need to +* explicitly run the delayed iputs here because +* the commit_transaction would have woken up +* the cleaner. */ - mutex_lock(&fs_info->cleaner_delayed_iput_mutex); - mutex_unlock(&fs_info->cleaner_delayed_iput_mutex); + ret = btrfs_wait_on_delayed_iputs(fs_info); + if (ret) + return ret; goto again; } else { btrfs_end_transaction(trans); @@ -4958,9 +4962,8 @@ static void flush_space(struct btrfs_fs_info *fs_info, * bunch of pinned space, so make sure we run the iputs before * we do our pinned bytes check below.
[PATCH 8/8] btrfs: reserve extra space during evict()
We could generate a lot of delayed refs in evict but never have any left over space from our block rsv to make up for that fact. So reserve some extra space and give it to the transaction so it can be used to refill the delayed refs rsv every loop through the truncate path. Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 623a71d871d4..8ac7abe2ae9b 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5258,13 +5258,15 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, { struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv; + u64 delayed_refs_extra = btrfs_calc_trans_metadata_size(fs_info, 1); int failures = 0; for (;;) { struct btrfs_trans_handle *trans; int ret; - ret = btrfs_block_rsv_refill(root, rsv, rsv->size, + ret = btrfs_block_rsv_refill(root, rsv, +rsv->size + delayed_refs_extra, BTRFS_RESERVE_FLUSH_LIMIT); if (ret && ++failures > 2) { @@ -5273,9 +5275,28 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, return ERR_PTR(-ENOSPC); } + /* +* Evict can generate a large amount of delayed refs without +* having a way to add space back since we exhaust our temporary +* block rsv. We aren't allowed to do FLUSH_ALL in this case +* because we could deadlock with so many things in the flushing +* code, so we have to try and hold some extra space to +* compensate for our delayed ref generation. If we can't get +* that space then we need see if we can steal our minimum from +* the global reserve. We will be ratelimited by the amount of +* space we have for the delayed refs rsv, so we'll end up +* committing and trying again. 
+*/ trans = btrfs_join_transaction(root); - if (IS_ERR(trans) || !ret) + if (IS_ERR(trans) || !ret) { + if (!IS_ERR(trans)) { + trans->block_rsv = &fs_info->trans_block_rsv; + trans->bytes_reserved = delayed_refs_extra; + btrfs_block_rsv_migrate(rsv, trans->block_rsv, + delayed_refs_extra, 1); + } return trans; + } /* * Try to steal from the global reserve if there is space for -- 2.14.3
[PATCH 1/8] btrfs: check if free bgs for commit
may_commit_transaction will skip committing the transaction if we don't have enough pinned space or if we're trying to find space for a SYSTEM chunk. However if we have pending free block groups in this transaction we still want to commit as we may be able to allocate a chunk to make our reservation. So instead of just returning ENOSPC, check if we have free block groups pending, and if so commit the transaction to allow us to use that free space. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval Reviewed-by: Nikolay Borisov --- fs/btrfs/extent-tree.c | 34 -- 1 file changed, 20 insertions(+), 14 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 07ef1b8087f7..755eb226d32d 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4853,10 +4853,19 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, if (!bytes_needed) return 0; - /* See if there is enough pinned space to make this reservation */ - if (__percpu_counter_compare(&space_info->total_bytes_pinned, - bytes_needed, - BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0) + trans = btrfs_join_transaction(fs_info->extent_root); + if (IS_ERR(trans)) + return PTR_ERR(trans); + + /* +* See if there is enough pinned space to make this reservation, or if +* we have bg's that are going to be freed, allowing us to possibly do a +* chunk allocation the next loop through. +*/ + if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags) || + __percpu_counter_compare(&space_info->total_bytes_pinned, +bytes_needed, +BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0) goto commit; /* @@ -4864,7 +4873,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, * this reservation. 
*/ if (space_info != delayed_rsv->space_info) - return -ENOSPC; + goto enospc; spin_lock(&delayed_rsv->lock); reclaim_bytes += delayed_rsv->reserved; @@ -4878,17 +4887,14 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, bytes_needed -= reclaim_bytes; if (__percpu_counter_compare(&space_info->total_bytes_pinned, - bytes_needed, - BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) { - return -ENOSPC; - } - +bytes_needed, +BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) + goto enospc; commit: - trans = btrfs_join_transaction(fs_info->extent_root); - if (IS_ERR(trans)) - return -ENOSPC; - return btrfs_commit_transaction(trans); +enospc: + btrfs_end_transaction(trans); + return -ENOSPC; } /* -- 2.14.3
[PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
With my change to no longer take into account the global reserve for metadata allocation chunks we have this side-effect for mixed block group fs'es where we are no longer allocating enough chunks for the data/metadata requirements. To deal with this add a ALLOC_CHUNK_FORCE step to the flushing state machine. This will only get used if we've already made a full loop through the flushing machinery and tried committing the transaction. If we have then we can try and force a chunk allocation since we likely need it to make progress. This resolves the issues I was seeing with the mixed bg tests in xfstests with my previous patch. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 3 ++- fs/btrfs/extent-tree.c | 18 +- include/trace/events/btrfs.h | 1 + 3 files changed, 20 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 30da075c042e..7cf6ad021d81 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2750,7 +2750,8 @@ enum btrfs_flush_state { FLUSH_DELALLOC = 5, FLUSH_DELALLOC_WAIT = 6, ALLOC_CHUNK = 7, - COMMIT_TRANS= 8, + ALLOC_CHUNK_FORCE = 8, + COMMIT_TRANS= 9, }; int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 667b992d322d..2d0dd70570ca 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4938,6 +4938,7 @@ static void flush_space(struct btrfs_fs_info *fs_info, btrfs_end_transaction(trans); break; case ALLOC_CHUNK: + case ALLOC_CHUNK_FORCE: trans = btrfs_join_transaction(root); if (IS_ERR(trans)) { ret = PTR_ERR(trans); @@ -4945,7 +4946,9 @@ static void flush_space(struct btrfs_fs_info *fs_info, } ret = do_chunk_alloc(trans, btrfs_metadata_alloc_profile(fs_info), -CHUNK_ALLOC_NO_FORCE); +(state == ALLOC_CHUNK) ? 
+CHUNK_ALLOC_NO_FORCE : +CHUNK_ALLOC_FORCE); btrfs_end_transaction(trans); if (ret > 0 || ret == -ENOSPC) ret = 0; @@ -5081,6 +5084,19 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work) commit_cycles--; } + /* +* We don't want to force a chunk allocation until we've tried +* pretty hard to reclaim space. Think of the case where we +* free'd up a bunch of space and so have a lot of pinned space +* to reclaim. We would rather use that than possibly create a +* underutilized metadata chunk. So if this is our first run +* through the flushing state machine skip ALLOC_CHUNK_FORCE and +* commit the transaction. If nothing has changed the next go +* around then we can force a chunk allocation. +*/ + if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles) + flush_state++; + if (flush_state > COMMIT_TRANS) { commit_cycles++; if (commit_cycles > 2) { diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index 63d1f9d8b8c7..dd0e6f8d6b6e 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush, { FLUSH_DELAYED_REFS_NR,"FLUSH_DELAYED_REFS_NR"}, \ { FLUSH_DELAYED_REFS, "FLUSH_ELAYED_REFS"}, \ { ALLOC_CHUNK, "ALLOC_CHUNK"}, \ + { ALLOC_CHUNK_FORCE,"ALLOC_CHUNK_FORCE"}, \ { COMMIT_TRANS, "COMMIT_TRANS"}) TRACE_EVENT(btrfs_flush_space, -- 2.14.3
[PATCH 0/8][V2] Enospc cleanups and fixes
v1->v2: - addressed comments from reviewers. - fixed a bug in patch 6 that was introduced because of changes to upstream. -- Original message -- The delayed refs rsv patches exposed a bunch of issues in our enospc infrastructure that needed to be addressed. These aren't really one coherent group, but they are all around flushing and reservations. may_commit_transaction() needed to be updated a little bit, and we needed to add a new state to force chunk allocation if things got dicey. Also because we can end up needing to reserve a whole bunch of extra space for outstanding delayed refs we needed to add the ability to fail only the ENOSPC tickets that were too big to satisfy, instead of failing all of the tickets. There's also a fix in here for one of the corner cases where we didn't quite have enough space reserved for the delayed refs we were generating during evict(). Thanks, Josef
[PATCH 2/8] btrfs: dump block_rsv when dumping space info
For enospc_debug having the block rsvs is super helpful to see if we've done something wrong. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval Reviewed-by: David Sterba --- fs/btrfs/extent-tree.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 755eb226d32d..204b35434056 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -8063,6 +8063,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, return ret; } +#define DUMP_BLOCK_RSV(fs_info, rsv_name) \ +do { \ + struct btrfs_block_rsv *__rsv = &(fs_info)->rsv_name; \ + spin_lock(&__rsv->lock);\ + btrfs_info(fs_info, #rsv_name ": size %llu reserved %llu", \ + __rsv->size, __rsv->reserved); \ + spin_unlock(&__rsv->lock); \ +} while (0) + static void dump_space_info(struct btrfs_fs_info *fs_info, struct btrfs_space_info *info, u64 bytes, int dump_block_groups) @@ -8082,6 +8091,12 @@ static void dump_space_info(struct btrfs_fs_info *fs_info, info->bytes_readonly); spin_unlock(&info->lock); + DUMP_BLOCK_RSV(fs_info, global_block_rsv); + DUMP_BLOCK_RSV(fs_info, trans_block_rsv); + DUMP_BLOCK_RSV(fs_info, chunk_block_rsv); + DUMP_BLOCK_RSV(fs_info, delayed_block_rsv); + DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv); + if (!dump_block_groups) return; -- 2.14.3
[PATCH 6/8] btrfs: loop in inode_rsv_refill
With severe fragmentation we can end up with our inode rsv size being huge during writeout, which would cause us to need to make very large metadata reservations. However we may not actually need that much once writeout is complete. So instead try to make our reservation, and if we couldn't make it, re-calculate our new reservation size and try again. If our reservation size doesn't change between tries then we know we are actually out of space and can error out. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 58 +- 1 file changed, 43 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 0ee77a98f867..0e1a499035ac 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5787,6 +5787,21 @@ int btrfs_block_rsv_refill(struct btrfs_root *root, return ret; } +static inline void __get_refill_bytes(struct btrfs_block_rsv *block_rsv, + u64 *metadata_bytes, u64 *qgroup_bytes) +{ + *metadata_bytes = 0; + *qgroup_bytes = 0; + + spin_lock(&block_rsv->lock); + if (block_rsv->reserved < block_rsv->size) + *metadata_bytes = block_rsv->size - block_rsv->reserved; + if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size) + *qgroup_bytes = block_rsv->qgroup_rsv_size - + block_rsv->qgroup_rsv_reserved; + spin_unlock(&block_rsv->lock); +} + /** * btrfs_inode_rsv_refill - refill the inode block rsv. * @inode - the inode we are refilling.
@@ -5802,25 +5817,39 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode, { struct btrfs_root *root = inode->root; struct btrfs_block_rsv *block_rsv = &inode->block_rsv; - u64 num_bytes = 0; + u64 num_bytes = 0, last = 0; u64 qgroup_num_bytes = 0; int ret = -ENOSPC; - spin_lock(&block_rsv->lock); - if (block_rsv->reserved < block_rsv->size) - num_bytes = block_rsv->size - block_rsv->reserved; - if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size) - qgroup_num_bytes = block_rsv->qgroup_rsv_size - - block_rsv->qgroup_rsv_reserved; - spin_unlock(&block_rsv->lock); - + __get_refill_bytes(block_rsv, &num_bytes, &qgroup_num_bytes); if (num_bytes == 0) return 0; - ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true); - if (ret) - return ret; - ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); + do { + ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true); + if (ret) + return ret; + ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); + if (ret) { + btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes); + last = num_bytes; + /* +* If we are fragmented we can end up with a lot of +* outstanding extents which will make our size be much +* larger than our reserved amount. If we happen to +* try to do a reservation here that may result in us +* trying to do a pretty hefty reservation, which we may +* not need once delalloc flushing happens. If this is +* the case try and do the reserve again. 
+*/ + if (flush == BTRFS_RESERVE_FLUSH_ALL) + __get_refill_bytes(block_rsv, &num_bytes, + &qgroup_num_bytes); + if (num_bytes == 0) + return 0; + } + } while (ret && last != num_bytes); + if (!ret) { block_rsv_add_bytes(block_rsv, num_bytes, false); trace_btrfs_space_reservation(root->fs_info, "delalloc", @@ -5830,8 +5859,7 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode, spin_lock(&block_rsv->lock); block_rsv->qgroup_rsv_reserved += qgroup_num_bytes; spin_unlock(&block_rsv->lock); - } else - btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes); + } return ret; } -- 2.14.3
[PATCH 3/8] btrfs: don't use global rsv for chunk allocation
The should_alloc_chunk code has math in it to decide if we're getting short on space and if we should go ahead and pre-emptively allocate a new chunk. Previously when we did not have the delayed_refs_rsv, we had to assume that the global block rsv was essentially used space and could be allocated completely at any time, so we counted this space as "used" when determining if we had enough slack space in our current space_info. But on any slightly used file system (10gib or more) you can have a global reserve of 512mib. With our default chunk size being 1gib that means we just assume half of the block group is used, which could result in us allocating more metadata chunks than is actually required. With the delayed refs rsv we can flush delayed refs as the space becomes tight, and if we actually need more block groups then they will be allocated based on space pressure. We no longer require assuming the global reserve is used space in our calculations. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 9 - 1 file changed, 9 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 204b35434056..667b992d322d 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4398,21 +4398,12 @@ static inline u64 calc_global_rsv_need_space(struct btrfs_block_rsv *global) static int should_alloc_chunk(struct btrfs_fs_info *fs_info, struct btrfs_space_info *sinfo, int force) { - struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv; u64 bytes_used = btrfs_space_info_used(sinfo, false); u64 thresh; if (force == CHUNK_ALLOC_FORCE) return 1; - /* -* We need to take into account the global rsv because for all intents -* and purposes it's used space. Don't worry about locking the -* global_rsv, it doesn't change except when the transaction commits. 
-	 */
-	if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA)
-		bytes_used += calc_global_rsv_need_space(global_rsv);
-
 	/*
 	 * in limited mode, we want to have some free space up to
 	 * about 1% of the FS size.
-- 
2.14.3
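Outside the kernel, the effect of this patch can be sketched with a toy model (hypothetical names, and an assumed 80% threshold; the kernel derives `thresh` differently): only genuinely used bytes now count toward the allocation decision, with no global-reserve padding added on top.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CHUNK_ALLOC_FORCE 2

/*
 * Simplified model of should_alloc_chunk() after the patch: allocate a
 * new chunk when real used space crosses a threshold fraction of the
 * space_info's total. The global reserve is no longer folded into
 * bytes_used, so a mostly-empty chunk no longer looks half full.
 */
static bool should_alloc(uint64_t total, uint64_t bytes_used, int force)
{
	if (force == CHUNK_ALLOC_FORCE)
		return true;

	/* illustrative threshold: 80% of currently allocated chunks used */
	uint64_t thresh = total * 8 / 10;

	return bytes_used >= thresh;
}
```

With the old behavior a 512MiB reserve counted against a 1GiB chunk would push an otherwise half-empty space_info over this kind of threshold; dropping it avoids the premature allocation.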
[PATCH 7/8] btrfs: be more explicit about allowed flush states
For FLUSH_LIMIT flushers (think evict, truncate) we can deadlock when running delalloc because we may be holding a tree lock. We can also deadlock with delayed refs rsv's that are running via the committing mechanism. The only safe operations for FLUSH_LIMIT are to run the delayed operations and to allocate chunks; everything else has the potential to deadlock. Future-proof this by explicitly specifying the states that FLUSH_LIMIT is allowed to use. This will keep us from introducing bugs later on when adding new flush states.

Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0e1a499035ac..ab9d915d9289 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5123,12 +5123,18 @@ void btrfs_init_async_reclaim_work(struct work_struct *work)
 	INIT_WORK(work, btrfs_async_reclaim_metadata_space);
 }
 
+static const enum btrfs_flush_state priority_flush_states[] = {
+	FLUSH_DELAYED_ITEMS_NR,
+	FLUSH_DELAYED_ITEMS,
+	ALLOC_CHUNK,
+};
+
 static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
 					    struct btrfs_space_info *space_info,
 					    struct reserve_ticket *ticket)
 {
 	u64 to_reclaim;
-	int flush_state = FLUSH_DELAYED_ITEMS_NR;
+	int flush_state = 0;
 
 	spin_lock(&space_info->lock);
 	to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info,
@@ -5140,7 +5146,8 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
 	spin_unlock(&space_info->lock);
 
 	do {
-		flush_space(fs_info, space_info, to_reclaim, flush_state);
+		flush_space(fs_info, space_info, to_reclaim,
+			    priority_flush_states[flush_state]);
 		flush_state++;
 		spin_lock(&space_info->lock);
 		if (ticket->bytes == 0) {
@@ -5148,15 +5155,7 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
 			return;
 		}
 		spin_unlock(&space_info->lock);
-
-		/*
-		 * Priority flushers can't wait on delalloc without
-		 * deadlocking.
-		 */
-		if (flush_state == FLUSH_DELALLOC ||
-		    flush_state == FLUSH_DELALLOC_WAIT)
-			flush_state = ALLOC_CHUNK;
-	} while (flush_state < COMMIT_TRANS);
+	} while (flush_state < ARRAY_SIZE(priority_flush_states));
 }
 
 static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
-- 
2.14.3
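A minimal standalone sketch of the whitelist idea (the enum values here are invented stand-ins, not the kernel's definitions): the allowed states live in one array, and the loop bound is the array size instead of a sentinel state with ad-hoc skip logic.

```c
#include <assert.h>
#include <stddef.h>

/* stand-in flush states, mirroring the shape of btrfs_flush_state */
enum flush_state {
	FLUSH_DELAYED_ITEMS_NR,
	FLUSH_DELAYED_ITEMS,
	FLUSH_DELALLOC,		/* unsafe for priority flushers */
	FLUSH_DELALLOC_WAIT,	/* unsafe for priority flushers */
	ALLOC_CHUNK,
	COMMIT_TRANS,		/* unsafe for priority flushers */
};

/*
 * Explicit whitelist: a priority (FLUSH_LIMIT) flusher only ever walks
 * these entries, so adding a new state elsewhere cannot accidentally
 * enable it here.
 */
static const enum flush_state priority_flush_states[] = {
	FLUSH_DELAYED_ITEMS_NR,
	FLUSH_DELAYED_ITEMS,
	ALLOC_CHUNK,
};

static size_t priority_state_count(void)
{
	return sizeof(priority_flush_states) /
	       sizeof(priority_flush_states[0]);
}
```

The loop condition then becomes `flush_state < priority_state_count()`, matching the patch's `ARRAY_SIZE(priority_flush_states)`.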
[PATCH 5/8] btrfs: don't enospc all tickets on flush failure
With the introduction of the per-inode block_rsv it became possible to have really large reservation requests made because of data fragmentation. Since the ticket code assumed we'd always have relatively small reservation requests, it just killed all tickets if we were unable to satisfy the current request. However this is generally no longer the case. So fix this logic to instead check whether there was a ticket that we were able to give some reservation to, and if so, continue the flushing loop. Likewise, make the tickets use the space_info_add_old_bytes() method of returning whatever reservation they did receive, in hopes that it can satisfy reservations down the line.

Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c | 45 +
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2d0dd70570ca..0ee77a98f867 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4801,6 +4801,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
 }
 
 struct reserve_ticket {
+	u64 orig_bytes;
 	u64 bytes;
 	int error;
 	struct list_head list;
@@ -5023,7 +5024,7 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
 		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
 }
 
-static void wake_all_tickets(struct list_head *head)
+static bool wake_all_tickets(struct list_head *head)
 {
 	struct reserve_ticket *ticket;
 
@@ -5032,7 +5033,10 @@ static void wake_all_tickets(struct list_head *head)
 		list_del_init(&ticket->list);
 		ticket->error = -ENOSPC;
 		wake_up(&ticket->wait);
+		if (ticket->bytes != ticket->orig_bytes)
+			return true;
 	}
+	return false;
 }
 
 /*
@@ -5100,8 +5104,12 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 		if (flush_state > COMMIT_TRANS) {
 			commit_cycles++;
 			if (commit_cycles > 2) {
-				wake_all_tickets(&space_info->tickets);
-				space_info->flush = 0;
+				if (wake_all_tickets(&space_info->tickets)) {
+					flush_state =
FLUSH_DELAYED_ITEMS_NR; + commit_cycles--; + } else { + space_info->flush = 0; + } } else { flush_state = FLUSH_DELAYED_ITEMS_NR; } @@ -5153,10 +5161,11 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, struct btrfs_space_info *space_info, - struct reserve_ticket *ticket, u64 orig_bytes) + struct reserve_ticket *ticket) { DEFINE_WAIT(wait); + u64 reclaim_bytes = 0; int ret = 0; spin_lock(&space_info->lock); @@ -5177,14 +5186,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, ret = ticket->error; if (!list_empty(&ticket->list)) list_del_init(&ticket->list); - if (ticket->bytes && ticket->bytes < orig_bytes) { - u64 num_bytes = orig_bytes - ticket->bytes; - update_bytes_may_use(space_info, -num_bytes); - trace_btrfs_space_reservation(fs_info, "space_info", - space_info->flags, num_bytes, 0); - } + if (ticket->bytes && ticket->bytes < ticket->orig_bytes) + reclaim_bytes = ticket->orig_bytes - ticket->bytes; spin_unlock(&space_info->lock); + if (reclaim_bytes) + space_info_add_old_bytes(fs_info, space_info, reclaim_bytes); return ret; } @@ -5210,6 +5217,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, { struct reserve_ticket ticket; u64 used; + u64 reclaim_bytes = 0; int ret = 0; ASSERT(orig_bytes); @@ -5245,6 +5253,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, * the list and we will do our own flushing further down. */ if (ret && flush != BTRFS_RESERVE_NO_FLUSH) { + ticket.orig_bytes = orig_bytes; ticket.bytes = orig_bytes; ticket.error = 0; init_waitqueue_head(&ticket.wait); @@ -5285,25 +5294,21 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, return ret; if (flush == BTRFS_RESERVE_FLUSH_ALL) - return wait_reserve_ticket(fs_info, space_info, &ticket, -
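The new wake_all_tickets() return value can be modeled in a few lines of plain C (the `ticket` struct and helper here are hypothetical, not the kernel's): report whether any ticket saw partial progress, which tells the flusher to loop again rather than failing every waiter with -ENOSPC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* toy ticket: what was originally asked for vs. what is still missing */
struct ticket {
	uint64_t orig_bytes;
	uint64_t bytes;
};

/*
 * Model of the bool returned by the patched wake_all_tickets(): true if
 * any ticket received a partial reservation (bytes shrank from
 * orig_bytes), meaning flushing made progress and is worth retrying.
 */
static bool any_ticket_made_progress(const struct ticket *t, int n)
{
	for (int i = 0; i < n; i++)
		if (t[i].bytes != t[i].orig_bytes)
			return true;
	return false;
}

/* convenience wrapper for a two-ticket scenario */
static bool demo(uint64_t o1, uint64_t b1, uint64_t o2, uint64_t b2)
{
	struct ticket t[2] = { { o1, b1 }, { o2, b2 } };

	return any_ticket_made_progress(t, 2);
}
```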
[PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs
With the delayed_refs_rsv we can now know exactly how much pending delayed refs space we need. This means we can drastically simplify btrfs_check_space_for_delayed_refs by checking how much space we have reserved for the global rsv (which acts as a spill-over buffer) and the delayed refs rsv. If our total size is beyond that amount then we know it's time to commit the transaction and stop generating any more delayed refs.

Signed-off-by: Josef Bacik
---
 fs/btrfs/ctree.h | 2 +-
 fs/btrfs/extent-tree.c | 48 ++--
 fs/btrfs/inode.c | 4 ++--
 fs/btrfs/transaction.c | 2 +-
 4 files changed, 22 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2eba398c722b..30da075c042e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2631,7 +2631,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct btrfs_fs_info *fs_info,
 }
 
 int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans);
-int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans);
+bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
 void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
 				const u64 start);
 void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5a2d0b061f57..07ef1b8087f7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2839,40 +2839,28 @@ u64 btrfs_csum_bytes_to_leaves(struct btrfs_fs_info *fs_info, u64 csum_bytes)
 	return num_csums;
 }
 
-int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans)
+bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info)
 {
-	struct btrfs_fs_info *fs_info = trans->fs_info;
-	struct btrfs_block_rsv *global_rsv;
-	u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
-	u64 csum_bytes = trans->transaction->delayed_refs.pending_csums;
-	unsigned int num_dirty_bgs = trans->transaction->num_dirty_bgs;
-	u64
num_bytes, num_dirty_bgs_bytes;
-	int ret = 0;
+	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
+	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
+	bool ret = false;
+	u64 reserved;
 
-	num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
-	num_heads = heads_to_leaves(fs_info, num_heads);
-	if (num_heads > 1)
-		num_bytes += (num_heads - 1) * fs_info->nodesize;
-	num_bytes <<= 1;
-	num_bytes += btrfs_csum_bytes_to_leaves(fs_info, csum_bytes) *
-		fs_info->nodesize;
-	num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(fs_info,
-							     num_dirty_bgs);
-	global_rsv = &fs_info->global_block_rsv;
+	spin_lock(&global_rsv->lock);
+	reserved = global_rsv->reserved;
+	spin_unlock(&global_rsv->lock);
 
 	/*
-	 * If we can't allocate any more chunks lets make sure we have _lots_ of
-	 * wiggle room since running delayed refs can create more delayed refs.
+	 * Since the global reserve is just kind of magic we don't really want
+	 * to rely on it to save our bacon, so if our size is more than the
+	 * delayed_refs_rsv and the global rsv then it's time to think about
+	 * bailing.
 	 */
-	if (global_rsv->space_info->full) {
-		num_dirty_bgs_bytes <<= 1;
-		num_bytes <<= 1;
-	}
-
-	spin_lock(&global_rsv->lock);
-	if (global_rsv->reserved <= num_bytes + num_dirty_bgs_bytes)
-		ret = 1;
-	spin_unlock(&global_rsv->lock);
+	spin_lock(&delayed_refs_rsv->lock);
+	reserved += delayed_refs_rsv->reserved;
+	if (delayed_refs_rsv->size >= reserved)
+		ret = true;
+	spin_unlock(&delayed_refs_rsv->lock);
 
 	return ret;
 }
 
@@ -2891,7 +2879,7 @@ int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans)
 	if (val >= NSEC_PER_SEC / 2)
 		return 2;
 
-	return btrfs_check_space_for_delayed_refs(trans);
+	return btrfs_check_space_for_delayed_refs(trans->fs_info);
 }
 
 struct async_delayed_refs {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a097f5fde31d..8532a2eb56d1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5326,8 +5326,8 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root,
 		 * Try to steal from the global reserve if there is space for
 		 * it.
 		 */
-		if (!btrfs_check_space_for_delayed_refs(trans) &&
-		    !btrfs_block_rsv_migrate(global_rsv, rsv, rsv->size, false))
+		if (!btrfs_check_space_for_delayed_refs(fs_info) &&
+
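Stripped of the locking, the new check is pure arithmetic. This standalone sketch (invented function name, illustrative byte values) shows the throttle condition: stop generating delayed refs once the rsv's required size outgrows what is actually reserved in it plus the global-reserve spill-over buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the simplified btrfs_check_space_for_delayed_refs(): we know
 * exactly how much space outstanding delayed refs need
 * (delayed_refs_size), so throttle when that need meets or exceeds the
 * space actually reserved for them plus the global reserve buffer.
 */
static bool over_delayed_ref_budget(uint64_t global_reserved,
				    uint64_t delayed_refs_reserved,
				    uint64_t delayed_refs_size)
{
	return delayed_refs_size >= global_reserved + delayed_refs_reserved;
}
```

Compare with the removed code above: instead of estimating from head counts, csum bytes, and dirty block groups, the answer now falls out of the rsv bookkeeping directly.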
[PATCH 10/10] btrfs: fix truncate throttling
We have a bunch of magic to make sure we're throttling delayed refs when truncating a file. Now that we have a delayed refs rsv and a mechanism for refilling that reserve, simply use that instead of all of this magic.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/inode.c | 79
 1 file changed, 17 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8532a2eb56d1..623a71d871d4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4437,31 +4437,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry)
 	return err;
 }
 
-static int truncate_space_check(struct btrfs_trans_handle *trans,
-				struct btrfs_root *root,
-				u64 bytes_deleted)
-{
-	struct btrfs_fs_info *fs_info = root->fs_info;
-	int ret;
-
-	/*
-	 * This is only used to apply pressure to the enospc system, we don't
-	 * intend to use this reservation at all.
-	 */
-	bytes_deleted = btrfs_csum_bytes_to_leaves(fs_info, bytes_deleted);
-	bytes_deleted *= fs_info->nodesize;
-	ret = btrfs_block_rsv_add(root, &fs_info->trans_block_rsv,
-				  bytes_deleted, BTRFS_RESERVE_NO_FLUSH);
-	if (!ret) {
-		trace_btrfs_space_reservation(fs_info, "transaction",
-					      trans->transid,
-					      bytes_deleted, 1);
-		trans->bytes_reserved += bytes_deleted;
-	}
-	return ret;
-
-}
-
 /*
  * Return this if we need to call truncate_block for the last bit of the
  * truncate.
@@ -4506,7 +4481,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, u64 bytes_deleted = 0; bool be_nice = false; bool should_throttle = false; - bool should_end = false; BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY); @@ -4719,15 +4693,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, btrfs_abort_transaction(trans, ret); break; } - if (btrfs_should_throttle_delayed_refs(trans)) - btrfs_async_run_delayed_refs(fs_info, - trans->delayed_ref_updates * 2, - trans->transid, 0); if (be_nice) { - if (truncate_space_check(trans, root, -extent_num_bytes)) { - should_end = true; - } if (btrfs_should_throttle_delayed_refs(trans)) should_throttle = true; } @@ -4738,7 +4704,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, if (path->slots[0] == 0 || path->slots[0] != pending_del_slot || - should_throttle || should_end) { + should_throttle) { if (pending_del_nr) { ret = btrfs_del_items(trans, root, path, pending_del_slot, @@ -4750,23 +4716,24 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, pending_del_nr = 0; } btrfs_release_path(path); - if (should_throttle) { - unsigned long updates = trans->delayed_ref_updates; - if (updates) { - trans->delayed_ref_updates = 0; - ret = btrfs_run_delayed_refs(trans, - updates * 2); - if (ret) - break; - } - } + /* -* if we failed to refill our space rsv, bail out -* and let the transaction restart +* We can generate a lot of delayed refs, so we need to +* throttle every once and a while and make sure we're +* adding enough space to keep up with the work we are +* generating. Since we hold a transaction here we +* can't flush, and we don't want to FLUSH_LIMIT because +* we could have generated too many delayed refs to +* actually allocate, so just bail if we're short and +* let the normal reservation dance happen
[PATCH 09/10] btrfs: don't run delayed refs in the end transaction logic
Over the years we have built up a lot of infrastructure to keep delayed refs in check, mostly by running them at btrfs_end_transaction() time. We have a lot of different math to figure out how many to run, whether we should do it inline or async, etc. This existed because we had no feedback mechanism to force the flushing of delayed refs when they became a problem. Now that the enospc flushing infrastructure is in place to flush delayed refs when they put too much pressure on the enospc system, this problem is solved. Rip out all of this code as it is no longer needed.

Signed-off-by: Josef Bacik
---
 fs/btrfs/transaction.c | 38 --
 1 file changed, 38 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 2d8401bf8df9..01f39401619a 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -798,22 +798,12 @@ static int should_end_transaction(struct btrfs_trans_handle *trans)
 int btrfs_should_end_transaction(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_transaction *cur_trans = trans->transaction;
-	int updates;
-	int err;
 
 	smp_mb();
 	if (cur_trans->state >= TRANS_STATE_BLOCKED ||
 	    cur_trans->delayed_refs.flushing)
 		return 1;
 
-	updates = trans->delayed_ref_updates;
-	trans->delayed_ref_updates = 0;
-	if (updates) {
-		err = btrfs_run_delayed_refs(trans, updates * 2);
-		if (err) /* Error code will also eval true */
-			return err;
-	}
-
 	return should_end_transaction(trans);
 }
 
@@ -843,11 +833,8 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_fs_info *info = trans->fs_info;
 	struct btrfs_transaction *cur_trans = trans->transaction;
-	u64 transid = trans->transid;
-	unsigned long cur = trans->delayed_ref_updates;
 	int lock = (trans->type != TRANS_JOIN_NOLOCK);
 	int err = 0;
-	int must_run_delayed_refs = 0;
 
 	if (refcount_read(&trans->use_count) > 1) {
 		refcount_dec(&trans->use_count);
@@ -858,27 +845,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 	btrfs_trans_release_metadata(trans);
 	trans->block_rsv = NULL;
 
-	if (!list_empty(&trans->new_bgs))
-		btrfs_create_pending_block_groups(trans);
-
-	trans->delayed_ref_updates = 0;
-	if (!trans->sync) {
-		must_run_delayed_refs =
-			btrfs_should_throttle_delayed_refs(trans);
-		cur = max_t(unsigned long, cur, 32);
-
-		/*
-		 * don't make the caller wait if they are from a NOLOCK
-		 * or ATTACH transaction, it will deadlock with commit
-		 */
-		if (must_run_delayed_refs == 1 &&
-		    (trans->type & (__TRANS_JOIN_NOLOCK | __TRANS_ATTACH)))
-			must_run_delayed_refs = 2;
-	}
-
-	btrfs_trans_release_metadata(trans);
-	trans->block_rsv = NULL;
-
 	if (!list_empty(&trans->new_bgs))
 		btrfs_create_pending_block_groups(trans);
 
@@ -923,10 +889,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 	}
 
 	kmem_cache_free(btrfs_trans_handle_cachep, trans);
-	if (must_run_delayed_refs) {
-		btrfs_async_run_delayed_refs(info, cur, transid,
-					     must_run_delayed_refs == 1);
-	}
 	return err;
 }
-- 
2.14.3
[PATCH 02/10] btrfs: add cleanup_ref_head_accounting helper
From: Josef Bacik We were missing some quota cleanups in check_ref_cleanup, so break the ref head accounting cleanup into a helper and call that from both check_ref_cleanup and cleanup_ref_head. This will hopefully ensure that we don't screw up accounting in the future for other things that we add. Reviewed-by: Omar Sandoval Reviewed-by: Liu Bo Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 67 +- 1 file changed, 39 insertions(+), 28 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index c36b3a42f2bb..e3ed3507018d 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans, return ret ? ret : 1; } +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, + struct btrfs_delayed_ref_head *head) +{ + struct btrfs_fs_info *fs_info = trans->fs_info; + struct btrfs_delayed_ref_root *delayed_refs = + &trans->transaction->delayed_refs; + + if (head->total_ref_mod < 0) { + struct btrfs_space_info *space_info; + u64 flags; + + if (head->is_data) + flags = BTRFS_BLOCK_GROUP_DATA; + else if (head->is_system) + flags = BTRFS_BLOCK_GROUP_SYSTEM; + else + flags = BTRFS_BLOCK_GROUP_METADATA; + space_info = __find_space_info(fs_info, flags); + ASSERT(space_info); + percpu_counter_add_batch(&space_info->total_bytes_pinned, + -head->num_bytes, + BTRFS_TOTAL_BYTES_PINNED_BATCH); + + if (head->is_data) { + spin_lock(&delayed_refs->lock); + delayed_refs->pending_csums -= head->num_bytes; + spin_unlock(&delayed_refs->lock); + } + } + + /* Also free its reserved qgroup space */ + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, + head->qgroup_reserved); +} + static int cleanup_ref_head(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head *head) { @@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, spin_unlock(&head->lock); spin_unlock(&delayed_refs->lock); - trace_run_delayed_ref_head(fs_info, 
head, 0); - - if (head->total_ref_mod < 0) { - struct btrfs_space_info *space_info; - u64 flags; - - if (head->is_data) - flags = BTRFS_BLOCK_GROUP_DATA; - else if (head->is_system) - flags = BTRFS_BLOCK_GROUP_SYSTEM; - else - flags = BTRFS_BLOCK_GROUP_METADATA; - space_info = __find_space_info(fs_info, flags); - ASSERT(space_info); - percpu_counter_add_batch(&space_info->total_bytes_pinned, - -head->num_bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH); - - if (head->is_data) { - spin_lock(&delayed_refs->lock); - delayed_refs->pending_csums -= head->num_bytes; - spin_unlock(&delayed_refs->lock); - } - } - if (head->must_insert_reserved) { btrfs_pin_extent(fs_info, head->bytenr, head->num_bytes, 1); @@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, } } - /* Also free its reserved qgroup space */ - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, - head->qgroup_reserved); + cleanup_ref_head_accounting(trans, head); + + trace_run_delayed_ref_head(fs_info, head, 0); btrfs_delayed_ref_unlock(head); btrfs_put_delayed_ref_head(head); return 0; @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (head->must_insert_reserved) ret = 1; + cleanup_ref_head_accounting(trans, head); mutex_unlock(&head->mutex); btrfs_put_delayed_ref_head(head); return ret; -- 2.14.3
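The value of this refactor is that both cleanup paths now share one accounting routine, so a future fix lands in both automatically. A toy model of that shape (hypothetical `counters` struct, not the kernel's space_info/percpu machinery):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* toy stand-ins for total_bytes_pinned and pending_csums */
struct counters {
	int64_t pinned;
	int64_t pending_csums;
};

/*
 * Shared helper in the spirit of cleanup_ref_head_accounting(): when a
 * head's net effect was to drop references (total_ref_mod < 0), back
 * out the pinned bytes, and the pending csum bytes for data extents.
 * Both cleanup_ref_head() and check_ref_cleanup() would call this.
 */
static void ref_head_accounting(struct counters *c, int64_t num_bytes,
				int total_ref_mod, bool is_data)
{
	if (total_ref_mod < 0) {
		c->pinned -= num_bytes;
		if (is_data)
			c->pending_csums -= num_bytes;
	}
}

/* helper: pinned counter after cleaning up one metadata head */
static int64_t demo_pinned_after(int64_t start, int64_t nbytes, int mod)
{
	struct counters c = { start, 0 };

	ref_head_accounting(&c, nbytes, mod, false);
	return c.pinned;
}
```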
[PATCH 00/10][V2] Delayed refs rsv
v1->v2:
- addressed the comments from the various reviewers.
- split "introduce delayed_refs_rsv" into 5 patches. The patches are the same together as they were, just split out more logically. They can't really be bisected across in that you will likely have fun enospc failures, but they compile properly. This was done to make review easier.

-- Original message --

This patchset changes how we do space reservations for delayed refs. We were hitting probably 20-40 enospc aborts per day in production while running delayed refs at transaction commit time. This means we ran out of space in the global reserve and couldn't easily get more space in use_block_rsv().

The global reserve has grown to cover everything we don't reserve space explicitly for, and we've grown a lot of weird ad-hoc heuristics to know if we're running short on space and when it's time to force a commit. A failure rate of 20-40 file systems when we run hundreds of thousands of them isn't super high, but cleaning up this code will make things less ugly and more predictable.

Thus the delayed refs rsv. We always know how many delayed refs we have outstanding, and although running them generates more, we can use the global reserve for that spill-over, which fits its desired use better than a full-blown reservation. This first approach simply takes how many times we're reserving space and multiplies that by 2, in order to save enough space for the delayed refs that could be generated. This is a naive approach and will probably evolve, but for now it works.

With this patchset we've gone down to 2-8 failures per week. It's not perfect, there are some corner cases that still need to be addressed, but it is significantly better than what we had.

Thanks,

Josef
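The "multiply by 2" sizing heuristic described in the cover letter can be sketched as follows (hypothetical helper and byte counts; in the kernel the per-operation size comes from helpers like btrfs_calc_trans_metadata_size(), not a flat constant):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Naive first-pass sizing from the cover letter: for every delayed ref
 * we are reserving space for, reserve twice the per-operation metadata
 * cost, since running one delayed ref can generate another.
 */
static uint64_t delayed_refs_rsv_size(uint64_t nr_refs, uint64_t op_bytes)
{
	return nr_refs * 2 * op_bytes;
}
```

As the cover letter says, this over-reserves on purpose; the global reserve then only has to absorb the spill-over rather than the whole workload.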
[PATCH 06/10] btrfs: update may_commit_transaction to use the delayed refs rsv
Any space used in the delayed_refs_rsv will be freed up by a transaction commit, so instead of just counting the pinned space we also need to account for any space in the delayed_refs_rsv when deciding whether it will make a difference to commit the transaction to satisfy our space reservation. If we have enough bytes to satisfy our reservation ticket then we are good to go; otherwise subtract out what space we would gain back by committing the transaction and compare that against the pinned space to make our decision.

Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index aa0a638d0263..63ff9d832867 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4843,8 +4843,10 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 {
 	struct reserve_ticket *ticket = NULL;
 	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_block_rsv;
+	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
 	struct btrfs_trans_handle *trans;
-	u64 bytes;
+	u64 bytes_needed;
+	u64 reclaim_bytes = 0;
 
 	trans = (struct btrfs_trans_handle *)current->journal_info;
 	if (trans)
@@ -4857,15 +4859,15 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	else if (!list_empty(&space_info->tickets))
 		ticket = list_first_entry(&space_info->tickets,
 					  struct reserve_ticket, list);
-	bytes = (ticket) ? ticket->bytes : 0;
+	bytes_needed = (ticket) ?
ticket->bytes : 0; spin_unlock(&space_info->lock); - if (!bytes) + if (!bytes_needed) return 0; /* See if there is enough pinned space to make this reservation */ if (__percpu_counter_compare(&space_info->total_bytes_pinned, - bytes, + bytes_needed, BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0) goto commit; @@ -4877,14 +4879,18 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, return -ENOSPC; spin_lock(&delayed_rsv->lock); - if (delayed_rsv->size > bytes) - bytes = 0; - else - bytes -= delayed_rsv->size; + reclaim_bytes += delayed_rsv->reserved; spin_unlock(&delayed_rsv->lock); + spin_lock(&delayed_refs_rsv->lock); + reclaim_bytes += delayed_refs_rsv->reserved; + spin_unlock(&delayed_refs_rsv->lock); + if (reclaim_bytes >= bytes_needed) + goto commit; + bytes_needed -= reclaim_bytes; + if (__percpu_counter_compare(&space_info->total_bytes_pinned, - bytes, + bytes_needed, BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) { return -ENOSPC; } -- 2.14.3
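The updated decision logic reduces to a standalone predicate (invented name and illustrative numbers; the kernel additionally checks a percpu pinned counter under batching):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the patched may_commit_transaction() decision: commit if
 * pinned space alone covers the ticket, or if the space a commit would
 * free (delayed rsv + delayed refs rsv) covers it, or if pinned space
 * covers whatever remains after subtracting that reclaimable space.
 */
static bool worth_committing(uint64_t bytes_needed, uint64_t pinned,
			     uint64_t delayed_rsv_reserved,
			     uint64_t delayed_refs_rsv_reserved)
{
	uint64_t reclaim;

	if (bytes_needed == 0)		/* nobody is waiting */
		return false;
	if (pinned >= bytes_needed)
		return true;

	reclaim = delayed_rsv_reserved + delayed_refs_rsv_reserved;
	if (reclaim >= bytes_needed)
		return true;

	return pinned >= bytes_needed - reclaim;
}
```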
[PATCH 05/10] btrfs: introduce delayed_refs_rsv
From: Josef Bacik

Traditionally we've had voodoo in btrfs to account for the space that delayed refs may take up by having a global_block_rsv. This works most of the time, except when it doesn't. We've had issues reported and seen in production where sometimes the global reserve is exhausted during transaction commit before we can run all of our delayed refs, resulting in an aborted transaction. Because of this voodoo we have equally dubious flushing semantics around throttling delayed refs, which we often get wrong.

So instead give them their own block_rsv. This way we can always know exactly how much outstanding space we need for delayed refs. This allows us to make sure we are constantly filling that reservation up with space, and allows us to put more precise pressure on the enospc system. Instead of doing math to see if it's a good time to throttle, the normal enospc code will be invoked if we have a lot of delayed refs pending, and they will be run via the normal flushing mechanism.

For now the delayed_refs_rsv will hold the reservations for the delayed refs, the block group updates, and deleting csums. We could have a separate rsv for the block group updates, but the csum deletion stuff is still handled via the delayed_refs so that will stay there.
Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 14 +++- fs/btrfs/delayed-ref.c | 43 -- fs/btrfs/disk-io.c | 4 + fs/btrfs/extent-tree.c | 212 + fs/btrfs/transaction.c | 37 - 5 files changed, 284 insertions(+), 26 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 8b41ec42f405..52a87d446945 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -448,8 +448,9 @@ struct btrfs_space_info { #defineBTRFS_BLOCK_RSV_TRANS 3 #defineBTRFS_BLOCK_RSV_CHUNK 4 #defineBTRFS_BLOCK_RSV_DELOPS 5 -#defineBTRFS_BLOCK_RSV_EMPTY 6 -#defineBTRFS_BLOCK_RSV_TEMP7 +#define BTRFS_BLOCK_RSV_DELREFS6 +#defineBTRFS_BLOCK_RSV_EMPTY 7 +#defineBTRFS_BLOCK_RSV_TEMP8 struct btrfs_block_rsv { u64 size; @@ -812,6 +813,8 @@ struct btrfs_fs_info { struct btrfs_block_rsv chunk_block_rsv; /* block reservation for delayed operations */ struct btrfs_block_rsv delayed_block_rsv; + /* block reservation for delayed refs */ + struct btrfs_block_rsv delayed_refs_rsv; struct btrfs_block_rsv empty_block_rsv; @@ -2796,6 +2799,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info, void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *block_rsv, u64 num_bytes); +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr); +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans); +int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info, + enum btrfs_reserve_flush_enum flush); +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info, + struct btrfs_block_rsv *src, + u64 num_bytes); int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache); void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache); void btrfs_put_block_group_cache(struct btrfs_fs_info *info); diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 48725fa757a3..a198ab91879c 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -474,11 +474,14 @@ static int insert_delayed_ref(struct btrfs_trans_handle 
*trans, * existing and update must have the same bytenr */ static noinline void -update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs, +update_existing_head_ref(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head *existing, struct btrfs_delayed_ref_head *update, int *old_ref_mod_ret) { + struct btrfs_delayed_ref_root *delayed_refs = + &trans->transaction->delayed_refs; + struct btrfs_fs_info *fs_info = trans->fs_info; int old_ref_mod; BUG_ON(existing->is_data != update->is_data); @@ -536,10 +539,18 @@ update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs, * versa we need to make sure to adjust pending_csums accordingly. */ if (existing->is_data) { - if (existing->total_ref_mod >= 0 && old_ref_mod < 0) + u64 csum_leaves = + btrfs_csum_bytes_to_leaves(fs_info, + existing->num_bytes); + + if (existing->total_ref_mod >= 0 && old_ref_mod < 0) { delayed_refs->pending_csums -= existing->num_bytes; -
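A minimal userspace model of the new reserve's bookkeeping (hypothetical struct and helpers; the real rsv also interacts with the space_info, qgroups, and the flushing code): `size` tracks outstanding delayed refs exactly, and running a head returns any reservation that is now excess.

```c
#include <assert.h>
#include <stdint.h>

/* toy block_rsv: how much we need (size) vs. how much we hold (reserved) */
struct block_rsv {
	uint64_t size;
	uint64_t reserved;
};

/* queueing a delayed ref grows the requirement */
static void track_delayed_ref(struct block_rsv *rsv, uint64_t bytes)
{
	rsv->size += bytes;
}

/*
 * Running a ref head shrinks the requirement; if we now hold more than
 * we need, hand the excess back (to the space_info, in the kernel).
 */
static uint64_t release_ref_head(struct block_rsv *rsv, uint64_t bytes)
{
	uint64_t freed = 0;

	rsv->size -= bytes;
	if (rsv->reserved > rsv->size) {
		freed = rsv->reserved - rsv->size;
		rsv->reserved = rsv->size;
	}
	return freed;
}

/* helper: bytes freed for a given starting state */
static uint64_t demo_release(uint64_t size, uint64_t reserved, uint64_t bytes)
{
	struct block_rsv r = { size, reserved };

	return release_ref_head(&r, bytes);
}
```

Because `size` is exact, the enospc code can refill `reserved` toward it under normal flushing pressure instead of guessing from global-reserve headroom.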
[PATCH 04/10] btrfs: only track ref_heads in delayed_ref_updates
From: Josef Bacik

We use this number to figure out how many delayed refs to run, but __btrfs_run_delayed_refs really only checks it each time we need a new delayed ref head, so we always run at least one ref head completely no matter the number of items on it. Fix the accounting to only be adjusted when we add/remove a ref head.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/delayed-ref.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index b3e4c9fcb664..48725fa757a3 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
 	ref->in_tree = 0;
 	btrfs_put_delayed_ref(ref);
 	atomic_dec(&delayed_refs->num_entries);
-	if (trans->delayed_ref_updates)
-		trans->delayed_ref_updates--;
 }
 
 static bool merge_ref(struct btrfs_trans_handle *trans,
@@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
 	if (ref->action == BTRFS_ADD_DELAYED_REF)
 		list_add_tail(&ref->add_list, &href->ref_add_list);
 	atomic_inc(&root->num_entries);
-	trans->delayed_ref_updates++;
 	spin_unlock(&href->lock);
 	return ret;
 }
-- 
2.14.3
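The accounting fix can be illustrated standalone (toy `txn` struct, invented helpers): only adding or removing a ref head touches the counter, so a head carrying many individual refs still counts once.

```c
#include <assert.h>

/* toy transaction handle with just the counter we care about */
struct txn {
	int delayed_ref_updates;
};

/* heads adjust the count... */
static void add_ref_head(struct txn *t) { t->delayed_ref_updates++; }
static void del_ref_head(struct txn *t) { t->delayed_ref_updates--; }

/* ...individual refs no longer do, matching the patch */
static void add_ref(struct txn *t) { (void)t; }

/* counter after queueing `heads` heads with `refs_per_head` refs each */
static int demo_updates(int heads, int refs_per_head)
{
	struct txn t = { 0 };

	for (int h = 0; h < heads; h++) {
		add_ref_head(&t);
		for (int r = 0; r < refs_per_head; r++)
			add_ref(&t);
	}
	return t.delayed_ref_updates;
}
```

This matches how __btrfs_run_delayed_refs actually consumes the number: it picks work head by head, so counting per-ref overstated the remaining work.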
[PATCH 03/10] btrfs: cleanup extent_op handling
From: Josef Bacik

The cleanup_extent_op function actually would run the extent_op if it
needed running, which made the name sort of a misnomer. Change it to
run_and_cleanup_extent_op, and move the actual cleanup work to
cleanup_extent_op so it can be used by check_ref_cleanup() in order to
unify the extent op handling.

Reviewed-by: Lu Fengqi
Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e3ed3507018d..9f169f3c5fdb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2424,19 +2424,32 @@ static void unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_ref
 	btrfs_delayed_ref_unlock(head);
 }
 
-static int cleanup_extent_op(struct btrfs_trans_handle *trans,
-			     struct btrfs_delayed_ref_head *head)
+static struct btrfs_delayed_extent_op *cleanup_extent_op(
+				struct btrfs_delayed_ref_head *head)
 {
 	struct btrfs_delayed_extent_op *extent_op = head->extent_op;
-	int ret;
 
 	if (!extent_op)
-		return 0;
-	head->extent_op = NULL;
+		return NULL;
+
 	if (head->must_insert_reserved) {
+		head->extent_op = NULL;
 		btrfs_free_delayed_extent_op(extent_op);
-		return 0;
+		return NULL;
 	}
+	return extent_op;
+}
+
+static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans,
+				     struct btrfs_delayed_ref_head *head)
+{
+	struct btrfs_delayed_extent_op *extent_op;
+	int ret;
+
+	extent_op = cleanup_extent_op(head);
+	if (!extent_op)
+		return 0;
+	head->extent_op = NULL;
 	spin_unlock(&head->lock);
 	ret = run_delayed_extent_op(trans, head, extent_op);
 	btrfs_free_delayed_extent_op(extent_op);
@@ -2488,7 +2501,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 
 	delayed_refs = &trans->transaction->delayed_refs;
 
-	ret = cleanup_extent_op(trans, head);
+	ret = run_and_cleanup_extent_op(trans, head);
 	if (ret < 0) {
 		unselect_delayed_ref_head(delayed_refs, head);
 		btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
@@ -6977,12 +6990,8 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
 	if (!RB_EMPTY_ROOT(&head->ref_tree.rb_root))
 		goto out;
 
-	if (head->extent_op) {
-		if (!head->must_insert_reserved)
-			goto out;
-		btrfs_free_delayed_extent_op(head->extent_op);
-		head->extent_op = NULL;
-	}
+	if (cleanup_extent_op(head) != NULL)
+		goto out;
 
 	/*
	 * waiting for the lock here would deadlock. If someone else has it
-- 
2.14.3
[PATCH 01/10] btrfs: add btrfs_delete_ref_head helper
From: Josef Bacik

We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
into a helper and cleanup the calling functions.

Signed-off-by: Josef Bacik
Reviewed-by: Omar Sandoval
---
 fs/btrfs/delayed-ref.c | 14 ++++++++++++++
 fs/btrfs/delayed-ref.h |  3 ++-
 fs/btrfs/extent-tree.c | 22 +++-------------------
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 9301b3ad9217..b3e4c9fcb664 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
 	return head;
 }
 
+void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+			   struct btrfs_delayed_ref_head *head)
+{
+	lockdep_assert_held(&delayed_refs->lock);
+	lockdep_assert_held(&head->lock);
+
+	rb_erase_cached(&head->href_node, &delayed_refs->href_root);
+	RB_CLEAR_NODE(&head->href_node);
+	atomic_dec(&delayed_refs->num_entries);
+	delayed_refs->num_heads--;
+	if (head->processing == 0)
+		delayed_refs->num_heads_ready--;
+}
+
 /*
  * Helper to insert the ref_node to the tail or merge with tail.
  *
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 8e20c5cb5404..d2af974f68a1 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head)
 {
 	mutex_unlock(&head->mutex);
 }
-
+void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+			   struct btrfs_delayed_ref_head *head);
 struct btrfs_delayed_ref_head *btrfs_select_ref_head(
 		struct btrfs_delayed_ref_root *delayed_refs);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d242a1174e50..c36b3a42f2bb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 		spin_unlock(&delayed_refs->lock);
 		return 1;
 	}
-	delayed_refs->num_heads--;
-	rb_erase_cached(&head->href_node, &delayed_refs->href_root);
-	RB_CLEAR_NODE(&head->href_node);
+	btrfs_delete_ref_head(delayed_refs, head);
 	spin_unlock(&head->lock);
 	spin_unlock(&delayed_refs->lock);
-	atomic_dec(&delayed_refs->num_entries);
 
 	trace_run_delayed_ref_head(fs_info, head, 0);
 
@@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
 	if (!mutex_trylock(&head->mutex))
 		goto out;
 
-	/*
-	 * at this point we have a head with no other entries. Go
-	 * ahead and process it.
-	 */
-	rb_erase_cached(&head->href_node, &delayed_refs->href_root);
-	RB_CLEAR_NODE(&head->href_node);
-	atomic_dec(&delayed_refs->num_entries);
-
-	/*
-	 * we don't take a ref on the node because we're removing it from the
-	 * tree, so we just steal the ref the tree was holding.
-	 */
-	delayed_refs->num_heads--;
-	if (head->processing == 0)
-		delayed_refs->num_heads_ready--;
+	btrfs_delete_ref_head(delayed_refs, head);
 	head->processing = 0;
+
 	spin_unlock(&head->lock);
 	spin_unlock(&delayed_refs->lock);
-- 
2.14.3
[PATCH 07/10] btrfs: add new flushing states for the delayed refs rsv
A nice thing we gain with the delayed refs rsv is the ability to flush
the delayed refs on demand to deal with enospc pressure. Add states to
flush delayed refs on demand, and this will allow us to remove a lot of
ad-hoc work around checking to see if we should commit the transaction
to run our delayed refs.

Signed-off-by: Josef Bacik
---
 fs/btrfs/ctree.h             | 10 ++++++----
 fs/btrfs/extent-tree.c       | 14 ++++++++++++++
 include/trace/events/btrfs.h |  2 ++
 3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 52a87d446945..2eba398c722b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2745,10 +2745,12 @@ enum btrfs_reserve_flush_enum {
 enum btrfs_flush_state {
 	FLUSH_DELAYED_ITEMS_NR	= 1,
 	FLUSH_DELAYED_ITEMS	= 2,
-	FLUSH_DELALLOC		= 3,
-	FLUSH_DELALLOC_WAIT	= 4,
-	ALLOC_CHUNK		= 5,
-	COMMIT_TRANS		= 6,
+	FLUSH_DELAYED_REFS_NR	= 3,
+	FLUSH_DELAYED_REFS	= 4,
+	FLUSH_DELALLOC		= 5,
+	FLUSH_DELALLOC_WAIT	= 6,
+	ALLOC_CHUNK		= 7,
+	COMMIT_TRANS		= 8,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 63ff9d832867..5a2d0b061f57 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4938,6 +4938,20 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		shrink_delalloc(fs_info, num_bytes * 2, num_bytes,
 				state == FLUSH_DELALLOC_WAIT);
 		break;
+	case FLUSH_DELAYED_REFS_NR:
+	case FLUSH_DELAYED_REFS:
+		trans = btrfs_join_transaction(root);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			break;
+		}
+		if (state == FLUSH_DELAYED_REFS_NR)
+			nr = calc_reclaim_items_nr(fs_info, num_bytes);
+		else
+			nr = 0;
+		btrfs_run_delayed_refs(trans, nr);
+		btrfs_end_transaction(trans);
+		break;
 	case ALLOC_CHUNK:
 		trans = btrfs_join_transaction(root);
 		if (IS_ERR(trans)) {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 8568946f491d..63d1f9d8b8c7 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1048,6 +1048,8 @@ TRACE_EVENT(btrfs_trigger_flush,
 		{ FLUSH_DELAYED_ITEMS,		"FLUSH_DELAYED_ITEMS"},		\
 		{ FLUSH_DELALLOC,		"FLUSH_DELALLOC"},		\
 		{ FLUSH_DELALLOC_WAIT,		"FLUSH_DELALLOC_WAIT"},		\
+		{ FLUSH_DELAYED_REFS_NR,	"FLUSH_DELAYED_REFS_NR"},	\
+		{ FLUSH_DELAYED_REFS,		"FLUSH_DELAYED_REFS"},		\
 		{ ALLOC_CHUNK,			"ALLOC_CHUNK"},			\
 		{ COMMIT_TRANS,			"COMMIT_TRANS"})
-- 
2.14.3

[Editor's note: the original posting spelled the second tracepoint string "FLUSH_ELAYED_REFS"; corrected above.]
Re: [RFC PATCH] btrfs: Remove __extent_readpages
On 3.12.18, 12:25, Nikolay Borisov wrote:
> When extent_readpages is called from the generic readahead code it first
> builds a batch of 16 pages (which might or might not be consecutive,
> depending on whether add_to_page_cache_lru failed) and submits them to
> __extent_readpages. The latter ensures that the range of pages (in the
> batch of 16) that is passed to __do_contiguous_readpages is consecutive.
>
> If add_to_page_cache_lru doesn't fail then __extent_readpages will call
> __do_contiguous_readpages only once with the whole batch of 16.
> Alternatively, if add_to_page_cache_lru fails once on the 8th page (as an
> example) then the contiguous page read code will be called twice.
>
> All of this can be simplified by exploiting the fact that all pages passed
> to extent_readpages are consecutive, thus when the batch is built in
> that function it is already consecutive (barring add_to_page_cache_lru
> failures), so it is ready to be submitted directly to
> __do_contiguous_readpages. Also simplify the name of the function to
> contiguous_readpages.
>
> Signed-off-by: Nikolay Borisov
> ---
>
> So this patch looks like a very nice cleanup; however, when doing performance
> measurements with fio I was shocked to see that it is actually detrimental to
> performance. Here are the results:
>
> The command line used for fio:
> fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
> --numjobs=1 --size=1G --runtime=600 --group_reporting --loop 10
>
> This was tested on a VM with 4G of RAM, so the size of the test is smaller
> than the memory and pages should have been nicely readahead.
>
> PATCHED:
>
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=519MiB/s][r=133k IOPS][eta 00m:00s]
> /media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=3722: Mon Dec 3 09:57:17 2018
>   read: IOPS=78.4k, BW=306MiB/s (321MB/s)(10.0GiB/33444msec)
>     clat (nsec): min=1703, max=9042.7k, avg=5463.97, stdev=121068.28
>     lat (usec): min=2, max=9043, avg= 6.00, stdev=121.07
>     clat percentiles (nsec):
>      |  1.00th=[   1848],  5.00th=[   1896], 10.00th=[   1912],
>      | 20.00th=[   1960], 30.00th=[   2024], 40.00th=[   2160],
>      | 50.00th=[   2384], 60.00th=[   2576], 70.00th=[   2800],
>      | 80.00th=[   3120], 90.00th=[   3824], 95.00th=[   4768],
>      | 99.00th=[   7968], 99.50th=[  14912], 99.90th=[  50944],
>      | 99.95th=[ 667648], 99.99th=[5931008]
>    bw (  KiB/s): min= 2768, max=544542, per=100.00%, avg=409912.68, stdev=162333.72, samples=50
>    iops        : min=  692, max=136135, avg=102478.08, stdev=40583.47, samples=50
>   lat (usec)   : 2=25.93%, 4=65.58%, 10=7.69%, 20=0.57%, 50=0.13%
>   lat (usec)   : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
>   lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%
>   cpu          : usr=7.20%, sys=92.55%, ctx=396, majf=0, minf=9
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>    READ: bw=306MiB/s (321MB/s), 306MiB/s-306MiB/s (321MB/s-321MB/s), io=10.0GiB (10.7GB), run=33444-33444msec
>
>
> UNPATCHED:
>
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=568MiB/s][r=145k IOPS][eta 00m:00s]
> /media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=640: Mon Dec 3 10:07:38 2018
>   read: IOPS=90.4k, BW=353MiB/s (370MB/s)(10.0GiB/29008msec)
>     clat (nsec): min=1418, max=12374k, avg=4816.38, stdev=109448.00
>     lat (nsec): min=1836, max=12374k, avg=5284.46, stdev=109451.36
>     clat percentiles (nsec):
>      |  1.00th=[   1576],  5.00th=[   1608], 10.00th=[   1640],
>      | 20.00th=[   1672], 30.00th=[   1720], 40.00th=[   1832],
>      | 50.00th=[   2096], 60.00th=[   2288], 70.00th=[   2480],
>      | 80.00th=[   2736], 90.00th=[   3248], 95.00th=[   3952],
>      | 99.00th=[   6368], 99.50th=[  12736], 99.90th=[  43776],
>      | 99.95th=[ 798720], 99.99th=[5341184]
>    bw (  KiB/s): min=34144, max=606208, per=100.00%, avg=465737.56, stdev=177637.57, samples=45
>    iops        : min= 8536, max=151552, avg=116434.33, stdev=44409.46, samples=45
>   lat (usec)   : 2=45.74%, 4=49.50%, 10=4.13%, 20=0.45%, 50=0.08%
>   lat (usec)   : 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
>   lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%, 20=0.01%
>   cpu          : usr=7.14%, sys=92.39%, ctx=1059, majf=0, minf=9
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=2621440,0,0,0
Re: Filesystem Corruption
On 2018/12/3 5:31 PM, Stefan Malte Schumacher wrote:
> Hello,
>
> I have noticed an unusual amount of crc-errors in downloaded rars,
> beginning about a week ago. But let's start with the preliminaries. I
> am using Debian Stretch.
> Kernel: Linux mars 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u4
> (2018-08-21) x86_64 GNU/Linux
> BTRFS-Tools btrfs-progs 4.7.3-1
> Smartctl shows no errors for any of the drives in the filesystem.
>
> Btrfs /dev/stats shows zero errors, but dmesg gives me a lot of
> filesystem related error messages.
>
> [5390748.884929] Buffer I/O error on dev dm-0, logical block
> 976701312, async page read
> This error is shown many times in the log.

There is no "btrfs:" prefix, so it looks more like an error message from
the block layer; no wonder btrfs shows no errors at all.

What is the underlying device mapper? And furthermore, is there any
kernel message with "btrfs" (case-insensitive) in it?

Thanks,
Qu

> This seems to affect just newly written files. This is the output of
> btrfs scrub status:
> scrub status for 1609e4e1-4037-4d31-bf12-f84a691db5d8
> scrub started at Tue Nov 27 06:02:04 2018 and finished after 07:34:16
> total bytes scrubbed: 17.29TiB with 0 errors
>
> What is the probable cause of these errors? How can I fix this?
>
> Thanks in advance for your advice
> Stefan
Re: [PATCH v2 2/3] btrfs: use offset_in_page for start_offset in map_private_extent_buffer()
On 28/11/2018 16:41, David Sterba wrote:
> On Wed, Nov 28, 2018 at 09:54:55AM +0100, Johannes Thumshirn wrote:
>> In map_private_extent_buffer() use offset_in_page() to initialize
>> 'start_offset' instead of open-coding it.
>
> Can you please fix all instances where it's opencoded? Grepping for
> 'PAGE_SIZE - 1' finds a number of them. Thanks.

Sure, will do.
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumsh...@suse.de                            +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
[RFC PATCH] btrfs: Remove __extent_readpages
When extent_readpages is called from the generic readahead code it first
builds a batch of 16 pages (which might or might not be consecutive,
depending on whether add_to_page_cache_lru failed) and submits them to
__extent_readpages. The latter ensures that the range of pages (in the
batch of 16) that is passed to __do_contiguous_readpages is consecutive.

If add_to_page_cache_lru doesn't fail then __extent_readpages will call
__do_contiguous_readpages only once with the whole batch of 16.
Alternatively, if add_to_page_cache_lru fails once on the 8th page (as an
example) then the contiguous page read code will be called twice.

All of this can be simplified by exploiting the fact that all pages passed
to extent_readpages are consecutive, thus when the batch is built in that
function it is already consecutive (barring add_to_page_cache_lru
failures), so it is ready to be submitted directly to
__do_contiguous_readpages. Also simplify the name of the function to
contiguous_readpages.

Signed-off-by: Nikolay Borisov
---

So this patch looks like a very nice cleanup; however, when doing
performance measurements with fio I was shocked to see that it is
actually detrimental to performance. Here are the results:

The command line used for fio:
fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
--numjobs=1 --size=1G --runtime=600 --group_reporting --loop 10

This was tested on a VM with 4G of RAM, so the size of the test is
smaller than the memory and pages should have been nicely readahead.
PATCHED:

Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=519MiB/s][r=133k IOPS][eta 00m:00s]
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=3722: Mon Dec 3 09:57:17 2018
  read: IOPS=78.4k, BW=306MiB/s (321MB/s)(10.0GiB/33444msec)
    clat (nsec): min=1703, max=9042.7k, avg=5463.97, stdev=121068.28
    lat (usec): min=2, max=9043, avg= 6.00, stdev=121.07
    clat percentiles (nsec):
     |  1.00th=[   1848],  5.00th=[   1896], 10.00th=[   1912],
     | 20.00th=[   1960], 30.00th=[   2024], 40.00th=[   2160],
     | 50.00th=[   2384], 60.00th=[   2576], 70.00th=[   2800],
     | 80.00th=[   3120], 90.00th=[   3824], 95.00th=[   4768],
     | 99.00th=[   7968], 99.50th=[  14912], 99.90th=[  50944],
     | 99.95th=[ 667648], 99.99th=[5931008]
   bw (  KiB/s): min= 2768, max=544542, per=100.00%, avg=409912.68, stdev=162333.72, samples=50
   iops        : min=  692, max=136135, avg=102478.08, stdev=40583.47, samples=50
  lat (usec)   : 2=25.93%, 4=65.58%, 10=7.69%, 20=0.57%, 50=0.13%
  lat (usec)   : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%
  cpu          : usr=7.20%, sys=92.55%, ctx=396, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=306MiB/s (321MB/s), 306MiB/s-306MiB/s (321MB/s-321MB/s), io=10.0GiB (10.7GB), run=33444-33444msec


UNPATCHED:

Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=568MiB/s][r=145k IOPS][eta 00m:00s]
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=640: Mon Dec 3 10:07:38 2018
  read: IOPS=90.4k, BW=353MiB/s (370MB/s)(10.0GiB/29008msec)
    clat (nsec): min=1418, max=12374k, avg=4816.38, stdev=109448.00
    lat (nsec): min=1836, max=12374k, avg=5284.46, stdev=109451.36
    clat percentiles (nsec):
     |  1.00th=[   1576],  5.00th=[   1608], 10.00th=[   1640],
     | 20.00th=[   1672], 30.00th=[   1720], 40.00th=[   1832],
     | 50.00th=[   2096], 60.00th=[   2288], 70.00th=[   2480],
     | 80.00th=[   2736], 90.00th=[   3248], 95.00th=[   3952],
     | 99.00th=[   6368], 99.50th=[  12736], 99.90th=[  43776],
     | 99.95th=[ 798720], 99.99th=[5341184]
   bw (  KiB/s): min=34144, max=606208, per=100.00%, avg=465737.56, stdev=177637.57, samples=45
   iops        : min= 8536, max=151552, avg=116434.33, stdev=44409.46, samples=45
  lat (usec)   : 2=45.74%, 4=49.50%, 10=4.13%, 20=0.45%, 50=0.08%
  lat (usec)   : 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%, 20=0.01%
  cpu          : usr=7.14%, sys=92.39%, ctx=1059, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=353MiB/s (370MB/s), 353MiB/s-353MiB/s (370MB/s-370MB/s), io=10.0GiB (10.7GB), run=29008-29008msec

Clearly both ban
Re: Possible deadlock when writing
I've been running into what I believe is the same issue ever since
upgrading to 4.19:

[28950.083040] BTRFS error (device dm-0): bad tree block start, want 1815648960512 have 0
[28950.083047] BTRFS: error (device dm-0) in __btrfs_free_extent:6804: errno=-5 IO failure
[28950.083048] BTRFS info (device dm-0): forced readonly
[28950.083050] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2935: errno=-5 IO failure
[28950.083866] BTRFS error (device dm-0): pending csums is 9564160
[29040.413973] TaskSchedulerFo[17189]: segfault at 0 ip 56121a2cb73b sp 7f1cca425b80 error 4 in chrome[561218101000+6513000]

This has been happening consistently to me on two laptops and a
workstation, all running Arch Linux. They are all different hardware; the
only things in common are that they have SSD/NVMe storage and they all
use btrfs. I initially thought it had something to do with the
fstrim.timer unit kicking off an fstrim run that was somehow causing
contention with btrfs.

As luck would have it, the btrfs filesystem on one laptop just remounted
read-only. Physical memory was not entirely used up at the time (I would
guess ~45% of physical was in use), and I believe the rest of available
memory was being utilized by the VFS buffer cache. I'm not 100% sure of
the actual utilization, but after reading the email from mbakiev@ I did
make a mental note before initiating a required reboot.

I came across this comment from Ubuntu's bugtracker:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/159356/comments/62

The author of post #62 notes that this particular behavior happens when
they are running several instances of Chrome. I don't know if the bug
filed there is related at all, but an interesting note is that I also
almost always happen to be interacting with Google Chrome when the
read-only remount happens.
Here is the last entry from journald before I rebooted:

Dec 03 00:00:39 tenforward kernel: BTRFS error (device dm-3): bad tree block start, want 761659392 have 15159222128734632161

Here are the only sysctl changes I made that would be relevant:

vm.swappiness = 10
vm.overcommit_memory = 1
vm.oom_kill_allocating_task = 1
vm.panic_on_oom = 1

Hope I didn't miss anything, thanks!

On Sat, Dec 1, 2018 at 6:21 PM Martin Bakiev wrote:
>
> I was having the same issue with kernels 4.19.2 and 4.19.4. I don’t appear
> to have the issue with 4.20.0-0.rc1 on Fedora Server 29.
>
> The issue is very easy to reproduce on my setup; not sure how much of it is
> actually relevant, but here it is:
>
> - 3 drive RAID5 created
> - Some data moved to it
> - Expanded to 7 drives
> - No balancing
>
> The issue is easily reproduced (within 30 mins) by starting multiple
> transfers to the volume (several TB in the form of many 30GB+ files).
> Multiple concurrent ‘rsync’ transfers seem to take a bit longer to trigger
> the issue, but multiple ‘cp’ commands will do it much quicker (again, not
> sure if relevant).
>
> I have not seen the issue occur with a single ‘rsync’ or ‘cp’ transfer, but
> I haven’t left one running alone for too long (I'm copying the data from
> multiple drives, so there is a lot to be gained from parallelizing the
> transfers).
>
> I’m not sure what state the FS is left in after a Magic SysRq reboot when it
> deadlocks, but seemingly it’s fine. There are no problems mounting, and
> ‘btrfs check’ passes OK. I’m sure some of the data doesn’t get flushed, but
> that's no problem for my use case.
>
> I’ve been running concurrent transfers with kernel 4.20.0-0.rc1 for 24 hours
> nonstop and I haven’t experienced the issue.
>
> Hope this helps.
Filesystem Corruption
Hello,

I have noticed an unusual amount of crc-errors in downloaded rars,
beginning about a week ago. But let's start with the preliminaries. I am
using Debian Stretch.

Kernel: Linux mars 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u4 (2018-08-21) x86_64 GNU/Linux
BTRFS-Tools: btrfs-progs 4.7.3-1

Smartctl shows no errors for any of the drives in the filesystem.
btrfs dev stats shows zero errors, but dmesg gives me a lot of
filesystem-related error messages:

[5390748.884929] Buffer I/O error on dev dm-0, logical block 976701312, async page read

This error is shown many times in the log. It seems to affect just newly
written files. This is the output of btrfs scrub status:

scrub status for 1609e4e1-4037-4d31-bf12-f84a691db5d8
	scrub started at Tue Nov 27 06:02:04 2018 and finished after 07:34:16
	total bytes scrubbed: 17.29TiB with 0 errors

What is the probable cause of these errors? How can I fix this?

Thanks in advance for your advice
Stefan