Re: Ran into "invalid block group size" bug, unclear how to proceed.
Apologies for the dupe Chris, I neglected to hit Reply-All. Comments below.

On Mon, Dec 3, 2018 at 9:56 PM Chris Murphy wrote:
>
> On Mon, Dec 3, 2018 at 8:32 PM Mike Javorski wrote:
> >
> > Need a bit of advice here ladies / gents. I am running into an issue
> > which Qu Wenruo seems to have posted a patch for several weeks ago
> > (see https://patchwork.kernel.org/patch/10694997/).
> >
> > Here is the relevant dmesg output which led me to Qu's patch.
> >
> > [   10.032475] BTRFS critical (device sdb): corrupt leaf: root=2
> > block=24655027060736 slot=20 bg_start=13188988928 bg_len=10804527104,
> > invalid block group size, have 10804527104 expect (0, 10737418240]
> > [   10.032493] BTRFS error (device sdb): failed to read block groups: -5
> > [   10.053365] BTRFS error (device sdb): open_ctree failed
> >
> > This server has a 16 disk btrfs filesystem (RAID6) which I boot
> > periodically to btrfs-send snapshots to. This machine is running
> > ArchLinux and I had just updated to their latest 4.19.4 kernel
> > package (from 4.18.10 which was working fine). I've tried updating to
> > the 4.19.6 kernel that is in testing, but that doesn't seem to resolve
> > the issue. From what I can see on kernel.org, the patch above is not
> > pushed to stable or to Linus' tree.
> >
> > At this point the question is what to do. Is my FS toast? Could I
> > revert to the 4.18.10 kernel and boot safely? I don't know if the 4.19
> > boot process may have flipped some bits which would make reverting
> > problematic.
>
> That patch is not yet merged in linux-next, so to use it you'd need to
> apply it yourself and compile a kernel. I can't tell for sure if it'd
> help.
>
> But the less you change the file system, the better the chance of
> saving it. I have no idea why there'd be a corrupt leaf just due to a
> kernel version change, though.
>
> Needless to say, raid56 just seems fragile once it runs into any kind
> of trouble. I personally wouldn't boot off it at all. I would only
> mount it from another system, ideally an installed system, but a live
> system with the kernel versions you need would also work. That way you
> can get more information without changes; booting will almost
> immediately mount rw, if mount succeeds at all, and will write a bunch
> of changes to the file system.

If the boot could corrupt the disk, that ship has already sailed: I have
previously attempted to mount the volume with the 4.19.4 and 4.19.6
kernels, and both failed, reporting the log lines in my original message.

I am hoping that Qu notices this thread at some point, as they are the
author of the original patch which introduced the check that is now
failing, as well as the unmerged patch linked earlier that adjusts the
check condition.

What I don't know is whether the checks up to that mount failure were
read-only (and thus I can revert to the older kernel), or whether
something was written to disk before the mount call failed. I don't want
to risk the 23 TiB of snapshot data stored there if there's an easy
workaround :-). I realize there are risks with the RAID56 code, but I've
done my best to mitigate them with server-grade hardware, ECC memory, a
UPS, and a redundant copy of the data via btrfs-send to this machine.
Losing this snapshot volume is not the end of the world, but I am about
to upgrade the primary server (which is currently running 4.19.4 without
issue, btw) and want to have a best-effort snapshot/backup in place
before I do so.

> Whether it's a case of 4.18.10 not detecting corruption that 4.19
> sees, or if 4.19 already caused it, the best chance is to not mount it
> rw, and not run check --repair, until you get some feedback from a
> developer.

+1

> The thing I'd like to see is
>
> # btrfs rescue super -v /anydevice/
> # btrfs insp dump-s -f /anydevice/
>
> First command will tell us if all the supers are the same and valid
> across all devices. And the second one, hopefully it's pointed to a
> device with a valid super, will tell us if there's a log root value
> other than 0. Both of those are read-only commands.

It was my understanding that rescue writes to the disk (the man page
indicates it recovers superblocks, which would imply writes). If you are
sure they both leave the on-disk data structures completely intact, I
would be willing to run them.

- mike
Re: [PATCH 2/5] btrfs: Refactor btrfs_can_relocate
On 3.12.18 19:25, David Sterba wrote:
> On Sat, Nov 17, 2018 at 09:29:27AM +0800, Anand Jain wrote:
>>> -	ret = find_free_dev_extent(trans, device, min_free,
>>> -				   &dev_offset, NULL);
>>> -	if (!ret)
>>> +	if (!find_free_dev_extent(trans, device, min_free,
>>> +				  &dev_offset, NULL))
>>
>> This can return -ENOMEM;
>>
>>> @@ -2856,8 +2856,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info *fs_info, u64 chunk_offset)
>>>  	 */
>>>  	lockdep_assert_held(&fs_info->delete_unused_bgs_mutex);
>>>
>>> -	ret = btrfs_can_relocate(fs_info, chunk_offset);
>>> -	if (ret)
>>> +	if (!btrfs_can_relocate(fs_info, chunk_offset))
>>>  		return -ENOSPC;
>>
>> And ends up converting -ENOMEM to -ENOSPC.
>>
>> It's better to propagate the accurate error.
>
> Right, converting to bool is obscuring the reason why the functions
> fail. Making the code simpler at this cost does not look like a good
> idea to me. I'll remove the patch from misc-next for now.

The patch itself does not make the code more obscure than it currently
is, because even if ENOMEM is returned it's still converted to ENOSPC in
btrfs_relocate_chunk.
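Anand's objection is easiest to see with a toy model of the call chain. The Python sketch below stands in for the kernel C purely for illustration (function names mirror the kernel ones for readability; errno values are the standard Linux ones): once the helper's return value is collapsed to a bool, the caller can no longer distinguish an allocation failure from "no space to relocate into".

```python
import errno

def find_free_dev_extent(oom: bool) -> int:
    """Returns 0 on success or a negative errno, like the kernel helper."""
    return -errno.ENOMEM if oom else 0

def btrfs_can_relocate(oom: bool) -> bool:
    # The refactor under review: the -ENOMEM detail is dropped right here.
    return find_free_dev_extent(oom) == 0

def btrfs_relocate_chunk(oom: bool) -> int:
    if not btrfs_can_relocate(oom):
        return -errno.ENOSPC    # every failure now surfaces as "no space"
    return 0

# An OOM inside the helper is reported to the caller as -ENOSPC.
print(btrfs_relocate_chunk(oom=True) == -errno.ENOSPC)  # -> True
```

This is exactly the conversion Nikolay notes already happens in btrfs_relocate_chunk today, which is why he argues the patch does not make things worse.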
Re: experiences running btrfs on external USB disks?
On 2018-12-04 14:59, Chris Murphy wrote:
>> Running 4.19.6 right now, but was experiencing the issue also with
>> 4.18 kernels.
>>
>> # btrfs device stats /data
>> [/dev/sda1].write_io_errs    0
>> [/dev/sda1].read_io_errs     0
>> [/dev/sda1].flush_io_errs    0
>> [/dev/sda1].corruption_errs  0
>> [/dev/sda1].generation_errs  0
>
> Hard to say without a complete dmesg; but errno=-5 IO failure is
> pretty much some kind of hardware problem in my experience. I haven't
> seen it be a bug.

It is a complete dmesg - in the sense of:

# grep -i btrfs -A5 -B5 /var/log/syslog

Dec  4 05:06:56 step snapd[747]: udevmon.go:184: udev monitor observed remove event for unknown device "/sys/skbuff_head_cache(1481:anacron.service)"
Dec  4 05:06:56 step snapd[747]: udevmon.go:184: udev monitor observed remove event for unknown device "/sys/buffer_head(1481:anacron.service)"
Dec  4 05:06:56 step snapd[747]: udevmon.go:184: udev monitor observed remove event for unknown device "/sys/ext4_inode_cache(1481:anacron.service)"
Dec  4 05:15:01 step CRON[9352]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Dec  4 05:17:01 step CRON[9358]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec  4 05:23:13 step kernel: [77760.444607] BTRFS error (device sdb1): bad tree block start, want 378372096 have 0
Dec  4 05:23:13 step kernel: [77760.550933] BTRFS error (device sdb1): bad tree block start, want 378372096 have 0
Dec  4 05:23:13 step kernel: [77760.550972] BTRFS: error (device sdb1) in __btrfs_free_extent:6804: errno=-5 IO failure
Dec  4 05:23:13 step kernel: [77760.550979] BTRFS info (device sdb1): forced readonly
Dec  4 05:23:13 step kernel: [77760.551003] BTRFS: error (device sdb1) in btrfs_run_delayed_refs:2935: errno=-5 IO failure
Dec  4 05:23:13 step kernel: [77760.553223] BTRFS error (device sdb1): pending csums is 4096
Dec  4 05:23:14 step postfix/pickup[8993]: 13BBE460F86: uid=0 from=
Dec  4 05:23:14 step postfix/cleanup[9398]: 13BBE460F86: message-id=<20181204052314.13BBE460F86@step>
Dec  4 05:23:14 step postfix/qmgr[2745]: 13BBE460F86: from=, size=404, nrcpt=1 (queue active)
Dec  4 05:23:14 step postfix/pickup[8993]: 40A964603EC: uid=0 from=

[...some emails follow, usual CRON messages etc., but nothing at all
generated by the kernel, no hardware issue reported...]

Tomasz Chmielewski
Re: experiences running btrfs on external USB disks?
On Mon, Dec 3, 2018 at 10:44 PM Tomasz Chmielewski wrote: > > I'm trying to use btrfs on an external USB drive, without much success. > > When the drive is connected for 2-3+ days, the filesystem gets remounted > readonly, with BTRFS saying "IO failure": > > [77760.444607] BTRFS error (device sdb1): bad tree block start, want > 378372096 have 0 > [77760.550933] BTRFS error (device sdb1): bad tree block start, want > 378372096 have 0 > [77760.550972] BTRFS: error (device sdb1) in __btrfs_free_extent:6804: > errno=-5 IO failure > [77760.550979] BTRFS info (device sdb1): forced readonly > [77760.551003] BTRFS: error (device sdb1) in > btrfs_run_delayed_refs:2935: errno=-5 IO failure > [77760.553223] BTRFS error (device sdb1): pending csums is 4096 > > > Note that there are no other kernel messages (i.e. that would indicate a > problem with disk, cable disconnection etc.). > > The load on the drive itself can be quite heavy at times (i.e. 100% IO > for 1-2 h and more) - can it contribute to the problem (i.e. btrfs > thinks there is some timeout somewhere)? > > Running 4.19.6 right now, but was experiencing the issue also with 4.18 > kernels. > > > > # btrfs device stats /data > [/dev/sda1].write_io_errs0 > [/dev/sda1].read_io_errs 0 > [/dev/sda1].flush_io_errs0 > [/dev/sda1].corruption_errs 0 > [/dev/sda1].generation_errs 0 Hard to say without a complete dmesg; but errno=-5 IO failure is pretty much some kind of hardware problem in my experience. I haven't seen it be a bug. -- Chris Murphy
Re: Ran into "invalid block group size" bug, unclear how to proceed.
On Mon, Dec 3, 2018 at 8:32 PM Mike Javorski wrote:
>
> Need a bit of advice here ladies / gents. I am running into an issue
> which Qu Wenruo seems to have posted a patch for several weeks ago
> (see https://patchwork.kernel.org/patch/10694997/).
>
> Here is the relevant dmesg output which led me to Qu's patch.
>
> [   10.032475] BTRFS critical (device sdb): corrupt leaf: root=2
> block=24655027060736 slot=20 bg_start=13188988928 bg_len=10804527104,
> invalid block group size, have 10804527104 expect (0, 10737418240]
> [   10.032493] BTRFS error (device sdb): failed to read block groups: -5
> [   10.053365] BTRFS error (device sdb): open_ctree failed
>
> This server has a 16 disk btrfs filesystem (RAID6) which I boot
> periodically to btrfs-send snapshots to. This machine is running
> ArchLinux and I had just updated to their latest 4.19.4 kernel
> package (from 4.18.10 which was working fine). I've tried updating to
> the 4.19.6 kernel that is in testing, but that doesn't seem to resolve
> the issue. From what I can see on kernel.org, the patch above is not
> pushed to stable or to Linus' tree.
>
> At this point the question is what to do. Is my FS toast? Could I
> revert to the 4.18.10 kernel and boot safely? I don't know if the 4.19
> boot process may have flipped some bits which would make reverting
> problematic.

That patch is not yet merged in linux-next, so to use it you'd need to
apply it yourself and compile a kernel. I can't tell for sure if it'd
help.

But the less you change the file system, the better the chance of saving
it. I have no idea why there'd be a corrupt leaf just due to a kernel
version change, though.

Needless to say, raid56 just seems fragile once it runs into any kind of
trouble. I personally wouldn't boot off it at all. I would only mount it
from another system, ideally an installed system, but a live system with
the kernel versions you need would also work. That way you can get more
information without changes; booting will almost immediately mount rw,
if mount succeeds at all, and will write a bunch of changes to the file
system.

Whether it's a case of 4.18.10 not detecting corruption that 4.19 sees,
or if 4.19 already caused it, the best chance is to not mount it rw, and
not run check --repair, until you get some feedback from a developer.

The thing I'd like to see is

# btrfs rescue super -v /anydevice/
# btrfs insp dump-s -f /anydevice/

The first command will tell us if all the supers are the same and valid
across all devices. And the second one, hopefully pointed at a device
with a valid super, will tell us if there's a log root value other
than 0. Both of those are read-only commands.

-- Chris Murphy
experiences running btrfs on external USB disks?
I'm trying to use btrfs on an external USB drive, without much success.

When the drive is connected for 2-3+ days, the filesystem gets remounted
readonly, with BTRFS saying "IO failure":

[77760.444607] BTRFS error (device sdb1): bad tree block start, want 378372096 have 0
[77760.550933] BTRFS error (device sdb1): bad tree block start, want 378372096 have 0
[77760.550972] BTRFS: error (device sdb1) in __btrfs_free_extent:6804: errno=-5 IO failure
[77760.550979] BTRFS info (device sdb1): forced readonly
[77760.551003] BTRFS: error (device sdb1) in btrfs_run_delayed_refs:2935: errno=-5 IO failure
[77760.553223] BTRFS error (device sdb1): pending csums is 4096

Note that there are no other kernel messages (i.e. that would indicate a
problem with disk, cable disconnection etc.).

The load on the drive itself can be quite heavy at times (i.e. 100% IO
for 1-2 h and more) - can it contribute to the problem (i.e. btrfs
thinks there is some timeout somewhere)?

Running 4.19.6 right now, but was experiencing the issue also with 4.18
kernels.

# btrfs device stats /data
[/dev/sda1].write_io_errs    0
[/dev/sda1].read_io_errs     0
[/dev/sda1].flush_io_errs    0
[/dev/sda1].corruption_errs  0
[/dev/sda1].generation_errs  0

Tomasz Chmielewski
Re: Ran into "invalid block group size" bug, unclear how to proceed.
Apologies for not scouring the mailing list completely (just subscribed,
in fact); it appears that Patrick Dijkgraaf also ran into this issue. He
went ahead with a volume rebuild, whereas I am hoping I can recover,
having run nothing more than a "btrfs scan" and "btrfs device ready" on
this machine since the kernel upgrade; the FS was intact up to that
point.

- mike

On Mon, Dec 3, 2018 at 7:32 PM Mike Javorski wrote:
>
> Need a bit of advice here ladies / gents. I am running into an issue
> which Qu Wenruo seems to have posted a patch for several weeks ago
> (see https://patchwork.kernel.org/patch/10694997/).
>
> Here is the relevant dmesg output which led me to Qu's patch.
>
> [   10.032475] BTRFS critical (device sdb): corrupt leaf: root=2
> block=24655027060736 slot=20 bg_start=13188988928 bg_len=10804527104,
> invalid block group size, have 10804527104 expect (0, 10737418240]
> [   10.032493] BTRFS error (device sdb): failed to read block groups: -5
> [   10.053365] BTRFS error (device sdb): open_ctree failed
>
> This server has a 16 disk btrfs filesystem (RAID6) which I boot
> periodically to btrfs-send snapshots to. This machine is running
> ArchLinux and I had just updated to their latest 4.19.4 kernel
> package (from 4.18.10 which was working fine). I've tried updating to
> the 4.19.6 kernel that is in testing, but that doesn't seem to resolve
> the issue. From what I can see on kernel.org, the patch above is not
> pushed to stable or to Linus' tree.
>
> At this point the question is what to do. Is my FS toast? Could I
> revert to the 4.18.10 kernel and boot safely? I don't know if the 4.19
> boot process may have flipped some bits which would make reverting
> problematic.
>
> Thanks much,
>
> - mike
Ran into "invalid block group size" bug, unclear how to proceed.
Need a bit of advice here ladies / gents. I am running into an issue
which Qu Wenruo seems to have posted a patch for several weeks ago (see
https://patchwork.kernel.org/patch/10694997/).

Here is the relevant dmesg output which led me to Qu's patch.

[   10.032475] BTRFS critical (device sdb): corrupt leaf: root=2 block=24655027060736 slot=20 bg_start=13188988928 bg_len=10804527104, invalid block group size, have 10804527104 expect (0, 10737418240]
[   10.032493] BTRFS error (device sdb): failed to read block groups: -5
[   10.053365] BTRFS error (device sdb): open_ctree failed

This server has a 16 disk btrfs filesystem (RAID6) which I boot
periodically to btrfs-send snapshots to. This machine is running
ArchLinux and I had just updated to their latest 4.19.4 kernel package
(from 4.18.10 which was working fine). I've tried updating to the 4.19.6
kernel that is in testing, but that doesn't seem to resolve the issue.
From what I can see on kernel.org, the patch above is not pushed to
stable or to Linus' tree.

At this point the question is what to do. Is my FS toast? Could I revert
to the 4.18.10 kernel and boot safely? I don't know if the 4.19 boot
process may have flipped some bits which would make reverting
problematic.

Thanks much,

- mike
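The arithmetic in that dmesg line is worth checking directly. A short Python sketch (illustrative only; the constant is simply the "expect (0, 10737418240]" bound the kernel printed, i.e. 10 GiB) mirroring the tree-checker condition that rejects the leaf:

```python
# Mirror of the size check implied by the dmesg line above:
# bg_len must lie in (0, 10 GiB].
MAX_BG_SIZE = 10 * 1024**3          # 10737418240, the bound in the log

def block_group_size_valid(bg_len: int) -> bool:
    """True if bg_len passes the 4.19 tree-checker bound from the log."""
    return 0 < bg_len <= MAX_BG_SIZE

bg_len = 10804527104                # value reported for the corrupt leaf
print(block_group_size_valid(bg_len))   # -> False, hence open_ctree fails
print(bg_len - MAX_BG_SIZE)             # -> 67108864: exactly 64 MiB over
```

Notably, the block group is only 64 MiB over the checker's cap, which fits the thread's premise that the check condition (the one the unmerged patch adjusts) is too strict, rather than the leaf being random garbage.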
Re: Need help with potential ~45TB dataloss
Also useful information for autopsy, perhaps not for fixing, is whether
the SCT ERC value for every drive is less than the kernel's SCSI driver
block device command timeout value. It's super important that the drive
reports an explicit read failure before the read command is considered
failed by the kernel. If the drive is still trying to do a read and the
kernel command timer times out, the kernel will just reset the whole
link, and we lose the outcome of the hanging command. Only upon an
explicit read error can Btrfs, or md RAID, know which device and
physical sector has a problem, and therefore how to reconstruct the
block and fix the bad sector with a write of known good data.

smartctl -l scterc /device/

and

cat /sys/block/sda/device/timeout

Only if SCT ERC is enabled with a value below 30, or if the kernel
command timer is changed to be well above 30 (like 180, which is
absolutely crazy but a separate conversation), can we be sure that there
haven't just been resets going on for a while; such resets prevent bad
sectors from being fixed up and can contribute to the problem. This
comes up on the linux-raid (mainly md driver) list all the time, and it
contributes to lost RAID all the time. Arguably it also leads to
unnecessary data loss in the single device desktop/laptop use case.

Chris Murphy
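The comparison Chris describes can be sketched in a few lines. smartctl reports SCT ERC in tenths of a second, while the sysfs timeout is in seconds; the example values below are illustrative, and the 120 s internal retry limit assumed for a drive with ERC disabled is a stand-in, not a quoted spec:

```python
def drive_reports_error_first(scterc_deciseconds: int,
                              kernel_timeout_seconds: int,
                              disabled_retry_seconds: int = 120) -> bool:
    """True if the drive gives up on a bad sector (explicit read error)
    before the kernel command timer fires and resets the link.
    scterc_deciseconds == 0 means ERC is disabled; the drive's internal
    retry limit is then unknown, so an assumed ceiling is used."""
    if scterc_deciseconds == 0:
        drive_limit = disabled_retry_seconds
    else:
        drive_limit = scterc_deciseconds / 10
    return drive_limit < kernel_timeout_seconds

print(drive_reports_error_first(70, 30))   # 7 s ERC vs 30 s timer -> True
print(drive_reports_error_first(0, 30))    # ERC disabled -> False, resets likely
print(drive_reports_error_first(0, 180))   # raised timer, as Chris suggests -> True
```

This matches Chris's two remedies: either enable ERC below 30 s, or raise the kernel command timer far above the drive's worst-case recovery time.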
Re: BTRFS Mount Delay Time Graph
On Mon, Dec 3, 2018 at 1:04 PM Lionel Bouton wrote:
>
> Le 03/12/2018 à 20:56, Lionel Bouton a écrit :
> > [...]
> > Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> > tuning of the io queue (switching between classic io-schedulers and
> > blk-mq ones in the virtual machines) and BTRFS mount options
> > (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> > in mount time (I managed to reduce the mount of IO requests
>
> Sent too quickly: I meant to write "managed to reduce by half the number
> of IO write requests for the same amount of data written"
>
> > by half on
> > one server in production though although more tests are needed to
> > isolate the cause).

Interesting. I wonder whether it's ssd_spread or space_cache=v2 that
reduces the writes by half, and by how much each contributes. That's a
major reduction in writes, and it suggests further optimization may be
possible to help mitigate the wandering trees impact.

-- Chris Murphy
Re: BTRFS Mount Delay Time Graph
On 2018/12/4 上午2:20, Wilson, Ellis wrote: > Hi all, > > Many months ago I promised to graph how long it took to mount a BTRFS > filesystem as it grows. I finally had (made) time for this, and the > attached is the result of my testing. The image is a fairly > self-explanatory graph, and the raw data is also attached in > comma-delimited format for the more curious. The columns are: > Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s). > > Experimental setup: > - System: > Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 > 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux > - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives. > - 3 unmount/mount cycles performed in between adding another 250GB of data > - 250GB of data added each time in the form of 25x10GB files in their > own directory. Files generated in parallel each epoch (25 at the same > time, with a 1MB record size). > - 240 repetitions of this performed (to collect timings in increments of > 250GB between a 0GB and 60TB filesystem) > - Normal "time" command used to measure time to mount. "Real" time used > of the timings reported from time. > - Mount: > /dev/md0 on /btrfs type btrfs > (rw,relatime,space_cache=v2,subvolid=5,subvol=/) > > At 60TB, we take 30s to mount the filesystem, which is actually not as > bad as I originally thought it would be (perhaps as a result of using > RAID0 via mdraid rather than native RAID0 in BTRFS). However, I am open > to comment if folks more intimately familiar with BTRFS think this is > due to the very large files I've used. I can redo the test with much > more realistic data if people have legitimate reason to think it will > drastically change the result. > > With 14TB drives available today, it doesn't take more than a handful of > drives to result in a filesystem that takes around a minute to mount. 
> As a result of this, I suspect this will become an increasing problem
> for serious users of BTRFS as time goes on. I'm not complaining as I'm
> not a contributor so I have no room to do so -- just shedding some
> light on a problem that may deserve attention as filesystem sizes
> continue to grow.

This problem is somewhat known.

If you dig further, it's btrfs_read_block_groups() which tries to read
*ALL* block group items. And, to no one's surprise, the larger the fs
grows, the more block group items need to be read from disk.

We need some way to delay such reads to improve this case.

Thanks,
Qu

> Best,
>
> ellis
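Qu's point about btrfs_read_block_groups() scanning every block group item also makes the measured 30 s at 60 TB plausible on a back-of-the-envelope basis. The sketch below assumes roughly 1 GiB per data block group and 0.5 ms per cold random read; both numbers are my assumptions for illustration, not measurements from the thread:

```python
# One block group item ~= one cold random read at mount time.
GiB = 1024**3
fs_size = 60 * 1024 * GiB      # the ~60 TB end point of Ellis's graph
bg_size = GiB                  # assumed size of one data block group
random_read_s = 0.0005         # assumed 0.5 ms per cold random read

block_groups = fs_size // bg_size
mount_estimate_s = block_groups * random_read_s
print(block_groups)            # -> 61440
print(mount_estimate_s)        # ~30 s, the same order as measured
```

Tens of thousands of scattered reads at sub-millisecond latency lands right in the observed range, which is consistent with mount time being bounded by cold random read IOPS rather than raw bandwidth.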
Re: BTRFS Mount Delay Time Graph
Hi, On 12/3/18 8:56 PM, Lionel Bouton wrote: > > Le 03/12/2018 à 19:20, Wilson, Ellis a écrit : >> >> Many months ago I promised to graph how long it took to mount a BTRFS >> filesystem as it grows. I finally had (made) time for this, and the >> attached is the result of my testing. The image is a fairly >> self-explanatory graph, and the raw data is also attached in >> comma-delimited format for the more curious. The columns are: >> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s). >> >> Experimental setup: >> - System: >> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 >> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux >> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives. >> - 3 unmount/mount cycles performed in between adding another 250GB of data >> - 250GB of data added each time in the form of 25x10GB files in their >> own directory. Files generated in parallel each epoch (25 at the same >> time, with a 1MB record size). >> - 240 repetitions of this performed (to collect timings in increments of >> 250GB between a 0GB and 60TB filesystem) >> - Normal "time" command used to measure time to mount. "Real" time used >> of the timings reported from time. >> - Mount: >> /dev/md0 on /btrfs type btrfs >> (rw,relatime,space_cache=v2,subvolid=5,subvol=/) >> >> At 60TB, we take 30s to mount the filesystem, which is actually not as >> bad as I originally thought it would be (perhaps as a result of using >> RAID0 via mdraid rather than native RAID0 in BTRFS). However, I am open >> to comment if folks more intimately familiar with BTRFS think this is >> due to the very large files I've used. Probably yes. The thing that is happening is that all block group items are read from the extent tree. And, instead of being nicely grouped together, they are scattered all over the place, at their virtual address, in between all normal extent items. 
So, mount time depends on cold random read iops your storage can do, and the size of the extent tree and amount of block groups. And, your extent tree has more items in it if you have more extents. So, yes, writing a lot of 4kiB files should have a similar effect I think as a lot of 128MiB files that are still stored in 1 extent per file. > I can redo the test with much >> more realistic data if people have legitimate reason to think it will >> drastically change the result. > > We are hosting some large BTRFS filesystems on Ceph (RBD used by > QEMU/KVM). I believe the delay is heavily linked to the number of files > (I didn't check if snapshots matter and I suspect it does but not as > much as the number of "original" files at least if you don't heavily > modify existing files but mostly create new ones as we do). > As an example, we have a filesystem with 20TB used space with 4 > subvolumes hosting multi millions files/directories (probably 10-20 > millions total I didn't check the exact number recently as simply > counting files is a very long process) and 40 snapshots for each volume. > Mount takes about 15 minutes. > We have virtual machines that we don't reboot as often as we would like > because of these slow mount times. > > If you want to study this, you could : > - graph the delay for various individual file sizes (instead of 25x10GB, > create 2 500 x 100MB and 250 000 x 1MB files between each run and > compare to the original result) > - graph the delay vs the number of snapshots (probably starting with a > large number of files in the initial subvolume to start with a non > trivial mount delay) > You may want to study the impact of the differences between snapshots by > comparing snapshoting without modifications and snapshots made at > various stages of your suvolume growth. 
> > Note : recently I tried upgrading from 4.9 to 4.14 kernels, various > tuning of the io queue (switching between classic io-schedulers and > blk-mq ones in the virtual machines) and BTRFS mount options > (space_cache=v2,ssd_spread) but there wasn't any measurable improvement > in mount time (I managed to reduce the mount of IO requests by half on > one server in production though although more tests are needed to > isolate the cause). > I didn't expect much for the mount times, it seems to me that mount is > mostly constrained by the BTRFS on disk structures needed at mount time > and how the filesystem reads them (for example it doesn't benefit at all > from large IO queue depths which probably means that each read depends > on previous ones which prevents io-schedulers from optimizing anything). Yes, I think that's true. See btrfs_read_block_groups in extent-tree.c: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c#n9982 What the code is doing here is starting at the beginning of the extent tree, searching forward until it sees the first BLOCK_GROUP_ITEM (which is not that far away), and then based on the information in it, computes w
Re: [PATCH] btrfs-progs: trivial fix about line breaker in repair_inode_nbytes_lowmem()
On Sun, Dec 02, 2018 at 03:08:36PM +, damenly...@gmail.com wrote: > From: Su Yue > > Move "\n" at end of the sentence to print. > > Fixes: 281eec7a9ddf ("btrfs-progs: check: repair inode nbytes in lowmem mode") > Signed-off-by: Su Yue Applied, thanks.
Re: BTRFS Mount Delay Time Graph
Le 03/12/2018 à 20:56, Lionel Bouton a écrit :
> [...]
> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> tuning of the io queue (switching between classic io-schedulers and
> blk-mq ones in the virtual machines) and BTRFS mount options
> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> in mount time (I managed to reduce the mount of IO requests

Sent too quickly: I meant to write "managed to reduce by half the number
of IO write requests for the same amount of data written".

> by half on
> one server in production though although more tests are needed to
> isolate the cause).
Re: BTRFS Mount Delay Time Graph
Hi, Le 03/12/2018 à 19:20, Wilson, Ellis a écrit : > Hi all, > > Many months ago I promised to graph how long it took to mount a BTRFS > filesystem as it grows. I finally had (made) time for this, and the > attached is the result of my testing. The image is a fairly > self-explanatory graph, and the raw data is also attached in > comma-delimited format for the more curious. The columns are: > Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s). > > Experimental setup: > - System: > Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 > 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux > - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives. > - 3 unmount/mount cycles performed in between adding another 250GB of data > - 250GB of data added each time in the form of 25x10GB files in their > own directory. Files generated in parallel each epoch (25 at the same > time, with a 1MB record size). > - 240 repetitions of this performed (to collect timings in increments of > 250GB between a 0GB and 60TB filesystem) > - Normal "time" command used to measure time to mount. "Real" time used > of the timings reported from time. > - Mount: > /dev/md0 on /btrfs type btrfs > (rw,relatime,space_cache=v2,subvolid=5,subvol=/) > > At 60TB, we take 30s to mount the filesystem, which is actually not as > bad as I originally thought it would be (perhaps as a result of using > RAID0 via mdraid rather than native RAID0 in BTRFS). However, I am open > to comment if folks more intimately familiar with BTRFS think this is > due to the very large files I've used. I can redo the test with much > more realistic data if people have legitimate reason to think it will > drastically change the result. We are hosting some large BTRFS filesystems on Ceph (RBD used by QEMU/KVM). 
I believe the delay is heavily linked to the number of files (I didn't
check if snapshots matter; I suspect they do, but not as much as the
number of "original" files, at least if you don't heavily modify
existing files but mostly create new ones, as we do).

As an example, we have a filesystem with 20TB used space, with 4
subvolumes hosting multiple millions of files/directories (probably
10-20 millions total; I didn't check the exact number recently, as
simply counting files is a very long process) and 40 snapshots for each
volume. Mount takes about 15 minutes. We have virtual machines that we
don't reboot as often as we would like because of these slow mount
times.

If you want to study this, you could:
- graph the delay for various individual file sizes (instead of 25x10GB,
  create 2500 x 100MB and 250000 x 1MB files between each run and
  compare to the original result)
- graph the delay vs the number of snapshots (probably starting with a
  large number of files in the initial subvolume, to start with a
  non-trivial mount delay)

You may want to study the impact of the differences between snapshots by
comparing snapshotting without modifications and snapshots made at
various stages of your subvolume growth.

Note: recently I tried upgrading from 4.9 to 4.14 kernels, various
tuning of the io queue (switching between classic io-schedulers and
blk-mq ones in the virtual machines) and BTRFS mount options
(space_cache=v2,ssd_spread), but there wasn't any measurable improvement
in mount time (I managed to reduce the amount of IO requests by half on
one server in production, though more tests are needed to isolate the
cause).

I didn't expect much for the mount times; it seems to me that mount is
mostly constrained by the BTRFS on-disk structures needed at mount time
and how the filesystem reads them (for example, it doesn't benefit at
all from large IO queue depths, which probably means that each read
depends on previous ones, preventing io-schedulers from optimizing
anything).

Best regards,

Lionel
Re: [PATCH] btrfs-progs: fsck-tests: Move reloc tree images to 020-extent-ref-cases
On Mon, Dec 03, 2018 at 12:39:57PM +0800, Qu Wenruo wrote: > For reloc tree, despite of its short lifespan, it's still the backref, > where reloc tree root backref points back to itself, makes it special. > > So it's more approriate to put them into 020-extent-ref-cases. > > Signed-off-by: Qu Wenruo Applied, thanks.
BTRFS Mount Delay Time Graph
Hi all,

Many months ago I promised to graph how long it took to mount a BTRFS
filesystem as it grows. I finally had (made) time for this, and the
attached is the result of my testing. The image is a fairly
self-explanatory graph, and the raw data is also attached in
comma-delimited format for the more curious. The columns are: Filesystem
Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).

Experimental setup:
- System: Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT
  Mon Nov 26 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
- 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
- 3 unmount/mount cycles performed in between adding another 250GB of data
- 250GB of data added each time in the form of 25x10GB files in their
  own directory. Files generated in parallel each epoch (25 at the same
  time, with a 1MB record size).
- 240 repetitions of this performed (to collect timings in increments of
  250GB between a 0GB and 60TB filesystem)
- Normal "time" command used to measure time to mount. "Real" time used
  of the timings reported from time.
- Mount: /dev/md0 on /btrfs type btrfs
  (rw,relatime,space_cache=v2,subvolid=5,subvol=/)

At 60TB, we take 30s to mount the filesystem, which is actually not as
bad as I originally thought it would be (perhaps as a result of using
RAID0 via mdraid rather than native RAID0 in BTRFS). However, I am open
to comment if folks more intimately familiar with BTRFS think this is
due to the very large files I've used. I can redo the test with much
more realistic data if people have legitimate reason to think it will
drastically change the result.

With 14TB drives available today, it doesn't take more than a handful of
drives to result in a filesystem that takes around a minute to mount. As
a result of this, I suspect this will become an increasing problem for
serious users of BTRFS as time goes on.
I'm not complaining as I'm not a contributor so I have no room to do so -- just shedding some light on a problem that may deserve attention as filesystem sizes continue to grow. Best, ellis 0,0.018,0.037,0.016 250,0.245,0.098,0.066 500,0.417,0.119,0.138 750,0.284,0.073,0.066 1000,0.506,0.109,0.126 1250,0.824,0.134,0.204 1500,0.779,0.098,0.147 1750,0.805,0.107,0.215 2000,0.87,0.137,0.223 2250,1.009,0.168,0.226 2500,1.094,0.147,0.174 2750,0.908,0.137,0.246 3000,1.144,0.182,0.313 3250,1.232,0.209,0.312 3500,1.287,0.259,0.292 3750,1.29,0.166,0.298 4000,1.521,0.249,0.418 4250,1.448,0.341,0.395 4500,1.441,0.383,0.362 4750,1.555,0.35,0.371 5000,1.825,0.482,0.638 5250,1.731,0.69,0.928 5500,1.8,0.353,0.348 5750,1.979,0.295,1.194 6000,2.115,0.915,1.241 6250,2.238,0.614,1.735 6500,2.025,0.523,0.536 6750,2.15,0.458,0.727 7000,2.415,2.158,1.925 7250,2.589,1.059,2.24 7500,2.371,1.796,2.102 7750,2.737,1.579,1.659 8000,2.768,1.786,2.579 8250,2.979,2.544,2.654 8500,2.994,2.529,2.847 8750,3.042,2.283,2.947 9000,3.209,2.509,3.077 9250,3.124,2.7,3.096 9500,3.13,3.048,3.105 9750,3.444,2.702,3.33 10000,3.671,3.354,3.297 10250,3.639,3.468,3.681 10500,3.693,3.651,3.711 10750,3.729,3.135,3.303 11000,3.846,3.862,3.917 11250,4.006,3.668,3.861 11500,4.113,3.919,3.875 11750,3.968,3.774,3.985 12000,4.205,3.882,4.218 12250,4.454,4.354,4.444 12500,4.528,4.441,4.616 12750,4.688,4.206,4.252 13000,4.551,4.507,4.444 13250,4.806,5.059,4.81 13500,5.041,4.662,4.997 13750,5.057,4.394,4.713 14000,5.029,5.03,4.927 14250,5.173,5.259,5.101 14500,5.104,5.3,5.416 14750,4.809,4.62,4.698 15000,5.045,5.066,4.806 15250,5.101,5.159,5.174 15500,5.074,5.245,5.65 15750,5.123,5.031,5.056 16000,5.518,5.097,5.595 16250,5.318,5.463,5.353 16500,5.63,5.689,5.768 16750,5.375,5.24,5.165 17000,5.578,5.846,5.628 17250,5.73,5.774,5.726 17500,6.108,6.202,6.226 17750,5.645,5.668,5.936 18000,6.308,5.925,6.317 18250,6.19,6.171,6.169 18500,6.442,6.601,6.403 18750,6.558,6.44,6.803 19000,6.664,7.176,6.742 19250,7.37,7.414,6.807
19500,7.021,7.143,7.253 19750,7.051,6.691,7.063 20000,6.942,6.858,7.225 20250,7.617,7.39,7.202 20500,7.239,7.525,7.381 20750,7.638,7.332,7.549 21000,7.697,8.081,7.807 21250,7.867,7.929,7.826 21500,7.98,8.208,8.059 21750,7.79,7.614,7.726 22000,8.144,8.611,8.361 22250,8.19,8.558,8.459 22500,8.685,8.785,8.617 22750,8.702,8.454,8.727 23000,8.653,8.699,8.89 23250,8.897,9.328,9.101 23500,9.245,9.456,9.464 23750,9.242,9.072,9.363 24000,9.367,8.934,9.541 24250,9.2,9.754,9.708 24500,9.622,9.472,9.484 24750,9.756,9.672,10.091 25000,10.207,10.304,9.981 25250,10.135,10.166,9.991 25500,9.969,10.234,10.266 25750,10.098,10.515,10.98 26000,10.811,10.6,11.3 26250,11.211,10.761,10.825 26500,10.799,11.075,10.973 26750,10.72,11.12,11.39 27000,11.463,11.106,11.679 27250,11.644,11.363,11.316 27500,11.541,11.748,11.657 27750,11.292,11.794,11.616 28000,11.888,11.697,12.169 28250,12.298,12.183,12.002 28500,12.124,12.48,12.352 28750,11.347,11.815,12.201 29000,12.009,11.72,12.734 29250,11.918,12.02,12.583 29500,12.445,12.439,12.466 29750,12.071,11.863,12.078 30000,12.287,12.18
Re: [PATCH 2/5] btrfs: Refactor btrfs_can_relocate
On Sat, Nov 17, 2018 at 09:29:27AM +0800, Anand Jain wrote: > > - ret = find_free_dev_extent(trans, device, min_free, > > - &dev_offset, NULL); > > - if (!ret) > > + if (!find_free_dev_extent(trans, device, min_free, > > + &dev_offset, NULL)) > > This can return -ENOMEM; > > > @@ -2856,8 +2856,7 @@ static int btrfs_relocate_chunk(struct btrfs_fs_info > > *fs_info, u64 chunk_offset) > > */ > > lockdep_assert_held(&fs_info->delete_unused_bgs_mutex); > > > > - ret = btrfs_can_relocate(fs_info, chunk_offset); > > - if (ret) > > + if (!btrfs_can_relocate(fs_info, chunk_offset)) > > return -ENOSPC; > > And ends up converting -ENOMEM to -ENOSPC. > > Its better to propagate the accurate error. Right, converting to bool is obscuring the reason why the functions fail. Making the code simpler at this cost does not look like a good idea to me. I'll remove the patch from misc-next for now.
Re: [PATCH 0/4] Replace custom device-replace locking with rwsem
On Tue, Nov 20, 2018 at 01:50:54PM +0100, David Sterba wrote: > The first cleanup part went to 4.19, the actual switch from the custom > locking to rwsem was postponed as I found performance degradation. This > turned out to be related to VM cache settings, so I'm resending the > series again. > > The custom locking is based on rwlock protected reader/writer counters, > waitqueues, which essentially is what the readwrite semaphore does. > > Previous patchset: > https://lore.kernel.org/linux-btrfs/cover.1536331604.git.dste...@suse.com/ > > Patches correspond to 8/11-11/11 and there's no change besides > refreshing on top of current misc-next. > > David Sterba (4): > btrfs: reada: reorder dev-replace locks before radix tree preload > btrfs: dev-replace: swich locking to rw semaphore > btrfs: dev-replace: remove custom read/write blocking scheme > btrfs: dev-replace: open code trivial locking helpers This has been sitting in for-next for some time, no problems reported. If anybody wants to add a review tag, please let me know. I'm going to add the patchset to misc-next soon.
Btrfs progs pre-release 4.19.1-rc1
Hi, this is a pre-release of btrfs-progs, 4.19.1-rc1. There are build fixes, a minor update to libbtrfsutil, and documentation updates. The 4.19.1 release is scheduled for this Wednesday, +2 days (2018-12-05). Changelog: * build fixes * big-endian builds fail due to bswap helpers clash * 'swap' macro is too generic, renamed to prevent build failures * libbtrfs * minor version update to 1.1.0 * fix default search to top=0 as documented * rename 'async' to avoid future python binding problems * add support for unprivileged subvolume listing ioctls * added tests, API docs * other * lot of typos fixed * warning cleanups * doc formatting updates * CI tests against zstd 1.3.7 Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/ Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git Shortlog: David Sterba (7): btrfs-progs: kerncompat: rename swap to __swap btrfs-progs: README: add link to INSTALL btrfs-progs: docs: fix rendering of exponents in manual pages btrfs-progs: link to libbtrfsutil/README from the main README btrfs-progs: tests: pull zstd version 1.3.7 to the travis CI btrfs-progs: update CHANGES for v4.19.1 Btrfs progs v4.19.1-rc1 Josh Soref (11): btrfs-progs: docs: fix typos in Documentation btrfs-progs: docs: fix typos in READMEs, INSTALL and CHANGES btrfs-progs: fix typos in Makefile btrfs-progs: tests: fix typos in test comments btrfs-progs: tests: fsck/025, fix typo in helpre name btrfs-progs: fix typos in comments btrfs-progs: fix typos in user-visible strings btrfs-progs: check: fix typo in device_extent_record::chunk_objectid btrfs-progs: utils: fix typo in a variable btrfs-progs: mkfs: fix typo in "multipler" variables btrfs-progs: fix typo in btrfs-list function export Omar Sandoval (10): libbtrfsutil: use top=0 as default for SubvolumeIterator() libbtrfsutil: change async parameters to async_ in Python bindings libbtrfsutil: document qgroup_inherit parameter in Python bindings libbtrfsutil: use SubvolumeIterator as
context manager in tests libbtrfsutil: add test helpers for dropping privileges libbtrfsutil: allow tests to create multiple Btrfs instances libbtrfsutil: relax the privileges of subvolume_info() libbtrfsutil: relax the privileges of subvolume iterator libbtrfsutil: bump version to 1.1.0 libbtrfsutil: document API in README Rosen Penev (3): btrfs-progs: kernel-lib: bitops: Fix big endian compilation btrfs-progs: task-utils: Fix comparison between pointer and integer btrfs-progs: treewide: Fix missing declarations
Re: Filesystem Corruption
On Mon, Dec 3, 2018, at 4:31 AM, Stefan Malte Schumacher wrote: > I have noticed an unusual amount of crc-errors in downloaded rars, > beginning about a week ago. But lets start with the preliminaries. I > am using Debian Stretch. > Kernel: Linux mars 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u4 > (2018-08-21) x86_64 GNU/Linux > > [5390748.884929] Buffer I/O error on dev dm-0, logical block > 976701312, async page read Excuse me for butting in when there are *many* more qualified people on this list. But assuming the rar crc errors are related to your unexplained buffer I/O errors (and not some weird coincidence of simply bad downloads), I would start, immediately, by testing the memory. RAM corruption can wreak havoc with btrfs (or any filesystem, but I think BTRFS has special challenges in this regard), and this looks like a memory error to me.
[PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput
The cleaner thread usually takes care of delayed iputs, with the exception of the btrfs_end_transaction_throttle path. The cleaner thread only gets woken up every 30 seconds, so instead wake it up to do its work so that we can free up that space as quickly as possible. Reviewed-by: Filipe Manana Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 3 +++ fs/btrfs/disk-io.c | 3 +++ fs/btrfs/inode.c | 2 ++ 3 files changed, 8 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index c8ddbacb6748..dc56a4d940c3 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -769,6 +769,9 @@ bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr); */ #define BTRFS_FS_BALANCE_RUNNING 18 +/* Indicate that the cleaner thread is awake and doing something. */ +#define BTRFS_FS_CLEANER_RUNNING 19 + struct btrfs_fs_info { u8 fsid[BTRFS_FSID_SIZE]; u8 chunk_tree_uuid[BTRFS_UUID_SIZE]; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index c5918ff8241b..f40f6fdc1019 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1669,6 +1669,8 @@ static int cleaner_kthread(void *arg) while (1) { again = 0; + set_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags); + /* Make the cleaner go to sleep early.
*/ if (btrfs_need_cleaner_sleep(fs_info)) goto sleep; @@ -1715,6 +1717,7 @@ static int cleaner_kthread(void *arg) */ btrfs_delete_unused_bgs(fs_info); sleep: + clear_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags); if (kthread_should_park()) kthread_parkme(); if (kthread_should_stop()) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 8ac7abe2ae9b..0b9f3e482cea 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3264,6 +3264,8 @@ void btrfs_add_delayed_iput(struct inode *inode) ASSERT(list_empty(&binode->delayed_iput)); list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs); spin_unlock(&fs_info->delayed_iput_lock); + if (!test_bit(BTRFS_FS_CLEANER_RUNNING, &fs_info->flags)) + wake_up_process(fs_info->cleaner_kthread); } void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info) -- 2.14.3
[PATCH 1/3] btrfs: run delayed iputs before committing
Delayed iputs means we can have final iputs of deleted inodes in the queue, which could potentially generate a lot of pinned space that could be free'd. So before we decide to commit the transaction for ENOSPC reasons, run the delayed iputs so that any potential space is free'd up. If there are any and we freed enough, we can then commit the transaction and potentially be able to make our reservation. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval --- fs/btrfs/extent-tree.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 8dfddfd3f315..0127d272cd2a 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4953,6 +4953,15 @@ static void flush_space(struct btrfs_fs_info *fs_info, ret = 0; break; case COMMIT_TRANS: + /* +* If we have pending delayed iputs then we could free up a +* bunch of pinned space, so make sure we run the iputs before +* we do our pinned bytes check below. +*/ + mutex_lock(&fs_info->cleaner_delayed_iput_mutex); + btrfs_run_delayed_iputs(fs_info); + mutex_unlock(&fs_info->cleaner_delayed_iput_mutex); + ret = may_commit_transaction(fs_info, space_info); break; default: -- 2.14.3
[PATCH 0/3][V2] Delayed iput fixes
v1->v2: - only wake up if the cleaner isn't currently doing work. - re-arranged some stuff for running delayed iputs during flushing. - removed the open code wakeup in the waitqueue patch. -- Original message -- Here are some delayed iput fixes. Delayed iputs can hold reservations for a while and there's no real good way to make sure they were gone for good, which means we could early enospc when in reality if we had just waited for the iput we would have had plenty of space. So fix this up by making us wait for delayed iputs when deciding if we need to commit for enospc flushing, and then cleanup and rework how we run delayed iputs to make it more straightforward to wait on them and make sure we're all done using them. Thanks, Josef
[PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
The throttle path doesn't take cleaner_delayed_iput_mutex, which means we could think we're done flushing iputs in the data space reservation path when we could have a throttler doing an iput. There's no real reason to serialize the delayed iput flushing, so instead of taking the cleaner_delayed_iput_mutex whenever we flush the delayed iputs just replace it with an atomic counter and a waitqueue. This removes the short (or long depending on how big the inode is) window where we think there are no more pending iputs when there really are some. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 4 +++- fs/btrfs/disk-io.c | 5 ++--- fs/btrfs/extent-tree.c | 13 - fs/btrfs/inode.c | 21 + 4 files changed, 34 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index dc56a4d940c3..20af5d6d81f1 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -915,7 +915,8 @@ struct btrfs_fs_info { spinlock_t delayed_iput_lock; struct list_head delayed_iputs; - struct mutex cleaner_delayed_iput_mutex; + atomic_t nr_delayed_iputs; + wait_queue_head_t delayed_iputs_wait; /* this protects tree_mod_seq_list */ spinlock_t tree_mod_seq_lock; @@ -3240,6 +3241,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root); int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size); void btrfs_add_delayed_iput(struct inode *inode); void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info); +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info); int btrfs_prealloc_file_range(struct inode *inode, int mode, u64 start, u64 num_bytes, u64 min_size, loff_t actual_len, u64 *alloc_hint); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index f40f6fdc1019..238e0113f2d3 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1694,9 +1694,7 @@ static int cleaner_kthread(void *arg) goto sleep; } - mutex_lock(&fs_info->cleaner_delayed_iput_mutex); btrfs_run_delayed_iputs(fs_info); - mutex_unlock(&fs_info->cleaner_delayed_iput_mutex); again = 
btrfs_clean_one_deleted_snapshot(root); mutex_unlock(&fs_info->cleaner_mutex); @@ -2654,7 +2652,6 @@ int open_ctree(struct super_block *sb, mutex_init(&fs_info->delete_unused_bgs_mutex); mutex_init(&fs_info->reloc_mutex); mutex_init(&fs_info->delalloc_root_mutex); - mutex_init(&fs_info->cleaner_delayed_iput_mutex); seqlock_init(&fs_info->profiles_lock); INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots); @@ -2676,6 +2673,7 @@ int open_ctree(struct super_block *sb, atomic_set(&fs_info->defrag_running, 0); atomic_set(&fs_info->qgroup_op_seq, 0); atomic_set(&fs_info->reada_works_cnt, 0); + atomic_set(&fs_info->nr_delayed_iputs, 0); atomic64_set(&fs_info->tree_mod_seq, 0); fs_info->sb = sb; fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE; @@ -2753,6 +2751,7 @@ int open_ctree(struct super_block *sb, init_waitqueue_head(&fs_info->transaction_wait); init_waitqueue_head(&fs_info->transaction_blocked_wait); init_waitqueue_head(&fs_info->async_submit_wait); + init_waitqueue_head(&fs_info->delayed_iputs_wait); INIT_LIST_HEAD(&fs_info->pinned_chunks); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 0127d272cd2a..5b6c9fc227ff 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4280,10 +4280,14 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes) /* * The cleaner kthread might still be doing iput * operations. Wait for it to finish so that -* more space is released. +* more space is released. We don't need to +* explicitly run the delayed iputs here because +* the commit_transaction would have woken up +* the cleaner. */ - mutex_lock(&fs_info->cleaner_delayed_iput_mutex); - mutex_unlock(&fs_info->cleaner_delayed_iput_mutex); + ret = btrfs_wait_on_delayed_iputs(fs_info); + if (ret) + return ret; goto again; } else { btrfs_end_transaction(trans); @@ -4958,9 +4962,8 @@ static void flush_space(struct btrfs_fs_info *fs_info, * bunch of pinned space, so make sure we run the iputs before * we do our pinned bytes check below.
[PATCH 8/8] btrfs: reserve extra space during evict()
We could generate a lot of delayed refs in evict but never have any left over space from our block rsv to make up for that fact. So reserve some extra space and give it to the transaction so it can be used to refill the delayed refs rsv every loop through the truncate path. Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 623a71d871d4..8ac7abe2ae9b 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5258,13 +5258,15 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, { struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv; + u64 delayed_refs_extra = btrfs_calc_trans_metadata_size(fs_info, 1); int failures = 0; for (;;) { struct btrfs_trans_handle *trans; int ret; - ret = btrfs_block_rsv_refill(root, rsv, rsv->size, + ret = btrfs_block_rsv_refill(root, rsv, +rsv->size + delayed_refs_extra, BTRFS_RESERVE_FLUSH_LIMIT); if (ret && ++failures > 2) { @@ -5273,9 +5275,28 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, return ERR_PTR(-ENOSPC); } + /* +* Evict can generate a large amount of delayed refs without +* having a way to add space back since we exhaust our temporary +* block rsv. We aren't allowed to do FLUSH_ALL in this case +* because we could deadlock with so many things in the flushing +* code, so we have to try and hold some extra space to +* compensate for our delayed ref generation. If we can't get +* that space then we need see if we can steal our minimum from +* the global reserve. We will be ratelimited by the amount of +* space we have for the delayed refs rsv, so we'll end up +* committing and trying again. 
+*/ trans = btrfs_join_transaction(root); - if (IS_ERR(trans) || !ret) + if (IS_ERR(trans) || !ret) { + if (!IS_ERR(trans)) { + trans->block_rsv = &fs_info->trans_block_rsv; + trans->bytes_reserved = delayed_refs_extra; + btrfs_block_rsv_migrate(rsv, trans->block_rsv, + delayed_refs_extra, 1); + } return trans; + } /* * Try to steal from the global reserve if there is space for -- 2.14.3
[PATCH 1/8] btrfs: check if free bgs for commit
may_commit_transaction will skip committing the transaction if we don't have enough pinned space or if we're trying to find space for a SYSTEM chunk. However if we have pending free block groups in this transaction we still want to commit as we may be able to allocate a chunk to make our reservation. So instead of just returning ENOSPC, check if we have free block groups pending, and if so commit the transaction to allow us to use that free space. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval Reviewed-by: Nikolay Borisov --- fs/btrfs/extent-tree.c | 34 -- 1 file changed, 20 insertions(+), 14 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 07ef1b8087f7..755eb226d32d 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4853,10 +4853,19 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, if (!bytes_needed) return 0; - /* See if there is enough pinned space to make this reservation */ - if (__percpu_counter_compare(&space_info->total_bytes_pinned, - bytes_needed, - BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0) + trans = btrfs_join_transaction(fs_info->extent_root); + if (IS_ERR(trans)) + return PTR_ERR(trans); + + /* +* See if there is enough pinned space to make this reservation, or if +* we have bg's that are going to be freed, allowing us to possibly do a +* chunk allocation the next loop through. +*/ + if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags) || + __percpu_counter_compare(&space_info->total_bytes_pinned, +bytes_needed, +BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0) goto commit; /* @@ -4864,7 +4873,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, * this reservation. 
*/ if (space_info != delayed_rsv->space_info) - return -ENOSPC; + goto enospc; spin_lock(&delayed_rsv->lock); reclaim_bytes += delayed_rsv->reserved; @@ -4878,17 +4887,14 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, bytes_needed -= reclaim_bytes; if (__percpu_counter_compare(&space_info->total_bytes_pinned, - bytes_needed, - BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) { - return -ENOSPC; - } - +bytes_needed, +BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) + goto enospc; commit: - trans = btrfs_join_transaction(fs_info->extent_root); - if (IS_ERR(trans)) - return -ENOSPC; - return btrfs_commit_transaction(trans); +enospc: + btrfs_end_transaction(trans); + return -ENOSPC; } /* -- 2.14.3
[PATCH 4/8] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
With my change to no longer take into account the global reserve for metadata allocation chunks we have this side-effect for mixed block group fs'es where we are no longer allocating enough chunks for the data/metadata requirements. To deal with this add a ALLOC_CHUNK_FORCE step to the flushing state machine. This will only get used if we've already made a full loop through the flushing machinery and tried committing the transaction. If we have then we can try and force a chunk allocation since we likely need it to make progress. This resolves the issues I was seeing with the mixed bg tests in xfstests with my previous patch. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 3 ++- fs/btrfs/extent-tree.c | 18 +- include/trace/events/btrfs.h | 1 + 3 files changed, 20 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 30da075c042e..7cf6ad021d81 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2750,7 +2750,8 @@ enum btrfs_flush_state { FLUSH_DELALLOC = 5, FLUSH_DELALLOC_WAIT = 6, ALLOC_CHUNK = 7, - COMMIT_TRANS= 8, + ALLOC_CHUNK_FORCE = 8, + COMMIT_TRANS= 9, }; int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 667b992d322d..2d0dd70570ca 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4938,6 +4938,7 @@ static void flush_space(struct btrfs_fs_info *fs_info, btrfs_end_transaction(trans); break; case ALLOC_CHUNK: + case ALLOC_CHUNK_FORCE: trans = btrfs_join_transaction(root); if (IS_ERR(trans)) { ret = PTR_ERR(trans); @@ -4945,7 +4946,9 @@ static void flush_space(struct btrfs_fs_info *fs_info, } ret = do_chunk_alloc(trans, btrfs_metadata_alloc_profile(fs_info), -CHUNK_ALLOC_NO_FORCE); +(state == ALLOC_CHUNK) ? 
+CHUNK_ALLOC_NO_FORCE : +CHUNK_ALLOC_FORCE); btrfs_end_transaction(trans); if (ret > 0 || ret == -ENOSPC) ret = 0; @@ -5081,6 +5084,19 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work) commit_cycles--; } + /* +* We don't want to force a chunk allocation until we've tried +* pretty hard to reclaim space. Think of the case where we +* free'd up a bunch of space and so have a lot of pinned space +* to reclaim. We would rather use that than possibly create a +* underutilized metadata chunk. So if this is our first run +* through the flushing state machine skip ALLOC_CHUNK_FORCE and +* commit the transaction. If nothing has changed the next go +* around then we can force a chunk allocation. +*/ + if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles) + flush_state++; + if (flush_state > COMMIT_TRANS) { commit_cycles++; if (commit_cycles > 2) { diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index 63d1f9d8b8c7..dd0e6f8d6b6e 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush, { FLUSH_DELAYED_REFS_NR,"FLUSH_DELAYED_REFS_NR"}, \ { FLUSH_DELAYED_REFS, "FLUSH_ELAYED_REFS"}, \ { ALLOC_CHUNK, "ALLOC_CHUNK"}, \ + { ALLOC_CHUNK_FORCE,"ALLOC_CHUNK_FORCE"}, \ { COMMIT_TRANS, "COMMIT_TRANS"}) TRACE_EVENT(btrfs_flush_space, -- 2.14.3
[PATCH 0/8][V2] Enospc cleanups and fixes
v1->v2: - addressed comments from reviewers. - fixed a bug in patch 6 that was introduced because of changes to upstream. -- Original message -- The delayed refs rsv patches exposed a bunch of issues in our enospc infrastructure that needed to be addressed. These aren't really one coherent group, but they are all around flushing and reservations. may_commit_transaction() needed to be updated a little bit, and we needed to add a new state to force chunk allocation if things got dicey. Also because we can end up needing to reserve a whole bunch of extra space for outstanding delayed refs we needed to add the ability to fail only the ENOSPC tickets that were too big to satisfy, instead of failing all of the tickets. There's also a fix in here for one of the corner cases where we didn't quite have enough space reserved for the delayed refs we were generating during evict(). Thanks, Josef
[PATCH 2/8] btrfs: dump block_rsv when dumping space info
For enospc_debug having the block rsvs is super helpful to see if we've done something wrong. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval Reviewed-by: David Sterba --- fs/btrfs/extent-tree.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 755eb226d32d..204b35434056 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -8063,6 +8063,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, return ret; } +#define DUMP_BLOCK_RSV(fs_info, rsv_name) \ +do { \ + struct btrfs_block_rsv *__rsv = &(fs_info)->rsv_name; \ + spin_lock(&__rsv->lock);\ + btrfs_info(fs_info, #rsv_name ": size %llu reserved %llu", \ + __rsv->size, __rsv->reserved); \ + spin_unlock(&__rsv->lock); \ +} while (0) + static void dump_space_info(struct btrfs_fs_info *fs_info, struct btrfs_space_info *info, u64 bytes, int dump_block_groups) @@ -8082,6 +8091,12 @@ static void dump_space_info(struct btrfs_fs_info *fs_info, info->bytes_readonly); spin_unlock(&info->lock); + DUMP_BLOCK_RSV(fs_info, global_block_rsv); + DUMP_BLOCK_RSV(fs_info, trans_block_rsv); + DUMP_BLOCK_RSV(fs_info, chunk_block_rsv); + DUMP_BLOCK_RSV(fs_info, delayed_block_rsv); + DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv); + if (!dump_block_groups) return; -- 2.14.3
[PATCH 6/8] btrfs: loop in inode_rsv_refill
With severe fragmentation we can end up with our inode rsv size being huge during writeout, which would cause us to need to make very large metadata reservations. However we may not actually need that much once writeout is complete. So instead try to make our reservation, and if we couldn't make it, re-calculate our new reservation size and try again. If our reservation size doesn't change between tries then we know we are actually out of space and can error out. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 58 +- 1 file changed, 43 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 0ee77a98f867..0e1a499035ac 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5787,6 +5787,21 @@ int btrfs_block_rsv_refill(struct btrfs_root *root, return ret; } +static inline void __get_refill_bytes(struct btrfs_block_rsv *block_rsv, + u64 *metadata_bytes, u64 *qgroup_bytes) +{ + *metadata_bytes = 0; + *qgroup_bytes = 0; + + spin_lock(&block_rsv->lock); + if (block_rsv->reserved < block_rsv->size) + *metadata_bytes = block_rsv->size - block_rsv->reserved; + if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size) + *qgroup_bytes = block_rsv->qgroup_rsv_size - + block_rsv->qgroup_rsv_reserved; + spin_unlock(&block_rsv->lock); +} + /** * btrfs_inode_rsv_refill - refill the inode block rsv. * @inode - the inode we are refilling.
@@ -5802,25 +5817,39 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode, { struct btrfs_root *root = inode->root; struct btrfs_block_rsv *block_rsv = &inode->block_rsv; - u64 num_bytes = 0; + u64 num_bytes = 0, last = 0; u64 qgroup_num_bytes = 0; int ret = -ENOSPC; - spin_lock(&block_rsv->lock); - if (block_rsv->reserved < block_rsv->size) - num_bytes = block_rsv->size - block_rsv->reserved; - if (block_rsv->qgroup_rsv_reserved < block_rsv->qgroup_rsv_size) - qgroup_num_bytes = block_rsv->qgroup_rsv_size - - block_rsv->qgroup_rsv_reserved; - spin_unlock(&block_rsv->lock); - + __get_refill_bytes(block_rsv, &num_bytes, &qgroup_num_bytes); if (num_bytes == 0) return 0; - ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true); - if (ret) - return ret; - ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); + do { + ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true); + if (ret) + return ret; + ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); + if (ret) { + btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes); + last = num_bytes; + /* +* If we are fragmented we can end up with a lot of +* outstanding extents which will make our size be much +* larger than our reserved amount. If we happen to +* try to do a reservation here that may result in us +* trying to do a pretty hefty reservation, which we may +* not need once delalloc flushing happens. If this is +* the case try and do the reserve again. 
+*/ + if (flush == BTRFS_RESERVE_FLUSH_ALL) + __get_refill_bytes(block_rsv, &num_bytes, + &qgroup_num_bytes); + if (num_bytes == 0) + return 0; + } + } while (ret && last != num_bytes); + if (!ret) { block_rsv_add_bytes(block_rsv, num_bytes, false); trace_btrfs_space_reservation(root->fs_info, "delalloc", @@ -5830,8 +5859,7 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode, spin_lock(&block_rsv->lock); block_rsv->qgroup_rsv_reserved += qgroup_num_bytes; spin_unlock(&block_rsv->lock); - } else - btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes); + } return ret; } -- 2.14.3
[PATCH 3/8] btrfs: don't use global rsv for chunk allocation
The should_alloc_chunk code has math in it to decide if we're getting short on space and if we should go ahead and pre-emptively allocate a new chunk. Previously when we did not have the delayed_refs_rsv, we had to assume that the global block rsv was essentially used space and could be allocated completely at any time, so we counted this space as "used" when determining if we had enough slack space in our current space_info. But on any slightly used file system (10gib or more) you can have a global reserve of 512mib. With our default chunk size being 1gib that means we just assume half of the block group is used, which could result in us allocating more metadata chunks than is actually required. With the delayed refs rsv we can flush delayed refs as the space becomes tight, and if we actually need more block groups then they will be allocated based on space pressure. We no longer require assuming the global reserve is used space in our calculations. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 9 - 1 file changed, 9 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 204b35434056..667b992d322d 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4398,21 +4398,12 @@ static inline u64 calc_global_rsv_need_space(struct btrfs_block_rsv *global) static int should_alloc_chunk(struct btrfs_fs_info *fs_info, struct btrfs_space_info *sinfo, int force) { - struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv; u64 bytes_used = btrfs_space_info_used(sinfo, false); u64 thresh; if (force == CHUNK_ALLOC_FORCE) return 1; - /* -* We need to take into account the global rsv because for all intents -* and purposes it's used space. Don't worry about locking the -* global_rsv, it doesn't change except when the transaction commits. 
-	 */
-	if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA)
-		bytes_used += calc_global_rsv_need_space(global_rsv);
-
 	/*
 	 * in limited mode, we want to have some free space up to
 	 * about 1% of the FS size.
-- 
2.14.3
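Outside the kernel, the effect of this patch can be sketched with a toy model (hypothetical names, and an assumed 80% threshold; the kernel derives `thresh` differently): only genuinely used bytes now count toward the allocation decision, with no global-reserve padding added on top.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CHUNK_ALLOC_FORCE 2

/*
 * Simplified model of should_alloc_chunk() after the patch: allocate a
 * new chunk when real used space crosses a threshold fraction of the
 * space_info's total. The global reserve is no longer folded into
 * bytes_used, so a mostly-empty chunk no longer looks half full.
 */
static bool should_alloc(uint64_t total, uint64_t bytes_used, int force)
{
	if (force == CHUNK_ALLOC_FORCE)
		return true;

	/* illustrative threshold: 80% of currently allocated chunks used */
	uint64_t thresh = total * 8 / 10;

	return bytes_used >= thresh;
}
```

With the old behavior a 512MiB reserve counted against a 1GiB chunk would push an otherwise half-empty space_info over this kind of threshold; dropping it avoids the premature allocation.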
[PATCH 7/8] btrfs: be more explicit about allowed flush states
For FLUSH_LIMIT flushers (think evict, truncate) we can deadlock when running delalloc because we may be holding a tree lock. We can also deadlock with delayed refs rsv's that are running via the committing mechanism. The only safe operations for FLUSH_LIMIT are to run the delayed operations and to allocate chunks; everything else has the potential to deadlock. Future-proof this by explicitly specifying the states that FLUSH_LIMIT is allowed to use. This will keep us from introducing bugs later on when adding new flush states.

Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0e1a499035ac..ab9d915d9289 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5123,12 +5123,18 @@ void btrfs_init_async_reclaim_work(struct work_struct *work)
 	INIT_WORK(work, btrfs_async_reclaim_metadata_space);
 }
 
+static const enum btrfs_flush_state priority_flush_states[] = {
+	FLUSH_DELAYED_ITEMS_NR,
+	FLUSH_DELAYED_ITEMS,
+	ALLOC_CHUNK,
+};
+
 static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
 					    struct btrfs_space_info *space_info,
 					    struct reserve_ticket *ticket)
 {
 	u64 to_reclaim;
-	int flush_state = FLUSH_DELAYED_ITEMS_NR;
+	int flush_state = 0;
 
 	spin_lock(&space_info->lock);
 	to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info,
@@ -5140,7 +5146,8 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
 	spin_unlock(&space_info->lock);
 
 	do {
-		flush_space(fs_info, space_info, to_reclaim, flush_state);
+		flush_space(fs_info, space_info, to_reclaim,
+			    priority_flush_states[flush_state]);
 		flush_state++;
 		spin_lock(&space_info->lock);
 		if (ticket->bytes == 0) {
@@ -5148,15 +5155,7 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info,
 			return;
 		}
 		spin_unlock(&space_info->lock);
-
-		/*
-		 * Priority flushers can't wait on delalloc without
-		 * deadlocking.
-		 */
-		if (flush_state == FLUSH_DELALLOC ||
-		    flush_state == FLUSH_DELALLOC_WAIT)
-			flush_state = ALLOC_CHUNK;
-	} while (flush_state < COMMIT_TRANS);
+	} while (flush_state < ARRAY_SIZE(priority_flush_states));
 }
 
 static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
-- 
2.14.3
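A minimal standalone sketch of the whitelist idea (the enum values here are invented stand-ins, not the kernel's definitions): the allowed states live in one array, and the loop bound is the array size instead of a sentinel state with ad-hoc skip logic.

```c
#include <assert.h>
#include <stddef.h>

/* stand-in flush states, mirroring the shape of btrfs_flush_state */
enum flush_state {
	FLUSH_DELAYED_ITEMS_NR,
	FLUSH_DELAYED_ITEMS,
	FLUSH_DELALLOC,		/* unsafe for priority flushers */
	FLUSH_DELALLOC_WAIT,	/* unsafe for priority flushers */
	ALLOC_CHUNK,
	COMMIT_TRANS,		/* unsafe for priority flushers */
};

/*
 * Explicit whitelist: a priority (FLUSH_LIMIT) flusher only ever walks
 * these entries, so adding a new state elsewhere cannot accidentally
 * enable it here.
 */
static const enum flush_state priority_flush_states[] = {
	FLUSH_DELAYED_ITEMS_NR,
	FLUSH_DELAYED_ITEMS,
	ALLOC_CHUNK,
};

static size_t priority_state_count(void)
{
	return sizeof(priority_flush_states) /
	       sizeof(priority_flush_states[0]);
}
```

The loop condition then becomes `flush_state < priority_state_count()`, matching the patch's `ARRAY_SIZE(priority_flush_states)`.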
[PATCH 5/8] btrfs: don't enospc all tickets on flush failure
With the introduction of the per-inode block_rsv it became possible to have really large reservation requests made because of data fragmentation. Since the ticket code assumed we'd always have relatively small reservation requests, it just killed all tickets if we were unable to satisfy the current request. However this is generally no longer the case. So fix this logic to instead check whether there was a ticket that we were able to give some reservation to, and if so, continue the flushing loop. Likewise, make the tickets use the space_info_add_old_bytes() method of returning whatever reservation they did receive, in hopes that it can satisfy reservations down the line.

Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c | 45 +
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2d0dd70570ca..0ee77a98f867 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4801,6 +4801,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
 }
 
 struct reserve_ticket {
+	u64 orig_bytes;
 	u64 bytes;
 	int error;
 	struct list_head list;
@@ -5023,7 +5024,7 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info,
 		!test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state));
 }
 
-static void wake_all_tickets(struct list_head *head)
+static bool wake_all_tickets(struct list_head *head)
 {
 	struct reserve_ticket *ticket;
 
@@ -5032,7 +5033,10 @@ static void wake_all_tickets(struct list_head *head)
 		list_del_init(&ticket->list);
 		ticket->error = -ENOSPC;
 		wake_up(&ticket->wait);
+		if (ticket->bytes != ticket->orig_bytes)
+			return true;
 	}
+	return false;
 }
 
 /*
@@ -5100,8 +5104,12 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work)
 		if (flush_state > COMMIT_TRANS) {
 			commit_cycles++;
 			if (commit_cycles > 2) {
-				wake_all_tickets(&space_info->tickets);
-				space_info->flush = 0;
+				if (wake_all_tickets(&space_info->tickets)) {
+					flush_state =
FLUSH_DELAYED_ITEMS_NR; + commit_cycles--; + } else { + space_info->flush = 0; + } } else { flush_state = FLUSH_DELAYED_ITEMS_NR; } @@ -5153,10 +5161,11 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, struct btrfs_space_info *space_info, - struct reserve_ticket *ticket, u64 orig_bytes) + struct reserve_ticket *ticket) { DEFINE_WAIT(wait); + u64 reclaim_bytes = 0; int ret = 0; spin_lock(&space_info->lock); @@ -5177,14 +5186,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, ret = ticket->error; if (!list_empty(&ticket->list)) list_del_init(&ticket->list); - if (ticket->bytes && ticket->bytes < orig_bytes) { - u64 num_bytes = orig_bytes - ticket->bytes; - update_bytes_may_use(space_info, -num_bytes); - trace_btrfs_space_reservation(fs_info, "space_info", - space_info->flags, num_bytes, 0); - } + if (ticket->bytes && ticket->bytes < ticket->orig_bytes) + reclaim_bytes = ticket->orig_bytes - ticket->bytes; spin_unlock(&space_info->lock); + if (reclaim_bytes) + space_info_add_old_bytes(fs_info, space_info, reclaim_bytes); return ret; } @@ -5210,6 +5217,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, { struct reserve_ticket ticket; u64 used; + u64 reclaim_bytes = 0; int ret = 0; ASSERT(orig_bytes); @@ -5245,6 +5253,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, * the list and we will do our own flushing further down. */ if (ret && flush != BTRFS_RESERVE_NO_FLUSH) { + ticket.orig_bytes = orig_bytes; ticket.bytes = orig_bytes; ticket.error = 0; init_waitqueue_head(&ticket.wait); @@ -5285,25 +5294,21 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, return ret; if (flush == BTRFS_RESERVE_FLUSH_ALL) - return wait_reserve_ticket(fs_info, space_info, &ticket, -
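The new wake_all_tickets() return value can be modeled in a few lines of plain C (the `ticket` struct and helper here are hypothetical, not the kernel's): report whether any ticket saw partial progress, which tells the flusher to loop again rather than failing every waiter with -ENOSPC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* toy ticket: what was originally asked for vs. what is still missing */
struct ticket {
	uint64_t orig_bytes;
	uint64_t bytes;
};

/*
 * Model of the bool returned by the patched wake_all_tickets(): true if
 * any ticket received a partial reservation (bytes shrank from
 * orig_bytes), meaning flushing made progress and is worth retrying.
 */
static bool any_ticket_made_progress(const struct ticket *t, int n)
{
	for (int i = 0; i < n; i++)
		if (t[i].bytes != t[i].orig_bytes)
			return true;
	return false;
}

/* convenience wrapper for a two-ticket scenario */
static bool demo(uint64_t o1, uint64_t b1, uint64_t o2, uint64_t b2)
{
	struct ticket t[2] = { { o1, b1 }, { o2, b2 } };

	return any_ticket_made_progress(t, 2);
}
```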
[PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs
With the delayed_refs_rsv we can now know exactly how much pending delayed refs space we need. This means we can drastically simplify btrfs_check_space_for_delayed_refs by checking how much space we have reserved for the global rsv (which acts as a spill-over buffer) and the delayed refs rsv. If our total size is beyond that amount then we know it's time to commit the transaction and stop generating any more delayed refs.

Signed-off-by: Josef Bacik
---
 fs/btrfs/ctree.h | 2 +-
 fs/btrfs/extent-tree.c | 48 ++--
 fs/btrfs/inode.c | 4 ++--
 fs/btrfs/transaction.c | 2 +-
 4 files changed, 22 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2eba398c722b..30da075c042e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2631,7 +2631,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct btrfs_fs_info *fs_info,
 }
 
 int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans);
-int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans);
+bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
 void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
 				const u64 start);
 void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5a2d0b061f57..07ef1b8087f7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2839,40 +2839,28 @@ u64 btrfs_csum_bytes_to_leaves(struct btrfs_fs_info *fs_info, u64 csum_bytes)
 	return num_csums;
 }
 
-int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans)
+bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info)
 {
-	struct btrfs_fs_info *fs_info = trans->fs_info;
-	struct btrfs_block_rsv *global_rsv;
-	u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
-	u64 csum_bytes = trans->transaction->delayed_refs.pending_csums;
-	unsigned int num_dirty_bgs = trans->transaction->num_dirty_bgs;
-	u64
num_bytes, num_dirty_bgs_bytes;
-	int ret = 0;
+	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
+	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
+	bool ret = false;
+	u64 reserved;
 
-	num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
-	num_heads = heads_to_leaves(fs_info, num_heads);
-	if (num_heads > 1)
-		num_bytes += (num_heads - 1) * fs_info->nodesize;
-	num_bytes <<= 1;
-	num_bytes += btrfs_csum_bytes_to_leaves(fs_info, csum_bytes) *
-		fs_info->nodesize;
-	num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(fs_info,
-							     num_dirty_bgs);
-	global_rsv = &fs_info->global_block_rsv;
+	spin_lock(&global_rsv->lock);
+	reserved = global_rsv->reserved;
+	spin_unlock(&global_rsv->lock);
 
 	/*
-	 * If we can't allocate any more chunks lets make sure we have _lots_ of
-	 * wiggle room since running delayed refs can create more delayed refs.
+	 * Since the global reserve is just kind of magic we don't really want
+	 * to rely on it to save our bacon, so if our size is more than the
+	 * delayed_refs_rsv and the global rsv then it's time to think about
+	 * bailing.
 	 */
-	if (global_rsv->space_info->full) {
-		num_dirty_bgs_bytes <<= 1;
-		num_bytes <<= 1;
-	}
-
-	spin_lock(&global_rsv->lock);
-	if (global_rsv->reserved <= num_bytes + num_dirty_bgs_bytes)
-		ret = 1;
-	spin_unlock(&global_rsv->lock);
+	spin_lock(&delayed_refs_rsv->lock);
+	reserved += delayed_refs_rsv->reserved;
+	if (delayed_refs_rsv->size >= reserved)
+		ret = true;
+	spin_unlock(&delayed_refs_rsv->lock);
 
 	return ret;
 }
 
@@ -2891,7 +2879,7 @@ int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans)
 	if (val >= NSEC_PER_SEC / 2)
 		return 2;
 
-	return btrfs_check_space_for_delayed_refs(trans);
+	return btrfs_check_space_for_delayed_refs(trans->fs_info);
 }
 
 struct async_delayed_refs {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a097f5fde31d..8532a2eb56d1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5326,8 +5326,8 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root,
 		 * Try to steal from the global reserve if there is space for
 		 * it.
 		 */
-		if (!btrfs_check_space_for_delayed_refs(trans) &&
-		    !btrfs_block_rsv_migrate(global_rsv, rsv, rsv->size, false))
+		if (!btrfs_check_space_for_delayed_refs(fs_info) &&
+
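Stripped of the locking, the new check is pure arithmetic. This standalone sketch (invented function name, illustrative byte values) shows the throttle condition: stop generating delayed refs once the rsv's required size outgrows what is actually reserved in it plus the global-reserve spill-over buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the simplified btrfs_check_space_for_delayed_refs(): we know
 * exactly how much space outstanding delayed refs need
 * (delayed_refs_size), so throttle when that need meets or exceeds the
 * space actually reserved for them plus the global reserve buffer.
 */
static bool over_delayed_ref_budget(uint64_t global_reserved,
				    uint64_t delayed_refs_reserved,
				    uint64_t delayed_refs_size)
{
	return delayed_refs_size >= global_reserved + delayed_refs_reserved;
}
```

Compare with the removed code above: instead of estimating from head counts, csum bytes, and dirty block groups, the answer now falls out of the rsv bookkeeping directly.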
[PATCH 10/10] btrfs: fix truncate throttling
We have a bunch of magic to make sure we're throttling delayed refs when truncating a file. Now that we have a delayed refs rsv and a mechanism for refilling that reserve, simply use that instead of all of this magic.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/inode.c | 79
 1 file changed, 17 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8532a2eb56d1..623a71d871d4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4437,31 +4437,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry)
 	return err;
 }
 
-static int truncate_space_check(struct btrfs_trans_handle *trans,
-				struct btrfs_root *root,
-				u64 bytes_deleted)
-{
-	struct btrfs_fs_info *fs_info = root->fs_info;
-	int ret;
-
-	/*
-	 * This is only used to apply pressure to the enospc system, we don't
-	 * intend to use this reservation at all.
-	 */
-	bytes_deleted = btrfs_csum_bytes_to_leaves(fs_info, bytes_deleted);
-	bytes_deleted *= fs_info->nodesize;
-	ret = btrfs_block_rsv_add(root, &fs_info->trans_block_rsv,
-				  bytes_deleted, BTRFS_RESERVE_NO_FLUSH);
-	if (!ret) {
-		trace_btrfs_space_reservation(fs_info, "transaction",
-					      trans->transid,
-					      bytes_deleted, 1);
-		trans->bytes_reserved += bytes_deleted;
-	}
-	return ret;
-
-}
-
 /*
  * Return this if we need to call truncate_block for the last bit of the
  * truncate.
@@ -4506,7 +4481,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, u64 bytes_deleted = 0; bool be_nice = false; bool should_throttle = false; - bool should_end = false; BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY); @@ -4719,15 +4693,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, btrfs_abort_transaction(trans, ret); break; } - if (btrfs_should_throttle_delayed_refs(trans)) - btrfs_async_run_delayed_refs(fs_info, - trans->delayed_ref_updates * 2, - trans->transid, 0); if (be_nice) { - if (truncate_space_check(trans, root, -extent_num_bytes)) { - should_end = true; - } if (btrfs_should_throttle_delayed_refs(trans)) should_throttle = true; } @@ -4738,7 +4704,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, if (path->slots[0] == 0 || path->slots[0] != pending_del_slot || - should_throttle || should_end) { + should_throttle) { if (pending_del_nr) { ret = btrfs_del_items(trans, root, path, pending_del_slot, @@ -4750,23 +4716,24 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, pending_del_nr = 0; } btrfs_release_path(path); - if (should_throttle) { - unsigned long updates = trans->delayed_ref_updates; - if (updates) { - trans->delayed_ref_updates = 0; - ret = btrfs_run_delayed_refs(trans, - updates * 2); - if (ret) - break; - } - } + /* -* if we failed to refill our space rsv, bail out -* and let the transaction restart +* We can generate a lot of delayed refs, so we need to +* throttle every once and a while and make sure we're +* adding enough space to keep up with the work we are +* generating. Since we hold a transaction here we +* can't flush, and we don't want to FLUSH_LIMIT because +* we could have generated too many delayed refs to +* actually allocate, so just bail if we're short and +* let the normal reservation dance happen
[PATCH 09/10] btrfs: don't run delayed refs in the end transaction logic
Over the years we have built up a lot of infrastructure to keep delayed refs in check, mostly by running them at btrfs_end_transaction() time. We have a lot of different math to figure out how many to run, whether we should do it inline or async, etc. This existed because we had no feedback mechanism to force the flushing of delayed refs when they became a problem. Now that the enospc flushing infrastructure is in place to flush delayed refs when they put too much pressure on the enospc system, this problem is solved. Rip out all of this code as it is no longer needed.

Signed-off-by: Josef Bacik
---
 fs/btrfs/transaction.c | 38 --
 1 file changed, 38 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 2d8401bf8df9..01f39401619a 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -798,22 +798,12 @@ static int should_end_transaction(struct btrfs_trans_handle *trans)
 int btrfs_should_end_transaction(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_transaction *cur_trans = trans->transaction;
-	int updates;
-	int err;
 
 	smp_mb();
 	if (cur_trans->state >= TRANS_STATE_BLOCKED ||
 	    cur_trans->delayed_refs.flushing)
 		return 1;
 
-	updates = trans->delayed_ref_updates;
-	trans->delayed_ref_updates = 0;
-	if (updates) {
-		err = btrfs_run_delayed_refs(trans, updates * 2);
-		if (err) /* Error code will also eval true */
-			return err;
-	}
-
 	return should_end_transaction(trans);
 }
 
@@ -843,11 +833,8 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 {
 	struct btrfs_fs_info *info = trans->fs_info;
 	struct btrfs_transaction *cur_trans = trans->transaction;
-	u64 transid = trans->transid;
-	unsigned long cur = trans->delayed_ref_updates;
 	int lock = (trans->type != TRANS_JOIN_NOLOCK);
 	int err = 0;
-	int must_run_delayed_refs = 0;
 
 	if (refcount_read(&trans->use_count) > 1) {
 		refcount_dec(&trans->use_count);
@@ -858,27 +845,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 	btrfs_trans_release_metadata(trans);
 	trans->block_rsv = NULL;
 
-	if (!list_empty(&trans->new_bgs))
-		btrfs_create_pending_block_groups(trans);
-
-	trans->delayed_ref_updates = 0;
-	if (!trans->sync) {
-		must_run_delayed_refs =
-			btrfs_should_throttle_delayed_refs(trans);
-		cur = max_t(unsigned long, cur, 32);
-
-		/*
-		 * don't make the caller wait if they are from a NOLOCK
-		 * or ATTACH transaction, it will deadlock with commit
-		 */
-		if (must_run_delayed_refs == 1 &&
-		    (trans->type & (__TRANS_JOIN_NOLOCK | __TRANS_ATTACH)))
-			must_run_delayed_refs = 2;
-	}
-
-	btrfs_trans_release_metadata(trans);
-	trans->block_rsv = NULL;
-
 	if (!list_empty(&trans->new_bgs))
 		btrfs_create_pending_block_groups(trans);
 
@@ -923,10 +889,6 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
 	}
 
 	kmem_cache_free(btrfs_trans_handle_cachep, trans);
-	if (must_run_delayed_refs) {
-		btrfs_async_run_delayed_refs(info, cur, transid,
-					     must_run_delayed_refs == 1);
-	}
 	return err;
 }
-- 
2.14.3
[PATCH 02/10] btrfs: add cleanup_ref_head_accounting helper
From: Josef Bacik We were missing some quota cleanups in check_ref_cleanup, so break the ref head accounting cleanup into a helper and call that from both check_ref_cleanup and cleanup_ref_head. This will hopefully ensure that we don't screw up accounting in the future for other things that we add. Reviewed-by: Omar Sandoval Reviewed-by: Liu Bo Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 67 +- 1 file changed, 39 insertions(+), 28 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index c36b3a42f2bb..e3ed3507018d 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans, return ret ? ret : 1; } +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, + struct btrfs_delayed_ref_head *head) +{ + struct btrfs_fs_info *fs_info = trans->fs_info; + struct btrfs_delayed_ref_root *delayed_refs = + &trans->transaction->delayed_refs; + + if (head->total_ref_mod < 0) { + struct btrfs_space_info *space_info; + u64 flags; + + if (head->is_data) + flags = BTRFS_BLOCK_GROUP_DATA; + else if (head->is_system) + flags = BTRFS_BLOCK_GROUP_SYSTEM; + else + flags = BTRFS_BLOCK_GROUP_METADATA; + space_info = __find_space_info(fs_info, flags); + ASSERT(space_info); + percpu_counter_add_batch(&space_info->total_bytes_pinned, + -head->num_bytes, + BTRFS_TOTAL_BYTES_PINNED_BATCH); + + if (head->is_data) { + spin_lock(&delayed_refs->lock); + delayed_refs->pending_csums -= head->num_bytes; + spin_unlock(&delayed_refs->lock); + } + } + + /* Also free its reserved qgroup space */ + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, + head->qgroup_reserved); +} + static int cleanup_ref_head(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head *head) { @@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, spin_unlock(&head->lock); spin_unlock(&delayed_refs->lock); - trace_run_delayed_ref_head(fs_info, 
head, 0); - - if (head->total_ref_mod < 0) { - struct btrfs_space_info *space_info; - u64 flags; - - if (head->is_data) - flags = BTRFS_BLOCK_GROUP_DATA; - else if (head->is_system) - flags = BTRFS_BLOCK_GROUP_SYSTEM; - else - flags = BTRFS_BLOCK_GROUP_METADATA; - space_info = __find_space_info(fs_info, flags); - ASSERT(space_info); - percpu_counter_add_batch(&space_info->total_bytes_pinned, - -head->num_bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH); - - if (head->is_data) { - spin_lock(&delayed_refs->lock); - delayed_refs->pending_csums -= head->num_bytes; - spin_unlock(&delayed_refs->lock); - } - } - if (head->must_insert_reserved) { btrfs_pin_extent(fs_info, head->bytenr, head->num_bytes, 1); @@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, } } - /* Also free its reserved qgroup space */ - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, - head->qgroup_reserved); + cleanup_ref_head_accounting(trans, head); + + trace_run_delayed_ref_head(fs_info, head, 0); btrfs_delayed_ref_unlock(head); btrfs_put_delayed_ref_head(head); return 0; @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (head->must_insert_reserved) ret = 1; + cleanup_ref_head_accounting(trans, head); mutex_unlock(&head->mutex); btrfs_put_delayed_ref_head(head); return ret; -- 2.14.3
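The value of this refactor is that both cleanup paths now share one accounting routine, so a future fix lands in both automatically. A toy model of that shape (hypothetical `counters` struct, not the kernel's space_info/percpu machinery):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* toy stand-ins for total_bytes_pinned and pending_csums */
struct counters {
	int64_t pinned;
	int64_t pending_csums;
};

/*
 * Shared helper in the spirit of cleanup_ref_head_accounting(): when a
 * head's net effect was to drop references (total_ref_mod < 0), back
 * out the pinned bytes, and the pending csum bytes for data extents.
 * Both cleanup_ref_head() and check_ref_cleanup() would call this.
 */
static void ref_head_accounting(struct counters *c, int64_t num_bytes,
				int total_ref_mod, bool is_data)
{
	if (total_ref_mod < 0) {
		c->pinned -= num_bytes;
		if (is_data)
			c->pending_csums -= num_bytes;
	}
}

/* helper: pinned counter after cleaning up one metadata head */
static int64_t demo_pinned_after(int64_t start, int64_t nbytes, int mod)
{
	struct counters c = { start, 0 };

	ref_head_accounting(&c, nbytes, mod, false);
	return c.pinned;
}
```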
[PATCH 00/10][V2] Delayed refs rsv
v1->v2:
- addressed the comments from the various reviewers.
- split "introduce delayed_refs_rsv" into 5 patches. The patches are the same together as they were, just split out more logically. They can't really be bisected across in that you will likely have fun enospc failures, but they compile properly. This was done to make review easier.

-- Original message --

This patchset changes how we do space reservations for delayed refs. We were hitting probably 20-40 enospc aborts per day in production while running delayed refs at transaction commit time. This means we ran out of space in the global reserve and couldn't easily get more space in use_block_rsv().

The global reserve has grown to cover everything we don't reserve space explicitly for, and we've grown a lot of weird ad-hoc heuristics to know if we're running short on space and when it's time to force a commit. A failure rate of 20-40 file systems when we run hundreds of thousands of them isn't super high, but cleaning up this code will make things less ugly and more predictable.

Thus the delayed refs rsv. We always know how many delayed refs we have outstanding, and although running them generates more, we can use the global reserve for that spill-over, which fits its desired use better than a full-blown reservation. This first approach simply takes how many times we're reserving space and multiplies that by 2, in order to save enough space for the delayed refs that could be generated. This is a naive approach and will probably evolve, but for now it works.

With this patchset we've gone down to 2-8 failures per week. It's not perfect, there are some corner cases that still need to be addressed, but it is significantly better than what we had.

Thanks,

Josef
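The "multiply by 2" sizing heuristic described in the cover letter can be sketched as follows (hypothetical helper and byte counts; in the kernel the per-operation size comes from helpers like btrfs_calc_trans_metadata_size(), not a flat constant):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Naive first-pass sizing from the cover letter: for every delayed ref
 * we are reserving space for, reserve twice the per-operation metadata
 * cost, since running one delayed ref can generate another.
 */
static uint64_t delayed_refs_rsv_size(uint64_t nr_refs, uint64_t op_bytes)
{
	return nr_refs * 2 * op_bytes;
}
```

As the cover letter says, this over-reserves on purpose; the global reserve then only has to absorb the spill-over rather than the whole workload.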
[PATCH 06/10] btrfs: update may_commit_transaction to use the delayed refs rsv
Any space used in the delayed_refs_rsv will be freed up by a transaction commit, so instead of just counting the pinned space we also need to account for any space in the delayed_refs_rsv when deciding whether it will make a difference to commit the transaction to satisfy our space reservation. If we have enough bytes to satisfy our reservation ticket then we are good to go; otherwise subtract out what space we would gain back by committing the transaction and compare that against the pinned space to make our decision.

Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index aa0a638d0263..63ff9d832867 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4843,8 +4843,10 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 {
 	struct reserve_ticket *ticket = NULL;
 	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_block_rsv;
+	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
 	struct btrfs_trans_handle *trans;
-	u64 bytes;
+	u64 bytes_needed;
+	u64 reclaim_bytes = 0;
 
 	trans = (struct btrfs_trans_handle *)current->journal_info;
 	if (trans)
@@ -4857,15 +4859,15 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	else if (!list_empty(&space_info->tickets))
 		ticket = list_first_entry(&space_info->tickets,
 					  struct reserve_ticket, list);
-	bytes = (ticket) ? ticket->bytes : 0;
+	bytes_needed = (ticket) ?
ticket->bytes : 0; spin_unlock(&space_info->lock); - if (!bytes) + if (!bytes_needed) return 0; /* See if there is enough pinned space to make this reservation */ if (__percpu_counter_compare(&space_info->total_bytes_pinned, - bytes, + bytes_needed, BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0) goto commit; @@ -4877,14 +4879,18 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, return -ENOSPC; spin_lock(&delayed_rsv->lock); - if (delayed_rsv->size > bytes) - bytes = 0; - else - bytes -= delayed_rsv->size; + reclaim_bytes += delayed_rsv->reserved; spin_unlock(&delayed_rsv->lock); + spin_lock(&delayed_refs_rsv->lock); + reclaim_bytes += delayed_refs_rsv->reserved; + spin_unlock(&delayed_refs_rsv->lock); + if (reclaim_bytes >= bytes_needed) + goto commit; + bytes_needed -= reclaim_bytes; + if (__percpu_counter_compare(&space_info->total_bytes_pinned, - bytes, + bytes_needed, BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) { return -ENOSPC; } -- 2.14.3
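The updated decision logic reduces to a standalone predicate (invented name and illustrative numbers; the kernel additionally checks a percpu pinned counter under batching):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the patched may_commit_transaction() decision: commit if
 * pinned space alone covers the ticket, or if the space a commit would
 * free (delayed rsv + delayed refs rsv) covers it, or if pinned space
 * covers whatever remains after subtracting that reclaimable space.
 */
static bool worth_committing(uint64_t bytes_needed, uint64_t pinned,
			     uint64_t delayed_rsv_reserved,
			     uint64_t delayed_refs_rsv_reserved)
{
	uint64_t reclaim;

	if (bytes_needed == 0)		/* nobody is waiting */
		return false;
	if (pinned >= bytes_needed)
		return true;

	reclaim = delayed_rsv_reserved + delayed_refs_rsv_reserved;
	if (reclaim >= bytes_needed)
		return true;

	return pinned >= bytes_needed - reclaim;
}
```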
[PATCH 05/10] btrfs: introduce delayed_refs_rsv
From: Josef Bacik

Traditionally we've had voodoo in btrfs to account for the space that delayed refs may take up by having a global_block_rsv. This works most of the time, except when it doesn't. We've had issues reported and seen in production where sometimes the global reserve is exhausted during transaction commit before we can run all of our delayed refs, resulting in an aborted transaction. Because of this voodoo we have equally dubious flushing semantics around throttling delayed refs, which we often get wrong.

So instead give them their own block_rsv. This way we can always know exactly how much outstanding space we need for delayed refs. This allows us to make sure we are constantly filling that reservation up with space, and allows us to put more precise pressure on the enospc system. Instead of doing math to see if it's a good time to throttle, the normal enospc code will be invoked if we have a lot of delayed refs pending, and they will be run via the normal flushing mechanism.

For now the delayed_refs_rsv will hold the reservations for the delayed refs, the block group updates, and deleting csums. We could have a separate rsv for the block group updates, but the csum deletion stuff is still handled via the delayed_refs so that will stay there.
Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 14 +++- fs/btrfs/delayed-ref.c | 43 -- fs/btrfs/disk-io.c | 4 + fs/btrfs/extent-tree.c | 212 + fs/btrfs/transaction.c | 37 - 5 files changed, 284 insertions(+), 26 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 8b41ec42f405..52a87d446945 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -448,8 +448,9 @@ struct btrfs_space_info { #defineBTRFS_BLOCK_RSV_TRANS 3 #defineBTRFS_BLOCK_RSV_CHUNK 4 #defineBTRFS_BLOCK_RSV_DELOPS 5 -#defineBTRFS_BLOCK_RSV_EMPTY 6 -#defineBTRFS_BLOCK_RSV_TEMP7 +#define BTRFS_BLOCK_RSV_DELREFS6 +#defineBTRFS_BLOCK_RSV_EMPTY 7 +#defineBTRFS_BLOCK_RSV_TEMP8 struct btrfs_block_rsv { u64 size; @@ -812,6 +813,8 @@ struct btrfs_fs_info { struct btrfs_block_rsv chunk_block_rsv; /* block reservation for delayed operations */ struct btrfs_block_rsv delayed_block_rsv; + /* block reservation for delayed refs */ + struct btrfs_block_rsv delayed_refs_rsv; struct btrfs_block_rsv empty_block_rsv; @@ -2796,6 +2799,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info, void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *block_rsv, u64 num_bytes); +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr); +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans); +int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info, + enum btrfs_reserve_flush_enum flush); +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info, + struct btrfs_block_rsv *src, + u64 num_bytes); int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache); void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache); void btrfs_put_block_group_cache(struct btrfs_fs_info *info); diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 48725fa757a3..a198ab91879c 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -474,11 +474,14 @@ static int insert_delayed_ref(struct btrfs_trans_handle 
*trans, * existing and update must have the same bytenr */ static noinline void -update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs, +update_existing_head_ref(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head *existing, struct btrfs_delayed_ref_head *update, int *old_ref_mod_ret) { + struct btrfs_delayed_ref_root *delayed_refs = + &trans->transaction->delayed_refs; + struct btrfs_fs_info *fs_info = trans->fs_info; int old_ref_mod; BUG_ON(existing->is_data != update->is_data); @@ -536,10 +539,18 @@ update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs, * versa we need to make sure to adjust pending_csums accordingly. */ if (existing->is_data) { - if (existing->total_ref_mod >= 0 && old_ref_mod < 0) + u64 csum_leaves = + btrfs_csum_bytes_to_leaves(fs_info, + existing->num_bytes); + + if (existing->total_ref_mod >= 0 && old_ref_mod < 0) { delayed_refs->pending_csums -= existing->num_bytes; -
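A minimal userspace model of the new reserve's bookkeeping (hypothetical struct and helpers; the real rsv also interacts with the space_info, qgroups, and the flushing code): `size` tracks outstanding delayed refs exactly, and running a head returns any reservation that is now excess.

```c
#include <assert.h>
#include <stdint.h>

/* toy block_rsv: how much we need (size) vs. how much we hold (reserved) */
struct block_rsv {
	uint64_t size;
	uint64_t reserved;
};

/* queueing a delayed ref grows the requirement */
static void track_delayed_ref(struct block_rsv *rsv, uint64_t bytes)
{
	rsv->size += bytes;
}

/*
 * Running a ref head shrinks the requirement; if we now hold more than
 * we need, hand the excess back (to the space_info, in the kernel).
 */
static uint64_t release_ref_head(struct block_rsv *rsv, uint64_t bytes)
{
	uint64_t freed = 0;

	rsv->size -= bytes;
	if (rsv->reserved > rsv->size) {
		freed = rsv->reserved - rsv->size;
		rsv->reserved = rsv->size;
	}
	return freed;
}

/* helper: bytes freed for a given starting state */
static uint64_t demo_release(uint64_t size, uint64_t reserved, uint64_t bytes)
{
	struct block_rsv r = { size, reserved };

	return release_ref_head(&r, bytes);
}
```

Because `size` is exact, the enospc code can refill `reserved` toward it under normal flushing pressure instead of guessing from global-reserve headroom.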
[PATCH 04/10] btrfs: only track ref_heads in delayed_ref_updates
From: Josef Bacik

We use this number to figure out how many delayed refs to run, but __btrfs_run_delayed_refs really only checks it each time we need a new delayed ref head, so we always run at least one ref head completely no matter the number of items on it. Fix the accounting to only be adjusted when we add/remove a ref head.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/delayed-ref.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index b3e4c9fcb664..48725fa757a3 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
 	ref->in_tree = 0;
 	btrfs_put_delayed_ref(ref);
 	atomic_dec(&delayed_refs->num_entries);
-	if (trans->delayed_ref_updates)
-		trans->delayed_ref_updates--;
 }
 
 static bool merge_ref(struct btrfs_trans_handle *trans,
@@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
 	if (ref->action == BTRFS_ADD_DELAYED_REF)
 		list_add_tail(&ref->add_list, &href->ref_add_list);
 	atomic_inc(&root->num_entries);
-	trans->delayed_ref_updates++;
 	spin_unlock(&href->lock);
 	return ret;
 }
-- 
2.14.3
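The accounting fix can be illustrated standalone (toy `txn` struct, invented helpers): only adding or removing a ref head touches the counter, so a head carrying many individual refs still counts once.

```c
#include <assert.h>

/* toy transaction handle with just the counter we care about */
struct txn {
	int delayed_ref_updates;
};

/* heads adjust the count... */
static void add_ref_head(struct txn *t) { t->delayed_ref_updates++; }
static void del_ref_head(struct txn *t) { t->delayed_ref_updates--; }

/* ...individual refs no longer do, matching the patch */
static void add_ref(struct txn *t) { (void)t; }

/* counter after queueing `heads` heads with `refs_per_head` refs each */
static int demo_updates(int heads, int refs_per_head)
{
	struct txn t = { 0 };

	for (int h = 0; h < heads; h++) {
		add_ref_head(&t);
		for (int r = 0; r < refs_per_head; r++)
			add_ref(&t);
	}
	return t.delayed_ref_updates;
}
```

This matches how __btrfs_run_delayed_refs actually consumes the number: it picks work head by head, so counting per-ref overstated the remaining work.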
[PATCH 03/10] btrfs: cleanup extent_op handling
From: Josef Bacik

The cleanup_extent_op function actually would run the extent_op if it
needed running, which made the name sort of a misnomer. Change it to
run_and_cleanup_extent_op, and move the actual cleanup work to
cleanup_extent_op so it can be used by check_ref_cleanup() in order to
unify the extent op handling.

Reviewed-by: Lu Fengqi
Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e3ed3507018d..9f169f3c5fdb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2424,19 +2424,32 @@ static void unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_ref
 	btrfs_delayed_ref_unlock(head);
 }
 
-static int cleanup_extent_op(struct btrfs_trans_handle *trans,
-			     struct btrfs_delayed_ref_head *head)
+static struct btrfs_delayed_extent_op *cleanup_extent_op(
+				struct btrfs_delayed_ref_head *head)
 {
 	struct btrfs_delayed_extent_op *extent_op = head->extent_op;
-	int ret;
 
 	if (!extent_op)
-		return 0;
-	head->extent_op = NULL;
+		return NULL;
+
 	if (head->must_insert_reserved) {
+		head->extent_op = NULL;
 		btrfs_free_delayed_extent_op(extent_op);
-		return 0;
+		return NULL;
 	}
+	return extent_op;
+}
+
+static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans,
+				     struct btrfs_delayed_ref_head *head)
+{
+	struct btrfs_delayed_extent_op *extent_op;
+	int ret;
+
+	extent_op = cleanup_extent_op(head);
+	if (!extent_op)
+		return 0;
+	head->extent_op = NULL;
 	spin_unlock(&head->lock);
 	ret = run_delayed_extent_op(trans, head, extent_op);
 	btrfs_free_delayed_extent_op(extent_op);
@@ -2488,7 +2501,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 
 	delayed_refs = &trans->transaction->delayed_refs;
 
-	ret = cleanup_extent_op(trans, head);
+	ret = run_and_cleanup_extent_op(trans, head);
 	if (ret < 0) {
 		unselect_delayed_ref_head(delayed_refs, head);
 		btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
@@ -6977,12 +6990,8 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
 	if (!RB_EMPTY_ROOT(&head->ref_tree.rb_root))
 		goto out;
 
-	if (head->extent_op) {
-		if (!head->must_insert_reserved)
-			goto out;
-		btrfs_free_delayed_extent_op(head->extent_op);
-		head->extent_op = NULL;
-	}
+	if (cleanup_extent_op(head) != NULL)
+		goto out;
 
 	/*
	 * waiting for the lock here would deadlock. If someone else has it
-- 
2.14.3
[PATCH 01/10] btrfs: add btrfs_delete_ref_head helper
From: Josef Bacik

We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
into a helper and cleanup the calling functions.

Signed-off-by: Josef Bacik
Reviewed-by: Omar Sandoval
---
 fs/btrfs/delayed-ref.c | 14 ++++++++++++++
 fs/btrfs/delayed-ref.h |  3 ++-
 fs/btrfs/extent-tree.c | 22 +++-------------------
 3 files changed, 19 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 9301b3ad9217..b3e4c9fcb664 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
 	return head;
 }
 
+void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+			   struct btrfs_delayed_ref_head *head)
+{
+	lockdep_assert_held(&delayed_refs->lock);
+	lockdep_assert_held(&head->lock);
+
+	rb_erase_cached(&head->href_node, &delayed_refs->href_root);
+	RB_CLEAR_NODE(&head->href_node);
+	atomic_dec(&delayed_refs->num_entries);
+	delayed_refs->num_heads--;
+	if (head->processing == 0)
+		delayed_refs->num_heads_ready--;
+}
+
 /*
  * Helper to insert the ref_node to the tail or merge with tail.
  *
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 8e20c5cb5404..d2af974f68a1 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head)
 {
 	mutex_unlock(&head->mutex);
 }
-
+void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+			   struct btrfs_delayed_ref_head *head);
 struct btrfs_delayed_ref_head *btrfs_select_ref_head(
 		struct btrfs_delayed_ref_root *delayed_refs);
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d242a1174e50..c36b3a42f2bb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
 		spin_unlock(&delayed_refs->lock);
 		return 1;
 	}
-	delayed_refs->num_heads--;
-	rb_erase_cached(&head->href_node, &delayed_refs->href_root);
-	RB_CLEAR_NODE(&head->href_node);
+	btrfs_delete_ref_head(delayed_refs, head);
 	spin_unlock(&head->lock);
 	spin_unlock(&delayed_refs->lock);
-	atomic_dec(&delayed_refs->num_entries);
 
 	trace_run_delayed_ref_head(fs_info, head, 0);
 
@@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
 	if (!mutex_trylock(&head->mutex))
 		goto out;
 
-	/*
-	 * at this point we have a head with no other entries. Go
-	 * ahead and process it.
-	 */
-	rb_erase_cached(&head->href_node, &delayed_refs->href_root);
-	RB_CLEAR_NODE(&head->href_node);
-	atomic_dec(&delayed_refs->num_entries);
-
-	/*
-	 * we don't take a ref on the node because we're removing it from the
-	 * tree, so we just steal the ref the tree was holding.
-	 */
-	delayed_refs->num_heads--;
-	if (head->processing == 0)
-		delayed_refs->num_heads_ready--;
+	btrfs_delete_ref_head(delayed_refs, head);
 	head->processing = 0;
+
 	spin_unlock(&head->lock);
 	spin_unlock(&delayed_refs->lock);
-- 
2.14.3
[PATCH 07/10] btrfs: add new flushing states for the delayed refs rsv
A nice thing we gain with the delayed refs rsv is the ability to flush
the delayed refs on demand to deal with enospc pressure. Add states to
flush delayed refs on demand, and this will allow us to remove a lot of
ad-hoc work around checking to see if we should commit the transaction
to run our delayed refs.

Signed-off-by: Josef Bacik
---
 fs/btrfs/ctree.h             | 10 ++++++----
 fs/btrfs/extent-tree.c       | 14 ++++++++++++++
 include/trace/events/btrfs.h |  2 ++
 3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 52a87d446945..2eba398c722b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2745,10 +2745,12 @@ enum btrfs_reserve_flush_enum {
 enum btrfs_flush_state {
 	FLUSH_DELAYED_ITEMS_NR	= 1,
 	FLUSH_DELAYED_ITEMS	= 2,
-	FLUSH_DELALLOC		= 3,
-	FLUSH_DELALLOC_WAIT	= 4,
-	ALLOC_CHUNK		= 5,
-	COMMIT_TRANS		= 6,
+	FLUSH_DELAYED_REFS_NR	= 3,
+	FLUSH_DELAYED_REFS	= 4,
+	FLUSH_DELALLOC		= 5,
+	FLUSH_DELALLOC_WAIT	= 6,
+	ALLOC_CHUNK		= 7,
+	COMMIT_TRANS		= 8,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 63ff9d832867..5a2d0b061f57 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4938,6 +4938,20 @@ static void flush_space(struct btrfs_fs_info *fs_info,
 		shrink_delalloc(fs_info, num_bytes * 2, num_bytes,
 				state == FLUSH_DELALLOC_WAIT);
 		break;
+	case FLUSH_DELAYED_REFS_NR:
+	case FLUSH_DELAYED_REFS:
+		trans = btrfs_join_transaction(root);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			break;
+		}
+		if (state == FLUSH_DELAYED_REFS_NR)
+			nr = calc_reclaim_items_nr(fs_info, num_bytes);
+		else
+			nr = 0;
+		btrfs_run_delayed_refs(trans, nr);
+		btrfs_end_transaction(trans);
+		break;
 	case ALLOC_CHUNK:
 		trans = btrfs_join_transaction(root);
 		if (IS_ERR(trans)) {
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 8568946f491d..63d1f9d8b8c7 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1048,6 +1048,8 @@ TRACE_EVENT(btrfs_trigger_flush,
 		{ FLUSH_DELAYED_ITEMS,		"FLUSH_DELAYED_ITEMS"},		\
 		{ FLUSH_DELALLOC,		"FLUSH_DELALLOC"},		\
 		{ FLUSH_DELALLOC_WAIT,		"FLUSH_DELALLOC_WAIT"},		\
+		{ FLUSH_DELAYED_REFS_NR,	"FLUSH_DELAYED_REFS_NR"},	\
+		{ FLUSH_DELAYED_REFS,		"FLUSH_DELAYED_REFS"},		\
 		{ ALLOC_CHUNK,			"ALLOC_CHUNK"},			\
 		{ COMMIT_TRANS,			"COMMIT_TRANS"})
-- 
2.14.3

[Editor's note: the original posting spelled the second tracepoint string "FLUSH_ELAYED_REFS"; corrected above.]
Re: [RFC PATCH] btrfs: Remove __extent_readpages
On 3.12.18, 12:25, Nikolay Borisov wrote:
> When extent_readpages is called from the generic readahead code it first
> builds a batch of 16 pages (which might or might not be consecutive,
> depending on whether add_to_page_cache_lru failed) and submits them to
> __extent_readpages. The latter ensures that the range of pages (in the
> batch of 16) that is passed to __do_contiguous_readpages is consecutive.
>
> If add_to_page_cache_lru doesn't fail then __extent_readpages will call
> __do_contiguous_readpages only once with the whole batch of 16.
> Alternatively, if add_to_page_cache_lru fails once on the 8th page (as an
> example) then the contiguous page read code will be called twice.
>
> All of this can be simplified by exploiting the fact that all pages passed
> to extent_readpages are consecutive, thus when the batch is built in
> that function it is already consecutive (barring add_to_page_cache_lru
> failures), so it is ready to be submitted directly to
> __do_contiguous_readpages. Also simplify the name of the function to
> contiguous_readpages.
>
> Signed-off-by: Nikolay Borisov
> ---
>
> So this patch looks like a very nice cleanup; however, when doing performance
> measurements with fio I was shocked to see that it is actually detrimental to
> performance. Here are the results:
>
> The command line used for fio:
> fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
> --numjobs=1 --size=1G --runtime=600 --group_reporting --loop 10
>
> This was tested on a VM with 4G of RAM, so the size of the test is smaller
> than the memory and pages should have been nicely readahead.
>
> PATCHED:
>
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=519MiB/s][r=133k IOPS][eta 00m:00s]
> /media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=3722: Mon Dec 3 09:57:17 2018
>   read: IOPS=78.4k, BW=306MiB/s (321MB/s)(10.0GiB/33444msec)
>     clat (nsec): min=1703, max=9042.7k, avg=5463.97, stdev=121068.28
>     lat (usec): min=2, max=9043, avg= 6.00, stdev=121.07
>     clat percentiles (nsec):
>      |  1.00th=[   1848],  5.00th=[   1896], 10.00th=[   1912],
>      | 20.00th=[   1960], 30.00th=[   2024], 40.00th=[   2160],
>      | 50.00th=[   2384], 60.00th=[   2576], 70.00th=[   2800],
>      | 80.00th=[   3120], 90.00th=[   3824], 95.00th=[   4768],
>      | 99.00th=[   7968], 99.50th=[  14912], 99.90th=[  50944],
>      | 99.95th=[ 667648], 99.99th=[5931008]
>    bw (  KiB/s): min= 2768, max=544542, per=100.00%, avg=409912.68, stdev=162333.72, samples=50
>    iops        : min=  692, max=136135, avg=102478.08, stdev=40583.47, samples=50
>   lat (usec)   : 2=25.93%, 4=65.58%, 10=7.69%, 20=0.57%, 50=0.13%
>   lat (usec)   : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
>   lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%
>   cpu          : usr=7.20%, sys=92.55%, ctx=396, majf=0, minf=9
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>    READ: bw=306MiB/s (321MB/s), 306MiB/s-306MiB/s (321MB/s-321MB/s), io=10.0GiB (10.7GB), run=33444-33444msec
>
>
> UNPATCHED:
>
> Starting 1 process
> Jobs: 1 (f=1): [R(1)][100.0%][r=568MiB/s][r=145k IOPS][eta 00m:00s]
> /media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=640: Mon Dec 3 10:07:38 2018
>   read: IOPS=90.4k, BW=353MiB/s (370MB/s)(10.0GiB/29008msec)
>     clat (nsec): min=1418, max=12374k, avg=4816.38, stdev=109448.00
>     lat (nsec): min=1836, max=12374k, avg=5284.46, stdev=109451.36
>     clat percentiles (nsec):
>      |  1.00th=[   1576],  5.00th=[   1608], 10.00th=[   1640],
>      | 20.00th=[   1672], 30.00th=[   1720], 40.00th=[   1832],
>      | 50.00th=[   2096], 60.00th=[   2288], 70.00th=[   2480],
>      | 80.00th=[   2736], 90.00th=[   3248], 95.00th=[   3952],
>      | 99.00th=[   6368], 99.50th=[  12736], 99.90th=[  43776],
>      | 99.95th=[ 798720], 99.99th=[5341184]
>    bw (  KiB/s): min=34144, max=606208, per=100.00%, avg=465737.56, stdev=177637.57, samples=45
>    iops        : min= 8536, max=151552, avg=116434.33, stdev=44409.46, samples=45
>   lat (usec)   : 2=45.74%, 4=49.50%, 10=4.13%, 20=0.45%, 50=0.08%
>   lat (usec)   : 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
>   lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%, 20=0.01%
>   cpu          : usr=7.14%, sys=92.39%, ctx=1059, majf=0, minf=9
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=2621440,0,0,0
Re: Filesystem Corruption
On 2018/12/3 5:31 PM, Stefan Malte Schumacher wrote:
> Hello,
>
> I have noticed an unusual amount of crc-errors in downloaded rars,
> beginning about a week ago. But let's start with the preliminaries. I
> am using Debian Stretch.
> Kernel: Linux mars 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u4
> (2018-08-21) x86_64 GNU/Linux
> BTRFS-Tools btrfs-progs 4.7.3-1
> Smartctl shows no errors for any of the drives in the filesystem.
>
> Btrfs /dev/stats shows zero errors, but dmesg gives me a lot of
> filesystem related error messages.
>
> [5390748.884929] Buffer I/O error on dev dm-0, logical block
> 976701312, async page read
> This error is shown many times in the log.

There is no "btrfs:" prefix, so it looks more like an error message from
the block layer; no wonder btrfs shows no errors at all.

What is the underlying device mapper? And furthermore, is there any
kernel message with "btrfs" (case-insensitive) in it?

Thanks,
Qu

> This seems to affect just newly written files. This is the output of
> btrfs scrub status:
> scrub status for 1609e4e1-4037-4d31-bf12-f84a691db5d8
> scrub started at Tue Nov 27 06:02:04 2018 and finished after 07:34:16
> total bytes scrubbed: 17.29TiB with 0 errors
>
> What is the probable cause of these errors? How can I fix this?
>
> Thanks in advance for your advice
> Stefan
Re: [PATCH v2 2/3] btrfs: use offset_in_page for start_offset in map_private_extent_buffer()
On 28/11/2018 16:41, David Sterba wrote:
> On Wed, Nov 28, 2018 at 09:54:55AM +0100, Johannes Thumshirn wrote:
>> In map_private_extent_buffer() use offset_in_page() to initialize
>> 'start_offset' instead of open-coding it.
>
> Can you please fix all instances where it's opencoded? Grepping for
> 'PAGE_SIZE - 1' finds a number of them. Thanks.

Sure, will do.
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumsh...@suse.de                            +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
[RFC PATCH] btrfs: Remove __extent_readpages
When extent_readpages is called from the generic readahead code it first
builds a batch of 16 pages (which might or might not be consecutive,
depending on whether add_to_page_cache_lru failed) and submits them to
__extent_readpages. The latter ensures that the range of pages (in the
batch of 16) that is passed to __do_contiguous_readpages is consecutive.

If add_to_page_cache_lru doesn't fail then __extent_readpages will call
__do_contiguous_readpages only once with the whole batch of 16.
Alternatively, if add_to_page_cache_lru fails once on the 8th page (as an
example) then the contiguous page read code will be called twice.

All of this can be simplified by exploiting the fact that all pages passed
to extent_readpages are consecutive, thus when the batch is built in that
function it is already consecutive (barring add_to_page_cache_lru
failures), so it is ready to be submitted directly to
__do_contiguous_readpages. Also simplify the name of the function to
contiguous_readpages.

Signed-off-by: Nikolay Borisov
---

So this patch looks like a very nice cleanup; however, when doing
performance measurements with fio I was shocked to see that it is
actually detrimental to performance. Here are the results:

The command line used for fio:
fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
--numjobs=1 --size=1G --runtime=600 --group_reporting --loop 10

This was tested on a VM with 4G of RAM, so the size of the test is
smaller than the memory and pages should have been nicely readahead.
PATCHED:

Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=519MiB/s][r=133k IOPS][eta 00m:00s]
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=3722: Mon Dec 3 09:57:17 2018
  read: IOPS=78.4k, BW=306MiB/s (321MB/s)(10.0GiB/33444msec)
    clat (nsec): min=1703, max=9042.7k, avg=5463.97, stdev=121068.28
    lat (usec): min=2, max=9043, avg= 6.00, stdev=121.07
    clat percentiles (nsec):
     |  1.00th=[   1848],  5.00th=[   1896], 10.00th=[   1912],
     | 20.00th=[   1960], 30.00th=[   2024], 40.00th=[   2160],
     | 50.00th=[   2384], 60.00th=[   2576], 70.00th=[   2800],
     | 80.00th=[   3120], 90.00th=[   3824], 95.00th=[   4768],
     | 99.00th=[   7968], 99.50th=[  14912], 99.90th=[  50944],
     | 99.95th=[ 667648], 99.99th=[5931008]
   bw (  KiB/s): min= 2768, max=544542, per=100.00%, avg=409912.68, stdev=162333.72, samples=50
   iops        : min=  692, max=136135, avg=102478.08, stdev=40583.47, samples=50
  lat (usec)   : 2=25.93%, 4=65.58%, 10=7.69%, 20=0.57%, 50=0.13%
  lat (usec)   : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%
  cpu          : usr=7.20%, sys=92.55%, ctx=396, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=306MiB/s (321MB/s), 306MiB/s-306MiB/s (321MB/s-321MB/s), io=10.0GiB (10.7GB), run=33444-33444msec


UNPATCHED:

Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=568MiB/s][r=145k IOPS][eta 00m:00s]
/media/scratch/seqread: (groupid=0, jobs=1): err= 0: pid=640: Mon Dec 3 10:07:38 2018
  read: IOPS=90.4k, BW=353MiB/s (370MB/s)(10.0GiB/29008msec)
    clat (nsec): min=1418, max=12374k, avg=4816.38, stdev=109448.00
    lat (nsec): min=1836, max=12374k, avg=5284.46, stdev=109451.36
    clat percentiles (nsec):
     |  1.00th=[   1576],  5.00th=[   1608], 10.00th=[   1640],
     | 20.00th=[   1672], 30.00th=[   1720], 40.00th=[   1832],
     | 50.00th=[   2096], 60.00th=[   2288], 70.00th=[   2480],
     | 80.00th=[   2736], 90.00th=[   3248], 95.00th=[   3952],
     | 99.00th=[   6368], 99.50th=[  12736], 99.90th=[  43776],
     | 99.95th=[ 798720], 99.99th=[5341184]
   bw (  KiB/s): min=34144, max=606208, per=100.00%, avg=465737.56, stdev=177637.57, samples=45
   iops        : min= 8536, max=151552, avg=116434.33, stdev=44409.46, samples=45
  lat (usec)   : 2=45.74%, 4=49.50%, 10=4.13%, 20=0.45%, 50=0.08%
  lat (usec)   : 100=0.03%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.05%, 20=0.01%
  cpu          : usr=7.14%, sys=92.39%, ctx=1059, majf=0, minf=9
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=2621440,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=353MiB/s (370MB/s), 353MiB/s-353MiB/s (370MB/s-370MB/s), io=10.0GiB (10.7GB), run=29008-29008msec

Clearly both ban
Re: Possible deadlock when writing
I've been running into what I believe is the same issue ever since
upgrading to 4.19:

[28950.083040] BTRFS error (device dm-0): bad tree block start, want 1815648960512 have 0
[28950.083047] BTRFS: error (device dm-0) in __btrfs_free_extent:6804: errno=-5 IO failure
[28950.083048] BTRFS info (device dm-0): forced readonly
[28950.083050] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2935: errno=-5 IO failure
[28950.083866] BTRFS error (device dm-0): pending csums is 9564160
[29040.413973] TaskSchedulerFo[17189]: segfault at 0 ip 56121a2cb73b sp 7f1cca425b80 error 4 in chrome[561218101000+6513000]

This has been happening consistently to me on two laptops and a
workstation, all running Arch Linux. They are all different hardware; the
only things in common are that they have SSD/NVMe storage and they all
use btrfs. I initially thought it had something to do with the
fstrim.timer unit kicking off an fstrim run that was somehow causing
contention with btrfs.

As luck would have it, the btrfs filesystem on one laptop just remounted
read-only. Physical memory was not entirely used up at the time (I would
guess ~45% of physical was in use), and I believe the rest of available
memory was being utilized by the VFS buffer cache. I'm not 100% sure of
the actual utilization, but after reading the email from mbakiev@ I did
make a mental note before initiating a required reboot.

I came across this comment from Ubuntu's bugtracker:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/159356/comments/62

The author of post #62 notes that this particular behavior happens when
they are running several instances of Chrome. I don't know if the bug
filed there is related at all, but an interesting note is that I also
almost always happen to be interacting with Google Chrome when the
read-only remount happens.
Here is the last entry from journald before I rebooted:

Dec 03 00:00:39 tenforward kernel: BTRFS error (device dm-3): bad tree block start, want 761659392 have 15159222128734632161

Here are the only sysctl changes I made that would be relevant:

vm.swappiness = 10
vm.overcommit_memory = 1
vm.oom_kill_allocating_task = 1
vm.panic_on_oom = 1

Hope I didn't miss anything, thanks!

On Sat, Dec 1, 2018 at 6:21 PM Martin Bakiev wrote:
>
> I was having the same issue with kernels 4.19.2 and 4.19.4. I don’t appear
> to have the issue with 4.20.0-0.rc1 on Fedora Server 29.
>
> The issue is very easy to reproduce on my setup; not sure how much of it is
> actually relevant, but here it is:
>
> - 3 drive RAID5 created
> - Some data moved to it
> - Expanded to 7 drives
> - No balancing
>
> The issue is easily reproduced (within 30 mins) by starting multiple
> transfers to the volume (several TB in the form of many 30GB+ files).
> Multiple concurrent ‘rsync’ transfers seem to take a bit longer to trigger
> the issue, but multiple ‘cp’ commands will do it much quicker (again, not
> sure if relevant).
>
> I have not seen the issue occur with a single ‘rsync’ or ‘cp’ transfer, but
> I haven’t left one running alone for too long (I'm copying the data from
> multiple drives, so there is a lot to be gained from parallelizing the
> transfers).
>
> I’m not sure what state the FS is left in after a Magic SysRq reboot when it
> deadlocks, but seemingly it’s fine. There are no problems mounting, and
> ‘btrfs check’ passes OK. I’m sure some of the data doesn’t get flushed, but
> that's no problem for my use case.
>
> I’ve been running concurrent transfers with kernel 4.20.0-0.rc1 for 24 hours
> nonstop and I haven’t experienced the issue.
>
> Hope this helps.
Filesystem Corruption
Hello,

I have noticed an unusual amount of crc-errors in downloaded rars,
beginning about a week ago. But let's start with the preliminaries. I am
using Debian Stretch.

Kernel: Linux mars 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u4 (2018-08-21) x86_64 GNU/Linux
BTRFS-Tools: btrfs-progs 4.7.3-1

Smartctl shows no errors for any of the drives in the filesystem.
btrfs dev stats shows zero errors, but dmesg gives me a lot of
filesystem-related error messages:

[5390748.884929] Buffer I/O error on dev dm-0, logical block 976701312, async page read

This error is shown many times in the log. It seems to affect just newly
written files. This is the output of btrfs scrub status:

scrub status for 1609e4e1-4037-4d31-bf12-f84a691db5d8
	scrub started at Tue Nov 27 06:02:04 2018 and finished after 07:34:16
	total bytes scrubbed: 17.29TiB with 0 errors

What is the probable cause of these errors? How can I fix this?

Thanks in advance for your advice
Stefan