[PATCH v5] test unaligned punch hole at ENOSPC
Try to punch holes with unaligned size and offset when the FS is full.
The holes are punched at locations which are unaligned with the file
extent boundaries, on a FS that is full of data. Since punching a hole
at an unaligned location involves truncating blocks, instead of just
dropping the extents, it requires reserving data and metadata space for
delalloc, and the data allocation fails as the FS is full:

  btrfs_punch_hole()
    btrfs_truncate_block()
      btrfs_check_data_free_space() <-- ENOSPC

We don't fail to punch holes that are aligned with the file extent
boundaries, as that involves just dropping the related extents.

Signed-off-by: Anand Jain
---
v4->v5: Update the change log
        Drop the directio option for xfs_io
v3->v4: Add to the group punch
v2->v3: Add _require_xfs_io_command "fpunch"
        Add more logs to $seqres.full
          mount options and
          group profile info
        Add sync after dd upto ENOSPC
        Drop fallocate -p and use xfs_io punch to create holes
v1->v2: Use at least 256MB to test.

This test case fails on btrfs as of now.

 tests/btrfs/172     | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/172.out |  2 ++
 tests/btrfs/group   |  1 +
 3 files changed, 77 insertions(+)
 create mode 100755 tests/btrfs/172
 create mode 100644 tests/btrfs/172.out

diff --git a/tests/btrfs/172 b/tests/btrfs/172
new file mode 100755
index ..1ecf01d862a2
--- /dev/null
+++ b/tests/btrfs/172
@@ -0,0 +1,74 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 Oracle. All Rights Reserved.
+#
+# FS QA Test 172
+#
+# Test if the unaligned (by size and offset) punch hole is successful when FS
+# is at ENOSPC.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_xfs_io_command "fpunch"
+
+_scratch_mkfs_sized $((256 * 1024 * 1024)) >> $seqres.full
+
+# max_inline ensures data is not inlined within metadata extents
+_scratch_mount "-o max_inline=0,nodatacow"
+
+cat /proc/self/mounts | grep $SCRATCH_DEV >> $seqres.full
+$BTRFS_UTIL_PROG filesystem df $SCRATCH_MNT >> $seqres.full
+
+extent_size=$(_scratch_btrfs_sectorsize)
+unalign_by=512
+echo extent_size=$extent_size unalign_by=$unalign_by >> $seqres.full
+
+$XFS_IO_PROG -f -c "pwrite -S 0xab 0 $((extent_size * 10))" \
+					$SCRATCH_MNT/testfile >> $seqres.full
+
+echo "Fill fs upto ENOSPC" >> $seqres.full
+dd status=none if=/dev/zero of=$SCRATCH_MNT/filler bs=512 >> $seqres.full 2>&1
+sync
+
+hole_offset=0
+hole_len=$unalign_by
+$XFS_IO_PROG -f -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
+
+hole_offset=$(($extent_size + $unalign_by))
+hole_len=$(($extent_size - $unalign_by))
+$XFS_IO_PROG -f -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
+
+hole_offset=$(($extent_size * 2 + $unalign_by))
+hole_len=$(($extent_size * 5))
+$XFS_IO_PROG -f -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
+
+# success, all done
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/btrfs/172.out b/tests/btrfs/172.out
new file mode 100644
index ..ce2de3f0d107
--- /dev/null
+++ b/tests/btrfs/172.out
@@ -0,0 +1,2 @@
+QA output created by 172
+Silence is golden
diff --git a/tests/btrfs/group b/tests/btrfs/group
index feffc45b6564..45782565c3b7 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -174,3 +174,4 @@
 169 auto quick send
 170 auto quick snapshot
 171 auto quick qgroup
+172 auto quick punch
-- 
1.8.3.1
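[Editor's note] To see the failure outside the fstests harness, the essential
steps of the test can be condensed as below. This is a sketch, not part of the
patch: the device path /dev/sdb1 and the 4096 byte sectorsize are assumptions,
and on an unpatched kernel the unaligned fpunch is the call expected to fail
with ENOSPC.

  # a ~256MB btrfs, with inline extents and COW disabled as in the test
  mkfs.btrfs -f -b $((256 * 1024 * 1024)) /dev/sdb1
  mount -o max_inline=0,nodatacow /dev/sdb1 /mnt

  # ten sectorsize-aligned extents, then fill all remaining data space
  xfs_io -f -c "pwrite -S 0xab 0 $((4096 * 10))" /mnt/testfile
  dd if=/dev/zero of=/mnt/filler bs=512 2>/dev/null
  sync

  # unaligned hole: goes through btrfs_truncate_block() and needs a data
  # reservation, so it hits ENOSPC on the full filesystem
  xfs_io -c "fpunch 0 512" /mnt/testfile

  # extent-aligned hole: just drops whole extents, works even at ENOSPC
  xfs_io -c "fpunch $((4096 * 7)) $((4096 * 2))" /mnt/testfile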
Re: [PATCH v3] test unaligned punch hole at ENOSPC
Thanks for the comments more below..

On 09/29/2018 01:12 AM, Filipe Manana wrote:
> On Fri, Sep 28, 2018 at 6:08 PM Filipe Manana wrote:
> > On Fri, Sep 28, 2018 at 3:51 PM Anand Jain wrote:
> > >
> > > Try to punch hole with unaligned size and offset when the FS
> > > returns ENOSPC
> >
> > The FS returns ENOSPC is confusing. It's more clear to say when the
> > filesystem doesn't have more space available for data allocation.

 Will fix.

> > > [... quoted v3 patch trimmed ...]
> > >
> > > +# max_inline helps to create regular extent
> >
> > max_inline ensures data is not inlined within metadata extents
> >
> > > +_scratch_mount "-o max_inline=0,nodatacow"
> > > [...]
> > > +$XFS_IO_PROG -f -d -c "pwrite -S 0xab 0 $((extent_size * 10))" \
> > > +					$SCRATCH_MNT/testfile >> $seqres.full
>
> Also missing _require_odirect.
> Why is direct IO needed? If not needed (which I don't see why), it can
> be avoided.

 You caught me direct is not required, will drop it.

Thanks, Anand

> > > +echo "Fill fs upto ENOSPC" >> $seqres.full
> >
> > Fill all space available for data and all unallocated space.
> >
> > > +dd status=none if=/dev/zero of=$SCRATCH_MNT/filler bs=512 >> $seqres.full 2>&1
> >
> > Why do you use dd here and not xfs_io?
> >
> > > +sync
> >
> > Why is the sync needed?
> >
> > > +hole_offset=0
> > > +hole_len=$unalign_by
> > > +$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
> >
> > No need to pass -f anymore. No need for -d either.
> >
> > > +hole_offset=$(($extent_size + $unalign_by))
> > > +hole_len=$(($extent_size - $unalign_by))
> > > +$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
> >
> > No need to pass -f anymore. No need for -d either.
> >
> > > +hole_offset=$(($extent_size * 2 + $unalign_by))
> > > +hole_len=$(($extent_size * 5))
> > > +$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
> >
> > No need to pass -f anymore. No need for -d either.
> >
> > > [... remainder of the quoted v3 patch trimmed ...]
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”
Re: DUP dev_extent might overlap something next to it
On 09/29/2018 01:30 AM, Hans van Kranenburg wrote:
> [...]
>
> I didn't try filling it up and see what happens yet. Also, this can
> probably be done with a DUP chunk, but it's a bit harder to quickly prove.

DUP metadata chunk ^^

-- 
Hans van Kranenburg
Re: DUP dev_extent might overlap something next to it
On 09/25/2018 02:05 AM, Hans van Kranenburg wrote:
> (I'm using v4.19-rc5 code here.)
>
> Imagine allocating a DATA|DUP chunk.
>
> [blub, see previous message]

Steps to reproduce DUP chunk beyond end of device:

First create a 6302M block device and fill it up:

  mkdir bork
  cd bork
  dd if=/dev/zero of=image bs=1 count=0 seek=6302M
  mkfs.btrfs -d dup -m dup image
  losetup -f image
  mkdir mountpoint
  mount -o space_cache=v2 /dev/loop0 mountpoint
  cp -a /usr mountpoint

After a while, this starts throwing: No space left on device

Now we extend the size of the image to 7880MiB so that the next dup data
chunk allocation will exactly try to use the 1578MiB free raw disk space
to trigger the bug:

  dd if=/dev/zero of=image bs=1 count=0 seek=7880M
  losetup -c /dev/loop0
  btrfs fi resize max mountpoint/

Now we trigger the new DUP data chunk allocation:

  cp /vmlinuz mountpoint/

Now I have a dev extent starting at 7446986752 on devid 1, with length
838860800. This means it ends at 8285847552, which is 7902MiB, exactly
22MiB beyond the size of the device.

The only nice thing about this is that df shows me I still have more than
8 exabyte of space in this image file.

  Label: none  uuid: e711eea6-5332-44cf-9704-998a7a939970
          Total devices 1 FS bytes used 2.81GiB
          devid    1 size 7.70GiB used 7.72GiB path /dev/loop0

I didn't try filling it up and see what happens yet. Also, this can
probably be done with a DUP chunk, but it's a bit harder to quickly prove.
And making this happen in the middle of a block device instead of at the
end is also a bit harder.

-- 
Hans van Kranenburg
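[Editor's note] The overlap arithmetic quoted above can be double-checked
directly from the numbers (a quick sketch):

  echo $((7446986752 + 838860800))               # dev extent end: 8285847552
  echo $((8285847552 / 1048576))                 # = 7902 (MiB)
  echo $((7880 * 1048576))                       # device size: 8262778880
  echo $(((8285847552 - 8262778880) / 1048576))  # overshoot: 22 (MiB)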
Re: python-btrfs v10 preview... detailed usage reporting and a tutorial
On 09/24/2018 10:08 AM, Nikolay Borisov wrote:
>>
>> The bugs are all related to repeated kernel code all over the place
>> containing a lot of if statements dealing with different kind of
>> allocation profiles and their exceptions. What I ended up doing is
>> making a few helper functions instead, see the commit "Add volumes.py,
>> handling device / chunk logic". It would probably be nice to do the same
>> in the kernel code, which would also solve the mentioned bugs and
>> prevent new similar ones from happening.
>
> Would you care to report each bug separately so they can be triaged and
> fixed?

In case of the RAID10 5GiB thing I think I was mixing up things. When
doing mkfs you end up with a RAID10 chunk of 5GiB (dunno why, didn't
research), when mounting and pointing balance at it, I get a 10GiB one
back for it, so that's ok.

For the DUP thing, I sent an explanation ("DUP dev_extent might overlap
something next to it"), which doesn't seem to attract much attention yet.
I'm preparing a pile of patches to volumes.[ch] to fix this, clean up
things that I ran into and make the logic a bit less convoluted.

-- 
Hans van Kranenburg
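[Editor's note] "Pointing balance at it" above would be roughly the following
invocation (a sketch; the mount point and the use of the profiles filter are
assumptions, not Hans's exact command):

  # rewrite only the RAID10 data chunks so they are re-allocated,
  # this time at the expected chunk size
  btrfs balance start -dprofiles=raid10 /mnt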
Re: [PATCH v3] test unaligned punch hole at ENOSPC
On Fri, Sep 28, 2018 at 6:08 PM Filipe Manana wrote:
>
> On Fri, Sep 28, 2018 at 3:51 PM Anand Jain wrote:
> >
> > Try to punch hole with unaligned size and offset when the FS
> > returns ENOSPC
>
> The FS returns ENOSPC is confusing. It's more clear to say when the
> filesystem doesn't have more space available for data allocation.
>
> > [... quoted v3 patch trimmed, see the original posting ...]
> >
> > +# max_inline helps to create regular extent
>
> max_inline ensures data is not inlined within metadata extents
>
> > +_scratch_mount "-o max_inline=0,nodatacow"
> > [...]
> > +$XFS_IO_PROG -f -d -c "pwrite -S 0xab 0 $((extent_size * 10))" \
> > +					$SCRATCH_MNT/testfile >> $seqres.full

Also missing _require_odirect.
Why is direct IO needed? If not needed (which I don't see why), it can
be avoided.

> > +echo "Fill fs upto ENOSPC" >> $seqres.full
>
> Fill all space available for data and all unallocated space.
>
> > +dd status=none if=/dev/zero of=$SCRATCH_MNT/filler bs=512 >> $seqres.full 2>&1
>
> Why do you use dd here and not xfs_io?
>
> > +sync
>
> Why is the sync needed?
>
> > +hole_offset=0
> > +hole_len=$unalign_by
> > +$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
>
> No need to pass -f anymore. No need for -d either.
>
> > +hole_offset=$(($extent_size + $unalign_by))
> > +hole_len=$(($extent_size - $unalign_by))
> > +$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
>
> No need to pass -f anymore. No need for -d either.
>
> > +hole_offset=$(($extent_size * 2 + $unalign_by))
> > +hole_len=$(($extent_size * 5))
> > +$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
>
> No need to pass -f anymore. No need for -d either.
>
> > [... remainder of the quoted v3 patch trimmed ...]
>
> --
> Filipe David Manana,
>
> “Whether you think you can, or you think you can't — you're right.”

-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
Re: [PATCH v3] test unaligned punch hole at ENOSPC
On Fri, Sep 28, 2018 at 3:51 PM Anand Jain wrote:
>
> Try to punch hole with unaligned size and offset when the FS
> returns ENOSPC

The FS returns ENOSPC is confusing. It's more clear to say when the
filesystem doesn't have more space available for data allocation.

> [... quoted v3 patch trimmed, see the original posting ...]
>
> +# max_inline helps to create regular extent

max_inline ensures data is not inlined within metadata extents

> +_scratch_mount "-o max_inline=0,nodatacow"
> [...]
> +echo "Fill fs upto ENOSPC" >> $seqres.full

Fill all space available for data and all unallocated space.

> +dd status=none if=/dev/zero of=$SCRATCH_MNT/filler bs=512 >> $seqres.full 2>&1

Why do you use dd here and not xfs_io?

> +sync

Why is the sync needed?

> +hole_offset=0
> +hole_len=$unalign_by
> +$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile

No need to pass -f anymore. No need for -d either.

> +hole_offset=$(($extent_size + $unalign_by))
> +hole_len=$(($extent_size - $unalign_by))
> +$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile

No need to pass -f anymore. No need for -d either.

> +hole_offset=$(($extent_size * 2 + $unalign_by))
> +hole_len=$(($extent_size * 5))
> +$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile

No need to pass -f anymore. No need for -d either.

> [... remainder of the quoted v3 patch trimmed ...]

-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
[PATCH v4] test unaligned punch hole at ENOSPC
Try to punch hole with unaligned size and offset when the FS
returns ENOSPC

Signed-off-by: Anand Jain
---
v3->v4: add to the group punch
v2->v3: add _require_xfs_io_command "fpunch"
        add more logs to $seqres.full
          mount options and
          group profile info
        add sync after dd upto ENOSPC
        drop fallocate -p and use xfs_io punch to create holes
v1->v2: Use at least 256MB to test.

This test case fails on btrfs as of now.

 tests/btrfs/172     | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/172.out |  2 ++
 tests/btrfs/group   |  1 +
 3 files changed, 77 insertions(+)
 create mode 100755 tests/btrfs/172
 create mode 100644 tests/btrfs/172.out

diff --git a/tests/btrfs/172 b/tests/btrfs/172
new file mode 100755
index ..59413a5de12f
--- /dev/null
+++ b/tests/btrfs/172
@@ -0,0 +1,74 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 Oracle. All Rights Reserved.
+#
+# FS QA Test 172
+#
+# Test if the unaligned (by size and offset) punch hole is successful when FS
+# is at ENOSPC.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_xfs_io_command "fpunch"
+
+_scratch_mkfs_sized $((256 * 1024 *1024)) >> $seqres.full
+
+# max_inline helps to create regular extent
+_scratch_mount "-o max_inline=0,nodatacow"
+
+cat /proc/self/mounts | grep $SCRATCH_DEV >> $seqres.full
+$BTRFS_UTIL_PROG filesystem df $SCRATCH_MNT >> $seqres.full
+
+extent_size=$(_scratch_btrfs_sectorsize)
+unalign_by=512
+echo extent_size=$extent_size unalign_by=$unalign_by >> $seqres.full
+
+$XFS_IO_PROG -f -d -c "pwrite -S 0xab 0 $((extent_size * 10))" \
+					$SCRATCH_MNT/testfile >> $seqres.full
+
+echo "Fill fs upto ENOSPC" >> $seqres.full
+dd status=none if=/dev/zero of=$SCRATCH_MNT/filler bs=512 >> $seqres.full 2>&1
+sync
+
+hole_offset=0
+hole_len=$unalign_by
+$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
+
+hole_offset=$(($extent_size + $unalign_by))
+hole_len=$(($extent_size - $unalign_by))
+$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
+
+hole_offset=$(($extent_size * 2 + $unalign_by))
+hole_len=$(($extent_size * 5))
+$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
+
+# success, all done
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/btrfs/172.out b/tests/btrfs/172.out
new file mode 100644
index ..ce2de3f0d107
--- /dev/null
+++ b/tests/btrfs/172.out
@@ -0,0 +1,2 @@
+QA output created by 172
+Silence is golden
diff --git a/tests/btrfs/group b/tests/btrfs/group
index feffc45b6564..45782565c3b7 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -174,3 +174,4 @@
 169 auto quick send
 170 auto quick snapshot
 171 auto quick qgroup
+172 auto quick punch
-- 
1.8.3.1
Re: [PATCH] test unaligned punch hole at ENOSPC
Oops I just realized I sent v2 only to linux-btrfs@vger.kernel.org.
more below..

On 09/28/2018 08:42 PM, Eryu Guan wrote:
> On Mon, Sep 24, 2018 at 07:47:39PM +0800, Anand Jain wrote:
> > Try to punch hole with unaligned size and offset when the FS
> > returns ENOSPC
> >
> > [... quoted v1 patch trimmed ...]
> >
> > +hole_offset=0
> > +hole_len=$unalign_by
> > +run_check fallocate -p -o $hole_offset -l $hole_len $SCRATCH_MNT/filler
>
> Please don't introduce new run_check/_run_btrfs_util_prog users, just
> redirect output to /dev/null if the outputs don't matter. Please refer
> to this thread
>
> https://www.spinics.net/lists/linux-btrfs/msg80996.html

 Fixed in v3.

> And use xfs_io fpunch command instead of bare 'fallocate -p', and check
> xfs_io and kernel support on fpunch by calling
>
> _require_xfs_io_command "fpunch"

 Fixed in v3.

> > +hole_offset=$(($extent_size + $unalign_by))
> > +hole_len=$(($extent_size - $unalign_by))
> > +run_check fallocate -p -o $hole_offset -l $hole_len $SCRATCH_MNT/filler
> >
> > +hole_offset=$(($extent_size * 2 + $unalign_by))
> > +hole_len=$(($extent_size * 5))
> > +run_check fallocate -p -o $hole_offset -l $hole_len $SCRATCH_MNT/filler
> >
> > [...]
> > diff --git a/tests/btrfs/group b/tests/btrfs/group
> > index feffc45b6564..7e1a638ab7e1 100644
> > --- a/tests/btrfs/group
> > +++ b/tests/btrfs/group
> > @@ -174,3 +174,4 @@
> >  169 auto quick send
> >  170 auto quick snapshot
> >  171 auto quick qgroup
> > +172 auto quick
>
> Add 'punch' group too.

 Ah there is punch group.. I was searching with the key word 'hole'.
 Will fix in v4.

Thanks, Anand

> Thanks,
> Eryu
>
> > --
> > 1.8.3.1
[PATCH v3] test unaligned punch hole at ENOSPC
Try to punch hole with unaligned size and offset when the FS
returns ENOSPC

Signed-off-by: Anand Jain
---
v2->v3: add _require_xfs_io_command "fpunch"
        add more logs to $seqres.full
          mount options and
          group profile info
        add sync after dd upto ENOSPC
        drop fallocate -p and use xfs_io punch to create holes
v1->v2: Use at least 256MB to test.

This test case fails on btrfs as of now.

 tests/btrfs/172     | 74 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/172.out |  2 ++
 tests/btrfs/group   |  1 +
 3 files changed, 77 insertions(+)
 create mode 100755 tests/btrfs/172
 create mode 100644 tests/btrfs/172.out

diff --git a/tests/btrfs/172 b/tests/btrfs/172
new file mode 100755
index ..59413a5de12f
--- /dev/null
+++ b/tests/btrfs/172
@@ -0,0 +1,74 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 Oracle. All Rights Reserved.
+#
+# FS QA Test 172
+#
+# Test if the unaligned (by size and offset) punch hole is successful when FS
+# is at ENOSPC.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_xfs_io_command "fpunch"
+
+_scratch_mkfs_sized $((256 * 1024 *1024)) >> $seqres.full
+
+# max_inline helps to create regular extent
+_scratch_mount "-o max_inline=0,nodatacow"
+
+cat /proc/self/mounts | grep $SCRATCH_DEV >> $seqres.full
+$BTRFS_UTIL_PROG filesystem df $SCRATCH_MNT >> $seqres.full
+
+extent_size=$(_scratch_btrfs_sectorsize)
+unalign_by=512
+echo extent_size=$extent_size unalign_by=$unalign_by >> $seqres.full
+
+$XFS_IO_PROG -f -d -c "pwrite -S 0xab 0 $((extent_size * 10))" \
+					$SCRATCH_MNT/testfile >> $seqres.full
+
+echo "Fill fs upto ENOSPC" >> $seqres.full
+dd status=none if=/dev/zero of=$SCRATCH_MNT/filler bs=512 >> $seqres.full 2>&1
+sync
+
+hole_offset=0
+hole_len=$unalign_by
+$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
+
+hole_offset=$(($extent_size + $unalign_by))
+hole_len=$(($extent_size - $unalign_by))
+$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
+
+hole_offset=$(($extent_size * 2 + $unalign_by))
+hole_len=$(($extent_size * 5))
+$XFS_IO_PROG -f -d -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/testfile
+
+# success, all done
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/btrfs/172.out b/tests/btrfs/172.out
new file mode 100644
index ..ce2de3f0d107
--- /dev/null
+++ b/tests/btrfs/172.out
@@ -0,0 +1,2 @@
+QA output created by 172
+Silence is golden
diff --git a/tests/btrfs/group b/tests/btrfs/group
index feffc45b6564..7e1a638ab7e1 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -174,3 +174,4 @@
 169 auto quick send
 170 auto quick snapshot
 171 auto quick qgroup
+172 auto quick
-- 
1.8.3.1
Re: [PATCH v2 1/9] fstests: btrfs: _scratch_mkfs_sized fix min size without mixed option
On 09/28/2018 04:07 AM, Omar Sandoval wrote:
> On Wed, Sep 26, 2018 at 09:34:27AM +0300, Nikolay Borisov wrote:
> > On 26.09.2018 07:07, Anand Jain wrote:
> > > On 09/25/2018 06:51 PM, Nikolay Borisov wrote:
> > > > On 25.09.2018 07:24, Anand Jain wrote:
> > > > > As of now _scratch_mkfs_sized() checks if the requested size is
> > > > > below 1G and forces the --mixed option for the mkfs.btrfs. Well the
> > > > > correct size considering all possible group profiles at which we
> > > > > need to force the mixed option is roughly 256Mbytes. So fix that.
> > > > >
> > > > > Signed-off-by: Anand Jain
> > > >
> > > > Have you considered the implications of this w.r.t commit d4da414a9a9d
> > > > ("common/rc: raise btrfs mixed mode threshold to 1GB")
> > > > Initially this threshold was 100mb then Omar changed it to 1g. Does
> > > > this change affect generic/427?
> > >
> > > d4da414a9a9d does not explain what was the problem that Omar wanted to
> > > address, mainly what was the failure about.
> >
> > I just retested on upstream 4.19.0-rc3 with Omar's patch reverted (so
> > anything above 100m for fs size is created with non-mixed block groups)
> > and the test succeeded. So indeed your change seems to not make a
> > difference for this test.

 And no it does not affect. I have verified generic/427 with kernel 4.1
 and 4.19-rc5 with btrfs-progs 4.1, 4.9 and latest from kdave they all
 run fine. Good to integrate.

> I had to double check, but it only happens with -m dup. If I apply the
> following patch:
>
> diff --git a/common/rc b/common/rc
> index d5bb1fe..989b846 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -969,7 +969,7 @@ _scratch_mkfs_sized()
>  	;;
>  btrfs)
>  	local mixed_opt=
> -	(( fssize <= 1024 * 1024 * 1024 )) && mixed_opt='--mixed'
> +	(( fssize <= 100 * 1024 * 1024 )) && mixed_opt='--mixed'
>  	$MKFS_BTRFS_PROG $MKFS_OPTIONS $mixed_opt -b $fssize $SCRATCH_DEV
>  	;;
>  jfs)
> diff --git a/tests/generic/427 b/tests/generic/427
> index e8ebffe..206cf08 100755
> --- a/tests/generic/427
> +++ b/tests/generic/427
> @@ -65,6 +65,7 @@ fi
>  # start a background aio writer, which does several extending loops
>  # internally and check data integrality
>  $AIO_TEST -s $fsize -b 65536 $SCRATCH_MNT/tst-aio-dio-eof-race.$seq
> +btrfs fi usage $SCRATCH_MNT
>  status=$?
>  kill $open_close_pid
>
> And run with MKFS_OPTIONS="-m dup", then we don't have enough data space
> for the test:
>
> --- /root/linux/xfstests/tests/generic/427.out	2017-11-28 16:05:46.811435644 -0800
> +++ /root/linux/xfstests/results/generic/427.out.bad	2018-09-27 13:01:00.540510385 -0700
> @@ -1,2 +1,24 @@
>  QA output created by 427
> -Success, all done.
> +pwrite: No space left on device

 Thanks Omar. Unfortunately I can't reproduce with the diff as above +
 MKFS_OPTIONS="-m dup".

 In any case the objective of this patch is to ensure _scratch_mkfs_sized()
 provides the default group profile with the minimum disk size that's
 actually required. And related to that there isn't any issue in this
 patch.

Thanks, Anand

> +Overall:
> +    Device size:		256.00MiB
> +    Device allocated:		255.00MiB
> +    Device unallocated:		1.00MiB
> +    Device missing:		0.00B
> +    Used:			179.03MiB
> +    Free (estimated):		0.00B	(min: 0.00B)
> +    Data ratio:			1.00
> +    Metadata ratio:		2.00
> +    Global reserve:		16.00MiB	(used: 0.00B)
> +
> +Data,single: Size:175.00MiB, Used:175.00MiB
> +   /dev/nvme0n1p2	175.00MiB
> +
> +Metadata,DUP: Size:32.00MiB, Used:2.00MiB
> +   /dev/nvme0n1p2	64.00MiB
> +
> +System,DUP: Size:8.00MiB, Used:16.00KiB
> +   /dev/nvme0n1p2	16.00MiB
> +
> +Unallocated:
> +   /dev/nvme0n1p2	1.00MiB
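[Editor's note] To make the boundary concrete, here is what the fixed helper
would do on either side of the proposed ~256MB cutoff (a sketch based on the
_scratch_mkfs_sized() code quoted above; exact mkfs output elided):

  _scratch_mkfs_sized $((256 * 1024 * 1024))  # runs: mkfs.btrfs --mixed -b 268435456 $SCRATCH_DEV
  _scratch_mkfs_sized $((512 * 1024 * 1024))  # runs: mkfs.btrfs -b 536870912 $SCRATCH_DEV
                                              # i.e. separate data/metadata block groups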
Re: [PATCH] test unaligned punch hole at ENOSPC
On Mon, Sep 24, 2018 at 07:47:39PM +0800, Anand Jain wrote:
> Try to punch hole with unaligned size and offset when the FS
> returns ENOSPC
>
> Signed-off-by: Anand Jain
> ---
> This test case fails on btrfs as of now.
>
>  tests/btrfs/172     | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/btrfs/172.out |  2 ++
>  tests/btrfs/group   |  1 +
>  3 files changed, 69 insertions(+)
>  create mode 100755 tests/btrfs/172
>  create mode 100644 tests/btrfs/172.out
>
> diff --git a/tests/btrfs/172 b/tests/btrfs/172
> new file mode 100755
> index ..9c32a173f912
> --- /dev/null
> +++ b/tests/btrfs/172
> @@ -0,0 +1,66 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2018 Oracle. All Rights Reserved.
> +#
> +# FS QA Test 172
> +#
> +# Test if the unaligned (by size and offset) punch hole is successful when FS
> +# is at ENOSPC.
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	cd /
> +	rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch
> +
> +_scratch_mkfs_sized $((200 * 1024 *1024)) >> $seqres.full
> +
> +# max_inline helps to create regular extent
> +_scratch_mount "-o max_inline=0,nodatacow"
> +
> +echo "Fill fs upto ENOSPC" >> $seqres.full
> +dd status=none if=/dev/zero of=$SCRATCH_MNT/filler bs=512 >> $seqres.full 2>&1
> +
> +extent_size=$(_scratch_btrfs_sectorsize)
> +unalign_by=512
> +echo extent_size=$extent_size unalign_by=$unalign_by >> $seqres.full
> +
> +hole_offset=0
> +hole_len=$unalign_by
> +run_check fallocate -p -o $hole_offset -l $hole_len $SCRATCH_MNT/filler

Please don't introduce new run_check/_run_btrfs_util_prog users, just
redirect output to /dev/null if the outputs don't matter. Please refer
to this thread

https://www.spinics.net/lists/linux-btrfs/msg80996.html

And use xfs_io fpunch command instead of bare 'fallocate -p', and check
xfs_io and kernel support on fpunch by calling

_require_xfs_io_command "fpunch"

> +
> +hole_offset=$(($extent_size + $unalign_by))
> +hole_len=$(($extent_size - $unalign_by))
> +run_check fallocate -p -o $hole_offset -l $hole_len $SCRATCH_MNT/filler
> +
> +hole_offset=$(($extent_size * 2 + $unalign_by))
> +hole_len=$(($extent_size * 5))
> +run_check fallocate -p -o $hole_offset -l $hole_len $SCRATCH_MNT/filler
> +
> +# success, all done
> +echo "Silence is golden"
> +status=0
> +exit
> diff --git a/tests/btrfs/172.out b/tests/btrfs/172.out
> new file mode 100644
> index ..ce2de3f0d107
> --- /dev/null
> +++ b/tests/btrfs/172.out
> @@ -0,0 +1,2 @@
> +QA output created by 172
> +Silence is golden
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index feffc45b6564..7e1a638ab7e1 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -174,3 +174,4 @@
>  169 auto quick send
>  170 auto quick snapshot
>  171 auto quick qgroup
> +172 auto quick

Add 'punch' group too.

Thanks,
Eryu

> --
> 1.8.3.1
>
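[Editor's note] Taken together, the two suggestions amount to replacing each
run_check fallocate call with the pattern below (a sketch; this is essentially
what v3 of the patch adopted):

  # declare the dependency once, next to the other _require_* calls
  _require_xfs_io_command "fpunch"

  # before: run_check fallocate -p -o $hole_offset -l $hole_len $SCRATCH_MNT/filler
  # after: a failure shows up as unexpected output against the golden .out file
  $XFS_IO_PROG -c "fpunch $hole_offset $hole_len" $SCRATCH_MNT/filler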
Re: [PATCH 06/42] btrfs: introduce delayed_refs_rsv
On Fri, Sep 28, 2018 at 02:51:10PM +0300, Nikolay Borisov wrote:
>
> On 28.09.2018 14:17, Josef Bacik wrote:
> > [...]
Re: [PATCH 06/42] btrfs: introduce delayed_refs_rsv
On 28.09.2018 14:17, Josef Bacik wrote:
> From: Josef Bacik
>
> Traditionally we've had voodoo in btrfs to account for the space that
> delayed refs may take up by having a global_block_rsv. This works most
> of the time, except when it doesn't. We've had issues reported and seen
> in production where sometimes the global reserve is exhausted during
> transaction commit before we can run all of our delayed refs, resulting
> in an aborted transaction. Because of this voodoo we have equally
> dubious flushing semantics around throttling delayed refs which we
> often get wrong.
>
> So instead give them their own block_rsv. This way we can always know
> exactly how much outstanding space we need for delayed refs. This
> allows us to make sure we are constantly filling that reservation up
> with space, and allows us to put more precise pressure on the enospc
> system. Instead of doing math to see if it's a good time to throttle,
> the normal enospc code will be invoked if we have a lot of delayed refs
> pending, and they will be run via the normal flushing mechanism.
>
> For now the delayed_refs_rsv will hold the reservations for the delayed
> refs, the block group updates, and deleting csums. We could have a
> separate rsv for the block group updates, but the csum deletion stuff
> is still handled via the delayed_refs so that will stay there.
>
> Signed-off-by: Josef Bacik
> ---
>  fs/btrfs/ctree.h             |  27 +++---
>  fs/btrfs/delayed-ref.c       |  28 +++-
>  fs/btrfs/disk-io.c           |   4 +
>  fs/btrfs/extent-tree.c       | 279 +++++++++++++++++++++++++++------------
>  fs/btrfs/inode.c             |   2 +-
>  fs/btrfs/transaction.c       |  77 ++++++++----
>  include/trace/events/btrfs.h |   2 +
>  7 files changed, 312 insertions(+), 107 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 66f1d3895bca..1a2c3b629af2 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -452,8 +452,9 @@ struct btrfs_space_info {
>  #define	BTRFS_BLOCK_RSV_TRANS		3
>  #define	BTRFS_BLOCK_RSV_CHUNK		4
>  #define	BTRFS_BLOCK_RSV_DELOPS		5
> -#define	BTRFS_BLOCK_RSV_EMPTY		6
> -#define	BTRFS_BLOCK_RSV_TEMP		7
> +#define	BTRFS_BLOCK_RSV_DELREFS		6
> +#define	BTRFS_BLOCK_RSV_EMPTY		7
> +#define	BTRFS_BLOCK_RSV_TEMP		8
>
>  struct btrfs_block_rsv {
>  	u64 size;
> @@ -794,6 +795,8 @@ struct btrfs_fs_info {
>  	struct btrfs_block_rsv chunk_block_rsv;
>  	/* block reservation for delayed operations */
>  	struct btrfs_block_rsv delayed_block_rsv;
> +	/* block reservation for delayed refs */
> +	struct btrfs_block_rsv delayed_refs_rsv;
>
>  	struct btrfs_block_rsv empty_block_rsv;
>
> @@ -2608,8 +2611,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct btrfs_fs_info *fs_info,
>
>  int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans,
>  				       struct btrfs_fs_info *fs_info);
> -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans,
> -				       struct btrfs_fs_info *fs_info);
> +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
>  void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
>  					const u64 start);
>  void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg);
> @@ -2723,10 +2725,12 @@ enum btrfs_reserve_flush_enum {
>  enum btrfs_flush_state {
>  	FLUSH_DELAYED_ITEMS_NR	= 1,
>  	FLUSH_DELAYED_ITEMS	= 2,
> -	FLUSH_DELALLOC		= 3,
> -	FLUSH_DELALLOC_WAIT	= 4,
> -	ALLOC_CHUNK		= 5,
> -	COMMIT_TRANS		= 6,
> +	FLUSH_DELAYED_REFS_NR	= 3,
> +	FLUSH_DELAYED_REFS	= 4,
> +	FLUSH_DELALLOC		= 5,
> +	FLUSH_DELALLOC_WAIT	= 6,
> +	ALLOC_CHUNK		= 7,
> +	COMMIT_TRANS		= 8,
>  };
>
>  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
> @@ -2777,6 +2781,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info,
>  void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
>  			     struct btrfs_block_rsv *block_rsv,
>  			     u64 num_bytes);
> +void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr);
> +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans);
> +int btrfs_throttle_delayed_refs(struct btrfs_fs_info *fs_info,
> +				enum btrfs_reserve_flush_enum flush);
> +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> +				       struct btrfs_block_rsv *src,
> +				       u64 num_bytes);
>  int btrfs_inc_block_group_ro(struct btrfs_block_group_cach
[PATCH 25/42] btrfs: pass delayed_refs_root to btrfs_delayed_ref_lock
We don't need the trans except to get the delayed_refs_root, so just
pass the delayed_refs_root into btrfs_delayed_ref_lock and call it a
day.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/delayed-ref.c | 5 +----
 fs/btrfs/delayed-ref.h | 2 +-
 fs/btrfs/extent-tree.c | 2 +-
 3 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 96ce087747b2..87778645bf4a 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -197,12 +197,9 @@ find_ref_head(struct rb_root *root, u64 bytenr,
 	return NULL;
 }
 
-int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
+int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs,
 			   struct btrfs_delayed_ref_head *head)
 {
-	struct btrfs_delayed_ref_root *delayed_refs;
-
-	delayed_refs = &trans->transaction->delayed_refs;
 	lockdep_assert_held(&delayed_refs->lock);
 	if (mutex_trylock(&head->mutex))
 		return 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 7769177b489e..ee636d7a710a 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -255,7 +255,7 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
 struct btrfs_delayed_ref_head *
 btrfs_find_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
 			    u64 bytenr);
-int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
+int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs,
 			   struct btrfs_delayed_ref_head *head);
 static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head)
 {
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 01bfb02101c1..34105bc5eef7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2600,7 +2600,7 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 		/* grab the lock that says we are going to process
 		 * all the refs for this head */
-		ret = btrfs_delayed_ref_lock(trans, locked_ref);
+		ret = btrfs_delayed_ref_lock(delayed_refs, locked_ref);
 		spin_unlock(&delayed_refs->lock);
 		/*
 		 * we may have dropped the spin lock to get the head
-- 
2.14.3
[PATCH 34/42] btrfs: wait on ordered extents on abort cleanup
If we flip read-only before we initiate writeback on all dirty pages for
ordered extents we've created then we'll have ordered extents left over
on umount, which results in all sorts of bad things happening. Fix this
by making sure we wait on ordered extents if we have to do the aborted
transaction cleanup stuff.

Reviewed-by: Nikolay Borisov
Signed-off-by: Josef Bacik
---
 fs/btrfs/disk-io.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 54fbdc944a3f..51b2a5bf25e5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4188,6 +4188,14 @@ static void btrfs_destroy_all_ordered_extents(struct btrfs_fs_info *fs_info)
 		spin_lock(&fs_info->ordered_root_lock);
 	}
 	spin_unlock(&fs_info->ordered_root_lock);
+
+	/*
+	 * We need this here because if we've been flipped read-only we won't
+	 * get sync() from the umount, so we need to make sure any ordered
+	 * extents that haven't had their dirty pages IO start writeout yet
+	 * actually get run and error out properly.
+	 */
+	btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 }
 
 static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
-- 
2.14.3
[PATCH 37/42] btrfs: wakeup cleaner thread when adding delayed iput
The cleaner thread usually takes care of delayed iputs, with the
exception of the btrfs_end_transaction_throttle path. The cleaner
thread only gets woken up every 30 seconds, so instead wake it up to
do its work so that we can free up that space as quickly as possible.

Signed-off-by: Josef Bacik
---
 fs/btrfs/inode.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2b257d14bd3d..0a1671fb03bf 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3323,6 +3323,7 @@ void btrfs_add_delayed_iput(struct inode *inode)
 	ASSERT(list_empty(&binode->delayed_iput));
 	list_add_tail(&binode->delayed_iput, &fs_info->delayed_iputs);
 	spin_unlock(&fs_info->delayed_iput_lock);
+	wake_up_process(fs_info->cleaner_kthread);
 }
 
 void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info)
-- 
2.14.3
[PATCH 30/42] btrfs: just delete pending bgs if we are aborted
We still need to do all of the accounting cleanup for pending block
groups if we abort. So set the ret to trans->aborted so if we aborted
the cleanup happens and everybody is happy.

Reviewed-by: Omar Sandoval
Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 91e45cb14d45..ac282eb535a8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10359,9 +10359,15 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 	struct btrfs_root *extent_root = fs_info->extent_root;
 	struct btrfs_block_group_item item;
 	struct btrfs_key key;
-	int ret = 0;
+	int ret;
 	bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
 
+	/*
+	 * If we aborted the transaction with pending bg's we need to just
+	 * cleanup the list and carry on.
+	 */
+	ret = trans->aborted;
+
 	trans->can_flush_pending_bgs = false;
 	while (!list_empty(&trans->new_bgs)) {
 		block_group = list_first_entry(&trans->new_bgs,
-- 
2.14.3
[PATCH 42/42] btrfs: don't run delayed_iputs in commit
This could result in a really bad case where we do something like

  evict
    evict_refill_and_join
      btrfs_commit_transaction
        btrfs_run_delayed_iputs
          evict
            evict_refill_and_join
              btrfs_commit_transaction
  ... forever

We have plenty of other places where we run delayed iputs that are much
safer; let those do the work.

Signed-off-by: Josef Bacik
---
 fs/btrfs/transaction.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 9168efaca37e..c91dc36fccae 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2265,15 +2265,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 
 	kmem_cache_free(btrfs_trans_handle_cachep, trans);
 
-	/*
-	 * If fs has been frozen, we can not handle delayed iputs, otherwise
-	 * it'll result in deadlock about SB_FREEZE_FS.
-	 */
-	if (current != fs_info->transaction_kthread &&
-	    current != fs_info->cleaner_kthread &&
-	    !test_bit(BTRFS_FS_FROZEN, &fs_info->flags))
-		btrfs_run_delayed_iputs(fs_info);
-
 	return ret;
 
 scrub_continue:
-- 
2.14.3
[PATCH 32/42] btrfs: only free reserved extent if we didn't insert it
When we insert the file extent once the ordered extent completes we free
the reserved extent reservation as it'll have been migrated to the
bytes_used counter. However if we error out after this step we'll still
clear the reserved extent reservation, resulting in a negative
accounting of the reserved bytes for the block group and space info.
Fix this by only doing the free if we didn't successfully insert a file
extent for this extent.

Signed-off-by: Josef Bacik
Reviewed-by: Omar Sandoval
---
 fs/btrfs/inode.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5a91055a13b2..2b257d14bd3d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2992,6 +2992,7 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 	bool truncated = false;
 	bool range_locked = false;
 	bool clear_new_delalloc_bytes = false;
+	bool clear_reserved_extent = true;
 
 	if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
 	    !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags) &&
@@ -3095,10 +3096,12 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 						logical_len, logical_len,
 						compress_type, 0, 0,
 						BTRFS_FILE_EXTENT_REG);
-		if (!ret)
+		if (!ret) {
+			clear_reserved_extent = false;
 			btrfs_release_delalloc_bytes(fs_info,
 						     ordered_extent->start,
 						     ordered_extent->disk_len);
+		}
 	}
 	unpin_extent_cache(&BTRFS_I(inode)->extent_tree,
 			   ordered_extent->file_offset, ordered_extent->len,
@@ -3159,8 +3162,13 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 		 * wrong we need to return the space for this ordered extent
 		 * back to the allocator.  We only free the extent in the
 		 * truncated case if we didn't write out the extent at all.
+		 *
+		 * If we made it past insert_reserved_file_extent before we
+		 * errored out then we don't need to do this as the accounting
+		 * has already been done.
 		 */
 		if ((ret || !logical_len) &&
+		    clear_reserved_extent &&
 		    !test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
 		    !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags))
 			btrfs_free_reserved_extent(fs_info,
-- 
2.14.3
[PATCH 31/42] btrfs: cleanup pending bgs on transaction abort
We may abort the transaction during a commit and not have a chance to
run the pending bgs stuff, which will leave block groups on our list and
cause us accounting issues and leaked memory. Fix this by running the
pending bgs when we cleanup a transaction.

Reviewed-by: Omar Sandoval
Signed-off-by: Josef Bacik
---
 fs/btrfs/transaction.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 46ca775a709e..9168efaca37e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2280,6 +2280,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 	btrfs_scrub_continue(fs_info);
 cleanup_transaction:
 	btrfs_trans_release_metadata(trans);
+	/* This cleans up the pending block groups list properly. */
+	if (!trans->aborted)
+		trans->aborted = ret;
+	btrfs_create_pending_block_groups(trans);
 	btrfs_trans_release_chunk_metadata(trans);
 	trans->block_rsv = NULL;
 	btrfs_warn(fs_info, "Skipping commit of aborted transaction.");
-- 
2.14.3
[PATCH 36/42] btrfs: wait on caching when putting the bg cache
While testing my backport I noticed there was a panic if I ran
generic/416 generic/417 generic/418 all in a row. This just happened to
uncover a race where we had outstanding IO after we destroy all of our
workqueues, and then we'd go to queue the endio work on those free'd
workqueues. This is because we aren't waiting for the caching threads
to be done before freeing everything up, so to fix this make sure we
wait on any outstanding caching that's being done before we free up the
block group, so we're sure to be done with all IO by the time we get to
btrfs_stop_all_workers(). This fixes the panic I was seeing
consistently in testing.

Signed-off-by: Josef Bacik
Reviewed-by: Omar Sandoval
---
 fs/btrfs/extent-tree.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 922dd509591a..262e0f7f2ea1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9890,6 +9890,7 @@ void btrfs_put_block_group_cache(struct btrfs_fs_info *info)
 
 		block_group = btrfs_lookup_first_block_group(info, last);
 		while (block_group) {
+			wait_block_group_cache_done(block_group);
 			spin_lock(&block_group->lock);
 			if (block_group->iref)
 				break;
-- 
2.14.3
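[Editor's note] For reference, the reproducer described above is just the
three tests run back to back from an fstests checkout (assuming the usual
TEST/SCRATCH device configuration in local.config):

  ./check generic/416 generic/417 generic/418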
[PATCH 39/42] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue
The throttle path doesn't take cleaner_delayed_iput_mutex, which means we could think we're done flushing iputs in the data space reservation path when we could have a throttler doing an iput. There's no real reason to serialize the delayed iput flushing, so instead of taking the cleaner_delayed_iput_mutex whenever we flush the delayed iputs just replace it with an atomic counter and a waitqueue. This removes the short (or long depending on how big the inode is) window where we think there are no more pending iputs when there really are some. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 4 +++- fs/btrfs/disk-io.c | 5 ++--- fs/btrfs/extent-tree.c | 9 + fs/btrfs/inode.c | 21 + 4 files changed, 31 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index e40356ca0295..1ef0b1649cad 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -894,7 +894,8 @@ struct btrfs_fs_info { spinlock_t delayed_iput_lock; struct list_head delayed_iputs; - struct mutex cleaner_delayed_iput_mutex; + atomic_t nr_delayed_iputs; + wait_queue_head_t delayed_iputs_wait; /* this protects tree_mod_seq_list */ spinlock_t tree_mod_seq_lock; @@ -3212,6 +3213,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root); int btrfs_cont_expand(struct inode *inode, loff_t oldsize, loff_t size); void btrfs_add_delayed_iput(struct inode *inode); void btrfs_run_delayed_iputs(struct btrfs_fs_info *fs_info); +int btrfs_wait_on_delayed_iputs(struct btrfs_fs_info *fs_info); int btrfs_prealloc_file_range(struct inode *inode, int mode, u64 start, u64 num_bytes, u64 min_size, loff_t actual_len, u64 *alloc_hint); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 51b2a5bf25e5..3dce9ff72e41 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1692,9 +1692,7 @@ static int cleaner_kthread(void *arg) goto sleep; } - mutex_lock(&fs_info->cleaner_delayed_iput_mutex); btrfs_run_delayed_iputs(fs_info); - mutex_unlock(&fs_info->cleaner_delayed_iput_mutex); again = btrfs_clean_one_deleted_snapshot(root); mutex_unlock(&fs_info->cleaner_mutex); @@ -2677,7 +2675,6 @@ int open_ctree(struct super_block *sb, mutex_init(&fs_info->delete_unused_bgs_mutex); mutex_init(&fs_info->reloc_mutex); mutex_init(&fs_info->delalloc_root_mutex); - mutex_init(&fs_info->cleaner_delayed_iput_mutex); seqlock_init(&fs_info->profiles_lock); INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots); @@ -2699,6 +2696,7 @@ int open_ctree(struct super_block *sb, atomic_set(&fs_info->defrag_running, 0); atomic_set(&fs_info->qgroup_op_seq, 0); atomic_set(&fs_info->reada_works_cnt, 0); + atomic_set(&fs_info->nr_delayed_iputs, 0); atomic64_set(&fs_info->tree_mod_seq, 0); fs_info->sb = sb; fs_info->max_inline = BTRFS_DEFAULT_MAX_INLINE; @@ -2776,6 +2774,7 @@ int open_ctree(struct super_block *sb, init_waitqueue_head(&fs_info->transaction_wait); init_waitqueue_head(&fs_info->transaction_blocked_wait); init_waitqueue_head(&fs_info->async_submit_wait); + init_waitqueue_head(&fs_info->delayed_iputs_wait); INIT_LIST_HEAD(&fs_info->pinned_chunks); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index a7ba0d0e8de1..77bc53ad84e9 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4258,8 +4258,9 @@ int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes) * operations. Wait for it to finish so that * more space is released. 
*/ - mutex_lock(&fs_info->cleaner_delayed_iput_mutex); - mutex_unlock(&fs_info->cleaner_delayed_iput_mutex); + ret = btrfs_wait_on_delayed_iputs(fs_info); + if (ret) + return ret; goto again; } else { btrfs_end_transaction(trans); @@ -4829,9 +4830,9 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info, * pinned space, so make sure we run the iputs before we do our pinned * bytes check below. */ - mutex_lock(&fs_info->cleaner_delayed_iput_mutex); btrfs_run_delayed_iputs(fs_info); - mutex_unlock(&fs_info->cleaner_delayed_iput_mutex); + wait_event(fs_info->delayed_iputs_wait, + atomic_read(&fs_info->nr_delayed_iputs) == 0); trans = btrfs_join_transaction(fs_info->extent_root); if (IS_ERR(trans)) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
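For readers who want the mutex-to-waitqueue conversion in isolation, here is a minimal userspace C analogue of the scheme, with an atomic counter for pending iputs and a condition variable standing in for the kernel's wait_queue_head_t. All names are illustrative, and the kernel code differs in detail (it pairs wake_up() with wait_event()).

#include <pthread.h>
#include <stdatomic.h>

static atomic_int nr_delayed_iputs;
static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t iputs_done = PTHREAD_COND_INITIALIZER;

/* producer: queue one delayed iput, whichever path does it */
static void add_delayed_iput(void)
{
	atomic_fetch_add(&nr_delayed_iputs, 1);
}

/* consumer: the last one out wakes everyone waiting for the drain */
static void run_delayed_iput(void)
{
	if (atomic_fetch_sub(&nr_delayed_iputs, 1) == 1) {
		pthread_mutex_lock(&wait_lock);
		pthread_cond_broadcast(&iputs_done);
		pthread_mutex_unlock(&wait_lock);
	}
}

/* flusher: wait until every pending iput has run, wherever it ran */
static void wait_on_delayed_iputs(void)
{
	pthread_mutex_lock(&wait_lock);
	while (atomic_load(&nr_delayed_iputs) != 0)
		pthread_cond_wait(&iputs_done, &wait_lock);
	pthread_mutex_unlock(&wait_lock);
}

int main(void)
{
	add_delayed_iput();
	run_delayed_iput();
	wait_on_delayed_iputs();	/* returns at once: nothing pending */
	return 0;
}

Taking the mutex around the broadcast closes the lost-wakeup window between a waiter's check and its sleep, which is the same guarantee the kernel waitqueue's internal lock provides.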
[PATCH 38/42] btrfs: be more explicit about allowed flush states
For FLUSH_LIMIT flushers we really can only allocate chunks and flush delayed inode items, everything else is problematic. I added a bunch of new states and it led to weirdness in the FLUSH_LIMIT case because I forgot about how it worked. So instead explicitly declare the states that are ok for flushing with FLUSH_LIMIT and use that for our state machine. Then as we add new things that are safe we can just add them to this list. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 21 ++--- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 262e0f7f2ea1..a7ba0d0e8de1 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5110,12 +5110,18 @@ void btrfs_init_async_reclaim_work(struct work_struct *work) INIT_WORK(work, btrfs_async_reclaim_metadata_space); } +static const enum btrfs_flush_state priority_flush_states[] = { + FLUSH_DELAYED_ITEMS_NR, + FLUSH_DELAYED_ITEMS, + ALLOC_CHUNK, +}; + static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, struct btrfs_space_info *space_info, struct reserve_ticket *ticket) { u64 to_reclaim; - int flush_state = FLUSH_DELAYED_ITEMS_NR; + int flush_state = 0; spin_lock(&space_info->lock); to_reclaim = btrfs_calc_reclaim_metadata_size(fs_info, space_info, @@ -5127,7 +5133,8 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, spin_unlock(&space_info->lock); do { - flush_space(fs_info, space_info, to_reclaim, flush_state); + flush_space(fs_info, space_info, to_reclaim, + priority_flush_states[flush_state]); flush_state++; spin_lock(&space_info->lock); if (ticket->bytes == 0) { @@ -5135,15 +5142,7 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, spin_unlock(&space_info->lock); return; } spin_unlock(&space_info->lock); - - /* -* Priority flushers can't wait on delalloc without -* deadlocking. -*/ - if (flush_state == FLUSH_DELALLOC || - flush_state == FLUSH_DELALLOC_WAIT) - flush_state = ALLOC_CHUNK; - } while (flush_state < COMMIT_TRANS); + } while (flush_state < ARRAY_SIZE(priority_flush_states)); } static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, -- 2.14.3
[PATCH 35/42] MAINTAINERS: update my email address for btrfs
My work email is completely useless, so switch it to my personal address so I get emails on an account I actually pay attention to. Signed-off-by: Josef Bacik --- MAINTAINERS | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 32fbc6f732d4..7723dc958e99 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -3095,7 +3095,7 @@ F:drivers/gpio/gpio-bt8xx.c BTRFS FILE SYSTEM M: Chris Mason -M: Josef Bacik +M: Josef Bacik M: David Sterba L: linux-btrfs@vger.kernel.org W: http://btrfs.wiki.kernel.org/ -- 2.14.3
[PATCH 33/42] btrfs: fix insert_reserved error handling
We were not handling the reserved byte accounting properly for data references. Metadata was mostly fine: if it errored out, the error paths would free the bytes_reserved count and pin the extent, though even those paths missed one of the error cases. So instead move this handling up into run_one_delayed_ref so we are sure that both cases are properly cleaned up in case of a transaction abort. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 12 1 file changed, 4 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index ac282eb535a8..922dd509591a 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2405,6 +2405,9 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans, insert_reserved); else BUG(); + if (ret && insert_reserved) + btrfs_pin_extent(trans->fs_info, node->bytenr, +node->num_bytes, 1); return ret; } @@ -8253,21 +8256,14 @@ static int alloc_reserved_tree_block(struct btrfs_trans_handle *trans, } path = btrfs_alloc_path(); - if (!path) { - btrfs_free_and_pin_reserved_extent(fs_info, - extent_key.objectid, - fs_info->nodesize); + if (!path) return -ENOMEM; - } path->leave_spinning = 1; ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path, &extent_key, size); if (ret) { btrfs_free_path(path); - btrfs_free_and_pin_reserved_extent(fs_info, - extent_key.objectid, - fs_info->nodesize); return ret; } -- 2.14.3
[PATCH 26/42] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock
We have this open coded in btrfs_destroy_delayed_refs, use the helper instead. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 11 ++- 1 file changed, 2 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 39bd158466cd..121ab180a78a 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4214,16 +4214,9 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, head = rb_entry(node, struct btrfs_delayed_ref_head, href_node); - if (!mutex_trylock(&head->mutex)) { - refcount_inc(&head->refs); - spin_unlock(&delayed_refs->lock); - - mutex_lock(&head->mutex); - mutex_unlock(&head->mutex); - btrfs_put_delayed_ref_head(head); - spin_lock(&delayed_refs->lock); + if (btrfs_delayed_ref_lock(delayed_refs, head)) continue; - } + spin_lock(&head->lock); while ((n = rb_first(&head->ref_tree)) != NULL) { ref = rb_entry(n, struct btrfs_delayed_ref_node, -- 2.14.3
[PATCH 41/42] btrfs: reserve extra space during evict()
We could generate a lot of delayed refs in evict but never have any leftover space from our block rsv to make up for that fact. So reserve some extra space and give it to the transaction so it can be used to refill the delayed refs rsv every loop through the truncate path. Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index dbcca915e681..9f7da5e3c741 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5343,13 +5343,15 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, { struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv; + u64 delayed_refs_extra = btrfs_calc_trans_metadata_size(fs_info, 1); int failures = 0; for (;;) { struct btrfs_trans_handle *trans; int ret; - ret = btrfs_block_rsv_refill(root, rsv, rsv->size, + ret = btrfs_block_rsv_refill(root, rsv, +rsv->size + delayed_refs_extra, BTRFS_RESERVE_FLUSH_LIMIT); if (ret && ++failures > 2) { @@ -5358,9 +5360,28 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, return ERR_PTR(-ENOSPC); } + /* +* Evict can generate a large amount of delayed refs without +* having a way to add space back since we exhaust our temporary +* block rsv. We aren't allowed to do FLUSH_ALL in this case +* because we could deadlock with so many things in the flushing +* code, so we have to try and hold some extra space to +* compensate for our delayed ref generation. If we can't get +* that space then we need to see if we can steal our minimum from +* the global reserve. We will be ratelimited by the amount of +* space we have for the delayed refs rsv, so we'll end up +* committing and trying again. +*/ trans = btrfs_join_transaction(root); - if (IS_ERR(trans) || !ret) + if (IS_ERR(trans) || !ret) { + if (!IS_ERR(trans)) { + trans->block_rsv = &fs_info->trans_block_rsv; + trans->bytes_reserved = delayed_refs_extra; + btrfs_block_rsv_migrate(rsv, trans->block_rsv, + delayed_refs_extra, 1); + } return trans; + } /* * Try to steal from the global reserve if there is space for -- 2.14.3
[PATCH 40/42] btrfs: drop min_size from evict_refill_and_join
We don't need it, rsv->size is set once and never changes throughout its lifetime, so just use that for the reserve size. Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 16 ++-- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index ab8242b10601..dbcca915e681 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5339,8 +5339,7 @@ static void evict_inode_truncate_pages(struct inode *inode) } static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, - struct btrfs_block_rsv *rsv, - u64 min_size) + struct btrfs_block_rsv *rsv) { struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv; @@ -5350,7 +5349,7 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, struct btrfs_trans_handle *trans; int ret; - ret = btrfs_block_rsv_refill(root, rsv, min_size, + ret = btrfs_block_rsv_refill(root, rsv, rsv->size, BTRFS_RESERVE_FLUSH_LIMIT); if (ret && ++failures > 2) { @@ -5368,7 +5367,7 @@ static struct btrfs_trans_handle *evict_refill_and_join(struct btrfs_root *root, * it. */ if (!btrfs_check_space_for_delayed_refs(fs_info) && - !btrfs_block_rsv_migrate(global_rsv, rsv, min_size, 0)) + !btrfs_block_rsv_migrate(global_rsv, rsv, rsv->size, 0)) return trans; /* If not, commit and try again. */ @@ -5384,7 +5383,6 @@ void btrfs_evict_inode(struct inode *inode) struct btrfs_trans_handle *trans; struct btrfs_root *root = BTRFS_I(inode)->root; struct btrfs_block_rsv *rsv; - u64 min_size; int ret; trace_btrfs_inode_evict(inode); @@ -5394,8 +5392,6 @@ void btrfs_evict_inode(struct inode *inode) return; } - min_size = btrfs_calc_trunc_metadata_size(fs_info, 1); - evict_inode_truncate_pages(inode); if (inode->i_nlink && @@ -5428,13 +5424,13 @@ void btrfs_evict_inode(struct inode *inode) rsv = btrfs_alloc_block_rsv(fs_info, BTRFS_BLOCK_RSV_TEMP); if (!rsv) goto no_delete; - rsv->size = min_size; + rsv->size = btrfs_calc_trunc_metadata_size(fs_info, 1); rsv->failfast = 1; btrfs_i_size_write(BTRFS_I(inode), 0); while (1) { - trans = evict_refill_and_join(root, rsv, min_size); + trans = evict_refill_and_join(root, rsv); if (IS_ERR(trans)) goto free_rsv; @@ -5459,7 +5455,7 @@ void btrfs_evict_inode(struct inode *inode) * If it turns out that we are dropping too many of these, we might want * to add a mechanism for retrying these after a commit. */ - trans = evict_refill_and_join(root, rsv, min_size); + trans = evict_refill_and_join(root, rsv); if (!IS_ERR(trans)) { trans->block_rsv = rsv; btrfs_orphan_del(trans, BTRFS_I(inode)); -- 2.14.3
[PATCH 01/42] btrfs: add btrfs_delete_ref_head helper
From: Josef Bacik We do this dance in cleanup_ref_head and check_ref_cleanup, unify it into a helper and cleanup the calling functions. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval --- fs/btrfs/delayed-ref.c | 14 ++ fs/btrfs/delayed-ref.h | 3 ++- fs/btrfs/extent-tree.c | 22 +++--- 3 files changed, 19 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 62ff545ba1f7..3a9e4ac21794 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -393,6 +393,20 @@ btrfs_select_ref_head(struct btrfs_trans_handle *trans) return head; } +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head) +{ + lockdep_assert_held(&delayed_refs->lock); + lockdep_assert_held(&head->lock); + + rb_erase(&head->href_node, &delayed_refs->href_root); + RB_CLEAR_NODE(&head->href_node); + atomic_dec(&delayed_refs->num_entries); + delayed_refs->num_heads--; + if (head->processing == 0) + delayed_refs->num_heads_ready--; +} + /* * Helper to insert the ref_node to the tail or merge with tail. * diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h index d9f2a4ebd5db..7769177b489e 100644 --- a/fs/btrfs/delayed-ref.h +++ b/fs/btrfs/delayed-ref.h @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head) { mutex_unlock(&head->mutex); } - +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head); struct btrfs_delayed_ref_head * btrfs_select_ref_head(struct btrfs_trans_handle *trans); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index f77226d8020a..d24a0de4a2e7 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2492,12 +2492,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, spin_unlock(&delayed_refs->lock); return 1; } - delayed_refs->num_heads--; - rb_erase(&head->href_node, &delayed_refs->href_root); - RB_CLEAR_NODE(&head->href_node); + btrfs_delete_ref_head(delayed_refs, head); spin_unlock(&head->lock); spin_unlock(&delayed_refs->lock); - atomic_dec(&delayed_refs->num_entries); trace_run_delayed_ref_head(fs_info, head, 0); @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (!mutex_trylock(&head->mutex)) goto out; - /* -* at this point we have a head with no other entries. Go -* ahead and process it. -*/ - rb_erase(&head->href_node, &delayed_refs->href_root); - RB_CLEAR_NODE(&head->href_node); - atomic_dec(&delayed_refs->num_entries); - - /* -* we don't take a ref on the node because we're removing it from the -* tree, so we just steal the ref the tree was holding. -*/ - delayed_refs->num_heads--; - if (head->processing == 0) - delayed_refs->num_heads_ready--; + btrfs_delete_ref_head(delayed_refs, head); head->processing = 0; + spin_unlock(&head->lock); spin_unlock(&delayed_refs->lock); -- 2.14.3
[PATCH 24/42] btrfs: assert on non-empty delayed iputs
I ran into an issue where there was some reference being held on an inode that I couldn't track. This assert wasn't triggered, but it at least rules out we're doing something stupid. Reviewed-by: Omar Sandoval Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 377ad9c1cb17..39bd158466cd 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3979,6 +3979,7 @@ void close_ctree(struct btrfs_fs_info *fs_info) kthread_stop(fs_info->transaction_kthread); kthread_stop(fs_info->cleaner_kthread); + ASSERT(list_empty(&fs_info->delayed_iputs)); set_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags); btrfs_free_qgroup_config(fs_info); -- 2.14.3
[PATCH 08/42] btrfs: dump block_rsv when dumping space info
For enospc_debug having the block rsvs is super helpful to see if we've done something wrong. Signed-off-by: Josef Bacik Reviewed-by: Omar Sandoval --- fs/btrfs/extent-tree.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index da73b3e5bc39..c9913c59686b 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -7918,6 +7918,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, return ret; } +#define DUMP_BLOCK_RSV(fs_info, rsv_name) \ +do { \ + struct btrfs_block_rsv *__rsv = &(fs_info)->rsv_name; \ + spin_lock(&__rsv->lock);\ + btrfs_info(fs_info, #rsv_name ": size %llu reserved %llu", \ + __rsv->size, __rsv->reserved); \ + spin_unlock(&__rsv->lock); \ +} while (0) + static void dump_space_info(struct btrfs_fs_info *fs_info, struct btrfs_space_info *info, u64 bytes, int dump_block_groups) @@ -7937,6 +7946,12 @@ static void dump_space_info(struct btrfs_fs_info *fs_info, info->bytes_readonly); spin_unlock(&info->lock); + DUMP_BLOCK_RSV(fs_info, global_block_rsv); + DUMP_BLOCK_RSV(fs_info, trans_block_rsv); + DUMP_BLOCK_RSV(fs_info, chunk_block_rsv); + DUMP_BLOCK_RSV(fs_info, delayed_block_rsv); + DUMP_BLOCK_RSV(fs_info, delayed_refs_rsv); + if (!dump_block_groups) return; -- 2.14.3
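Two standard C idioms do the work in DUMP_BLOCK_RSV(): the do { } while (0) wrapper makes the multi-statement macro behave as a single statement after an unbraced if, and # stringification turns the rsv_name argument into the printed label. A standalone illustration with hypothetical names:

#include <stdio.h>

struct rsv { unsigned long long size, reserved; };
struct info { struct rsv global_rsv, trans_rsv; };

/* #name stringifies the argument; do/while(0) keeps it statement-safe */
#define DUMP_RSV(info, name)						\
do {									\
	struct rsv *__r = &(info)->name;				\
	printf(#name ": size %llu reserved %llu\n",			\
	       __r->size, __r->reserved);				\
} while (0)

int main(void)
{
	struct info fi = { { 128, 64 }, { 32, 0 } };

	if (1)
		DUMP_RSV(&fi, global_rsv);	/* expands safely here */
	DUMP_RSV(&fi, trans_rsv);
	return 0;
}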
[PATCH 06/42] btrfs: introduce delayed_refs_rsv
From: Josef Bacik Traditionally we've had voodoo in btrfs to account for the space that delayed refs may take up by having a global_block_rsv. This works most of the time, except when it doesn't. We've had issues reported and seen in production where sometimes the global reserve is exhausted during transaction commit before we can run all of our delayed refs, resulting in an aborted transaction. Because of this voodoo we have equally dubious flushing semantics around throttling delayed refs which we often get wrong. So instead give them their own block_rsv. This way we can always know exactly how much outstanding space we need for delayed refs. This allows us to make sure we are constantly filling that reservation up with space, and allows us to put more precise pressure on the enospc system. Instead of doing math to see if its a good time to throttle, the normal enospc code will be invoked if we have a lot of delayed refs pending, and they will be run via the normal flushing mechanism. For now the delayed_refs_rsv will hold the reservations for the delayed refs, the block group updates, and deleting csums. We could have a separate rsv for the block group updates, but the csum deletion stuff is still handled via the delayed_refs so that will stay there. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 27 +++-- fs/btrfs/delayed-ref.c | 28 - fs/btrfs/disk-io.c | 4 + fs/btrfs/extent-tree.c | 279 +++ fs/btrfs/inode.c | 2 +- fs/btrfs/transaction.c | 77 ++-- include/trace/events/btrfs.h | 2 + 7 files changed, 312 insertions(+), 107 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 66f1d3895bca..1a2c3b629af2 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -452,8 +452,9 @@ struct btrfs_space_info { #defineBTRFS_BLOCK_RSV_TRANS 3 #defineBTRFS_BLOCK_RSV_CHUNK 4 #defineBTRFS_BLOCK_RSV_DELOPS 5 -#defineBTRFS_BLOCK_RSV_EMPTY 6 -#defineBTRFS_BLOCK_RSV_TEMP7 +#define BTRFS_BLOCK_RSV_DELREFS6 +#defineBTRFS_BLOCK_RSV_EMPTY 7 +#defineBTRFS_BLOCK_RSV_TEMP8 struct btrfs_block_rsv { u64 size; @@ -794,6 +795,8 @@ struct btrfs_fs_info { struct btrfs_block_rsv chunk_block_rsv; /* block reservation for delayed operations */ struct btrfs_block_rsv delayed_block_rsv; + /* block reservation for delayed refs */ + struct btrfs_block_rsv delayed_refs_rsv; struct btrfs_block_rsv empty_block_rsv; @@ -2608,8 +2611,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct btrfs_fs_info *fs_info, int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info); -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans, - struct btrfs_fs_info *fs_info); +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info); void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info, const u64 start); void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg); @@ -2723,10 +2725,12 @@ enum btrfs_reserve_flush_enum { enum btrfs_flush_state { FLUSH_DELAYED_ITEMS_NR = 1, FLUSH_DELAYED_ITEMS = 2, - FLUSH_DELALLOC = 3, - FLUSH_DELALLOC_WAIT = 4, - ALLOC_CHUNK = 5, - COMMIT_TRANS= 6, + FLUSH_DELAYED_REFS_NR = 3, + FLUSH_DELAYED_REFS = 4, + FLUSH_DELALLOC = 5, + FLUSH_DELALLOC_WAIT = 6, + ALLOC_CHUNK = 7, + COMMIT_TRANS= 8, }; int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); @@ -2777,6 +2781,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info *fs_info, void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info, struct btrfs_block_rsv *block_rsv, u64 num_bytes); +void 
btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr); +void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans); +int btrfs_throttle_delayed_refs(struct btrfs_fs_info *fs_info, + enum btrfs_reserve_flush_enum flush); +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info, + struct btrfs_block_rsv *src, + u64 num_bytes); int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache); void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache); void btrfs_put_block_group_cache(struct btrfs_fs_info *info); diff --git a/fs/btrfs/delayed-ref
[PATCH 27/42] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head
Instead of open coding this stuff use the helper instead. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/disk-io.c | 7 +-- 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 121ab180a78a..fe1f229320ef 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4232,12 +4232,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, if (head->must_insert_reserved) pin_bytes = true; btrfs_free_delayed_extent_op(head->extent_op); - delayed_refs->num_heads--; - if (head->processing == 0) - delayed_refs->num_heads_ready--; - atomic_dec(&delayed_refs->num_entries); - rb_erase(&head->href_node, &delayed_refs->href_root); - RB_CLEAR_NODE(&head->href_node); + btrfs_delete_ref_head(delayed_refs, head); spin_unlock(&head->lock); spin_unlock(&delayed_refs->lock); mutex_unlock(&head->mutex); -- 2.14.3
[PATCH 02/42] btrfs: add cleanup_ref_head_accounting helper
From: Josef Bacik We were missing some quota cleanups in check_ref_cleanup, so break the ref head accounting cleanup into a helper and call that from both check_ref_cleanup and cleanup_ref_head. This will hopefully ensure that we don't screw up accounting in the future for other things that we add. Reviewed-by: Omar Sandoval Reviewed-by: Liu Bo Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 67 +- 1 file changed, 39 insertions(+), 28 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index d24a0de4a2e7..a44d55e36e11 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2461,6 +2461,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans, return ret ? ret : 1; } +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, + struct btrfs_delayed_ref_head *head) +{ + struct btrfs_fs_info *fs_info = trans->fs_info; + struct btrfs_delayed_ref_root *delayed_refs = + &trans->transaction->delayed_refs; + + if (head->total_ref_mod < 0) { + struct btrfs_space_info *space_info; + u64 flags; + + if (head->is_data) + flags = BTRFS_BLOCK_GROUP_DATA; + else if (head->is_system) + flags = BTRFS_BLOCK_GROUP_SYSTEM; + else + flags = BTRFS_BLOCK_GROUP_METADATA; + space_info = __find_space_info(fs_info, flags); + ASSERT(space_info); + percpu_counter_add_batch(&space_info->total_bytes_pinned, + -head->num_bytes, + BTRFS_TOTAL_BYTES_PINNED_BATCH); + + if (head->is_data) { + spin_lock(&delayed_refs->lock); + delayed_refs->pending_csums -= head->num_bytes; + spin_unlock(&delayed_refs->lock); + } + } + + /* Also free its reserved qgroup space */ + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, + head->qgroup_reserved); +} + static int cleanup_ref_head(struct btrfs_trans_handle *trans, struct btrfs_delayed_ref_head *head) { @@ -2496,31 +2531,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, spin_unlock(&head->lock); spin_unlock(&delayed_refs->lock); - trace_run_delayed_ref_head(fs_info, head, 0); - - if (head->total_ref_mod < 0) { - struct btrfs_space_info *space_info; - u64 flags; - - if (head->is_data) - flags = BTRFS_BLOCK_GROUP_DATA; - else if (head->is_system) - flags = BTRFS_BLOCK_GROUP_SYSTEM; - else - flags = BTRFS_BLOCK_GROUP_METADATA; - space_info = __find_space_info(fs_info, flags); - ASSERT(space_info); - percpu_counter_add_batch(&space_info->total_bytes_pinned, - -head->num_bytes, - BTRFS_TOTAL_BYTES_PINNED_BATCH); - - if (head->is_data) { - spin_lock(&delayed_refs->lock); - delayed_refs->pending_csums -= head->num_bytes; - spin_unlock(&delayed_refs->lock); - } - } - if (head->must_insert_reserved) { btrfs_pin_extent(fs_info, head->bytenr, head->num_bytes, 1); @@ -2530,9 +2540,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, } } - /* Also free its reserved qgroup space */ - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root, - head->qgroup_reserved); + cleanup_ref_head_accounting(trans, head); + + trace_run_delayed_ref_head(fs_info, head, 0); btrfs_delayed_ref_unlock(head); btrfs_put_delayed_ref_head(head); return 0; @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (head->must_insert_reserved) ret = 1; + cleanup_ref_head_accounting(trans, head); mutex_unlock(&head->mutex); btrfs_put_delayed_ref_head(head); return ret; -- 2.14.3
[PATCH 18/42] btrfs: move the dio_sem higher up the callchain
We're getting a lockdep splat because we take the dio_sem under the log_mutex. What we really need is to protect fsync() from logging an extent map for an extent we never waited on higher up, so just guard the whole thing with dio_sem. Signed-off-by: Josef Bacik --- fs/btrfs/file.c | 12 fs/btrfs/tree-log.c | 2 -- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 095f0bb86bb7..c07110edb9de 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2079,6 +2079,14 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) goto out; inode_lock(inode); + + /* +* We take the dio_sem here because the tree log stuff can race with +* lockless dio writes and get an extent map logged for an extent we +* never waited on. We need it this high up for lockdep reasons. +*/ + down_write(&BTRFS_I(inode)->dio_sem); + atomic_inc(&root->log_batch); /* @@ -2087,6 +2095,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) */ ret = btrfs_wait_ordered_range(inode, start, len); if (ret) { + up_write(&BTRFS_I(inode)->dio_sem); inode_unlock(inode); goto out; } @@ -2110,6 +2119,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) * checked called fsync. */ ret = filemap_check_wb_err(inode->i_mapping, file->f_wb_err); + up_write(&BTRFS_I(inode)->dio_sem); inode_unlock(inode); goto out; } @@ -2128,6 +2138,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) trans = btrfs_start_transaction(root, 0); if (IS_ERR(trans)) { ret = PTR_ERR(trans); + up_write(&BTRFS_I(inode)->dio_sem); inode_unlock(inode); goto out; } @@ -2149,6 +2160,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) * file again, but that will end up using the synchronization * inside btrfs_sync_log to keep things safe. */ + up_write(&BTRFS_I(inode)->dio_sem); inode_unlock(inode); /* diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 1650dc44a5e3..66b7e059b765 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -4374,7 +4374,6 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans, INIT_LIST_HEAD(&extents); - down_write(&inode->dio_sem); write_lock(&tree->lock); test_gen = root->fs_info->last_trans_committed; logged_start = start; @@ -4440,7 +4439,6 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans, } WARN_ON(!list_empty(&extents)); write_unlock(&tree->lock); - up_write(&inode->dio_sem); btrfs_release_path(path); if (!ret) -- 2.14.3
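The fix boils down to establishing one fixed lock order, dio_sem before the logging locks, in every path that can take both. A minimal sketch of such an ordering invariant with stand-in locks (not the btrfs ones):

#include <pthread.h>

static pthread_rwlock_t dio_sem = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t log_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Every path acquires dio_sem first and log_mutex second; two paths
 * disagreeing on that order is exactly what lockdep complains about. */
static void fsync_path(void)
{
	pthread_rwlock_wrlock(&dio_sem);	/* taken high up, as in the patch */
	pthread_mutex_lock(&log_mutex);		/* taken inside the logging code */
	/* ... log changed extents ... */
	pthread_mutex_unlock(&log_mutex);
	pthread_rwlock_unlock(&dio_sem);
}

int main(void)
{
	fsync_path();
	return 0;
}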
[PATCH 22/42] btrfs: only run delayed refs if we're committing
I noticed in a giant dbench run that we spent a lot of time on lock contention while running transaction commit. This is because dbench results in a lot of fsync()'s that do a btrfs_transaction_commit(), and they all run the delayed refs first thing, so they all contend with each other. This leads to seconds of 0 throughput. Change this to only run the delayed refs if we're the ones committing the transaction. This makes the latency go away and we get no more lock contention. Reviewed-by: Omar Sandoval Signed-off-by: Josef Bacik --- fs/btrfs/transaction.c | 24 +--- 1 file changed, 9 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index a0f19ca0bd6c..39a2bddb0b29 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1925,15 +1925,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) btrfs_trans_release_metadata(trans); trans->block_rsv = NULL; - /* make a pass through all the delayed refs we have so far -* any runnings procs may add more while we are here -*/ - ret = btrfs_run_delayed_refs(trans, 0); - if (ret) { - btrfs_end_transaction(trans); - return ret; - } - cur_trans = trans->transaction; /* @@ -1946,12 +1937,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) if (!list_empty(&trans->new_bgs)) btrfs_create_pending_block_groups(trans); - ret = btrfs_run_delayed_refs(trans, 0); - if (ret) { - btrfs_end_transaction(trans); - return ret; - } - if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) { int run_it = 0; @@ -2022,6 +2007,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) spin_unlock(&fs_info->trans_lock); } + /* +* We are now the only one in the commit area, we can run delayed refs +* without hitting a bunch of lock contention from a lot of people +* trying to commit the transaction at once. +*/ + ret = btrfs_run_delayed_refs(trans, 0); + if (ret) + goto cleanup_transaction; + extwriter_counter_dec(cur_trans, trans->type); ret = btrfs_start_delalloc_flush(fs_info); -- 2.14.3
[PATCH 29/42] btrfs: call btrfs_create_pending_block_groups unconditionally
The first thing we do is loop through the list, this if (!list_empty()) btrfs_create_pending_block_groups(); thing is just wasted space. Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 3 +-- fs/btrfs/transaction.c | 6 ++ 2 files changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 7245a198ad31..91e45cb14d45 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2978,8 +2978,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, } if (run_all) { - if (!list_empty(&trans->new_bgs)) - btrfs_create_pending_block_groups(trans); + btrfs_create_pending_block_groups(trans); spin_lock(&delayed_refs->lock); node = rb_first(&delayed_refs->href_root); diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 39a2bddb0b29..46ca775a709e 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -846,8 +846,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans, btrfs_trans_release_metadata(trans); trans->block_rsv = NULL; - if (!list_empty(&trans->new_bgs)) - btrfs_create_pending_block_groups(trans); + btrfs_create_pending_block_groups(trans); btrfs_trans_release_chunk_metadata(trans); @@ -1934,8 +1933,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) cur_trans->delayed_refs.flushing = 1; smp_wmb(); - if (!list_empty(&trans->new_bgs)) - btrfs_create_pending_block_groups(trans); + btrfs_create_pending_block_groups(trans); if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) { int run_it = 0; -- 2.14.3
[PATCH 00/42][v3] My current patch queue
v2->v3:
- reworked the truncate/evict throttling; we were still occasionally hitting enospc aborts in production in these paths because we were too aggressive with space usage.
- reworked the delayed iput stuff to be a little less racy and less deadlock-prone.
- Addressed the comments from Dave and Omar.
- A lot of production testing.

v1->v2:
- addressed all of the issues brought up.
- added more comments.
- split up some patches.

original message:

This is the current queue of things that I've been working on. The main thing these patches do is separate the delayed refs reservations out of the global reserve and into their own block rsv. We have been consistently hitting issues in production where we abort a transaction because we run out of the global reserve, either while running delayed refs or while updating dirty block groups. This is because the math around global reserves is made-up bullshit magic that has been tweaked more and more throughout the years. The result is something that is inconsistent across the board and sometimes wrong. So instead we need a way to know exactly how much space we need to keep around in order to satisfy our outstanding delayed refs and our dirty block groups.

Since we don't know how many delayed refs we need at the start of any modification, we simply use the nr_items passed into btrfs_start_transaction() as a guess for what we may need. This has the side effect of putting more pressure on the ENOSPC system, but it's pressure we can deal with more intelligently, because we always know how much space we have outstanding instead of guessing with weird global reserve math. This works like every other reservation we have: we reserve the worst case up front, and at transaction end time we free up any space we didn't actually use for delayed refs.

My performance tests show that we are a bit faster now, since we can do more intelligent flushing and don't have to fall back on simply committing the transaction in hopes that we have enough space for everything we need to do.

That leads me to the 2nd part of this pull: a bunch of fixes around ENOSPC. Because we are a bit faster now, a bunch of things were uncovered in testing, but they all seem to be resolved now.

The final chunk of fixes is around transaction aborts. There were a lot of accounting bugs I was running into while running generic/435, so I fixed a bunch of those up, and now it runs cleanly.

I have been running these patches through xfstests on multiple machines for a while; they are pretty solid and ready for wider testing and review.

Thanks,

Josef
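The "reserve the worst case up front, give back the unused part at transaction end" pattern described above can be sketched in a few lines of C. The sizes and helper names below are invented for illustration; they are not the btrfs calculations.

#include <stdint.h>
#include <stdio.h>

static uint64_t rsv_reserved;	/* stands in for the delayed refs rsv */

/* worst case: assume every item we might touch produces one ref head */
static void trans_start(uint64_t nr_items, uint64_t bytes_per_item)
{
	rsv_reserved += nr_items * bytes_per_item;
}

/* at transaction end, return only the part that never materialized */
static void trans_end(uint64_t nr_items, uint64_t refs_generated,
		      uint64_t bytes_per_item)
{
	rsv_reserved -= (nr_items - refs_generated) * bytes_per_item;
}

int main(void)
{
	trans_start(8, 4096);	/* reserve for 8 items up front */
	trans_end(8, 5, 4096);	/* only 5 delayed refs were created */

	/* what's left backs the 5 real refs until they are run */
	printf("reserved: %llu\n", (unsigned long long)rsv_reserved);
	return 0;
}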
[PATCH 28/42] btrfs: handle delayed ref head accounting cleanup in abort
We weren't doing any of the accounting cleanup when we aborted transactions. Fix this by making cleanup_ref_head_accounting global and calling it from the abort code, this fixes the issue where our accounting was all wrong after the fs aborts. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 5 + fs/btrfs/disk-io.c | 1 + fs/btrfs/extent-tree.c | 13 ++--- 3 files changed, 12 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 29db902511c1..e40356ca0295 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -35,6 +35,7 @@ struct btrfs_trans_handle; struct btrfs_transaction; struct btrfs_pending_snapshot; +struct btrfs_delayed_ref_root; extern struct kmem_cache *btrfs_trans_handle_cachep; extern struct kmem_cache *btrfs_bit_radix_cachep; extern struct kmem_cache *btrfs_path_cachep; @@ -2623,6 +2624,10 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, unsigned long count); int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info, unsigned long count, u64 transid, int wait); +void +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info, + struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head); int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 len); int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info, u64 bytenr, diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index fe1f229320ef..54fbdc944a3f 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4240,6 +4240,7 @@ static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans, if (pin_bytes) btrfs_pin_extent(fs_info, head->bytenr, head->num_bytes, 1); + btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head); btrfs_put_delayed_ref_head(head); cond_resched(); spin_lock(&delayed_refs->lock); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 34105bc5eef7..7245a198ad31 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2475,12 +2475,11 @@ static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans, return ret ? ret : 1; } -static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans, - struct btrfs_delayed_ref_head *head) +void +btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info, + struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head) { - struct btrfs_fs_info *fs_info = trans->fs_info; - struct btrfs_delayed_ref_root *delayed_refs = - &trans->transaction->delayed_refs; int nr_items = 1; if (head->total_ref_mod < 0) { @@ -2558,7 +2557,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, } } - cleanup_ref_head_accounting(trans, head); + btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head); trace_run_delayed_ref_head(fs_info, head, 0); btrfs_delayed_ref_unlock(head); @@ -7223,7 +7222,7 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (head->must_insert_reserved) ret = 1; - cleanup_ref_head_accounting(trans, head); + btrfs_cleanup_ref_head_accounting(trans->fs_info, delayed_refs, head); mutex_unlock(&head->mutex); btrfs_put_delayed_ref_head(head); return ret; -- 2.14.3
[PATCH 13/42] btrfs: add ALLOC_CHUNK_FORCE to the flushing code
With my change to no longer take into account the global reserve for metadata allocation chunks we have this side-effect for mixed block group fs'es where we are no longer allocating enough chunks for the data/metadata requirements. To deal with this add an ALLOC_CHUNK_FORCE step to the flushing state machine. This will only get used if we've already made a full loop through the flushing machinery and tried committing the transaction. If we have then we can try and force a chunk allocation since we likely need it to make progress. This resolves the issues I was seeing with the mixed bg tests in xfstests with my previous patch. Signed-off-by: Josef Bacik --- fs/btrfs/ctree.h | 3 ++- fs/btrfs/extent-tree.c | 18 +- include/trace/events/btrfs.h | 1 + 3 files changed, 20 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 1a2c3b629af2..29db902511c1 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2730,7 +2730,8 @@ enum btrfs_flush_state { FLUSH_DELALLOC = 5, FLUSH_DELALLOC_WAIT = 6, ALLOC_CHUNK = 7, - COMMIT_TRANS= 8, + ALLOC_CHUNK_FORCE = 8, + COMMIT_TRANS= 9, }; int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index c0f6110419b2..cd2280962c8c 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4914,6 +4914,7 @@ static void flush_space(struct btrfs_fs_info *fs_info, btrfs_end_transaction(trans); break; case ALLOC_CHUNK: + case ALLOC_CHUNK_FORCE: trans = btrfs_join_transaction(root); if (IS_ERR(trans)) { ret = PTR_ERR(trans); @@ -4921,7 +4922,9 @@ static void flush_space(struct btrfs_fs_info *fs_info, } ret = do_chunk_alloc(trans, btrfs_metadata_alloc_profile(fs_info), -CHUNK_ALLOC_NO_FORCE); +(state == ALLOC_CHUNK) ? +CHUNK_ALLOC_NO_FORCE : +CHUNK_ALLOC_FORCE); btrfs_end_transaction(trans); if (ret > 0 || ret == -ENOSPC) ret = 0; @@ -5057,6 +5060,19 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work) commit_cycles--; } + /* +* We don't want to force a chunk allocation until we've tried +* pretty hard to reclaim space. Think of the case where we +* free'd up a bunch of space and so have a lot of pinned space +* to reclaim. We would rather use that than possibly create an +* underutilized metadata chunk. So if this is our first run +* through the flushing state machine skip ALLOC_CHUNK_FORCE and +* commit the transaction. If nothing has changed the next go +* around then we can force a chunk allocation. +*/ + if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles) + flush_state++; + if (flush_state > COMMIT_TRANS) { commit_cycles++; if (commit_cycles > 2) { diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index 7d205e50b09c..fdb23181b5b7 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush, { FLUSH_DELAYED_REFS_NR,"FLUSH_DELAYED_REFS_NR"}, \ { FLUSH_DELAYED_REFS, "FLUSH_ELAYED_REFS"}, \ { ALLOC_CHUNK, "ALLOC_CHUNK"}, \ + { ALLOC_CHUNK_FORCE,"ALLOC_CHUNK_FORCE"}, \ { COMMIT_TRANS, "COMMIT_TRANS"}) TRACE_EVENT(btrfs_flush_space, -- 2.14.3
[PATCH 03/42] btrfs: cleanup extent_op handling
From: Josef Bacik The cleanup_extent_op function actually would run the extent_op if it needed running, which made the name sort of a misnomer. Change it to run_and_cleanup_extent_op, and move the actual cleanup work to cleanup_extent_op so it can be used by check_ref_cleanup() in order to unify the extent op handling. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 36 +++- 1 file changed, 23 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index a44d55e36e11..98f36dfeccb0 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2442,19 +2442,33 @@ static void unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_ref btrfs_delayed_ref_unlock(head); } -static int cleanup_extent_op(struct btrfs_trans_handle *trans, -struct btrfs_delayed_ref_head *head) +static struct btrfs_delayed_extent_op * +cleanup_extent_op(struct btrfs_trans_handle *trans, + struct btrfs_delayed_ref_head *head) { struct btrfs_delayed_extent_op *extent_op = head->extent_op; - int ret; if (!extent_op) - return 0; - head->extent_op = NULL; + return NULL; + if (head->must_insert_reserved) { + head->extent_op = NULL; btrfs_free_delayed_extent_op(extent_op); - return 0; + return NULL; } + return extent_op; +} + +static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans, +struct btrfs_delayed_ref_head *head) +{ + struct btrfs_delayed_extent_op *extent_op = + cleanup_extent_op(trans, head); + int ret; + + if (!extent_op) + return 0; + head->extent_op = NULL; spin_unlock(&head->lock); ret = run_delayed_extent_op(trans, head, extent_op); btrfs_free_delayed_extent_op(extent_op); @@ -2506,7 +2520,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans, delayed_refs = &trans->transaction->delayed_refs; - ret = cleanup_extent_op(trans, head); + ret = run_and_cleanup_extent_op(trans, head); if (ret < 0) { unselect_delayed_ref_head(delayed_refs, head); btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret); @@ -6977,12 +6991,8 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans, if (!RB_EMPTY_ROOT(&head->ref_tree)) goto out; - if (head->extent_op) { - if (!head->must_insert_reserved) - goto out; - btrfs_free_delayed_extent_op(head->extent_op); - head->extent_op = NULL; - } + if (cleanup_extent_op(trans, head) != NULL) + goto out; /* * waiting for the lock here would deadlock. If someone else has it -- 2.14.3
[PATCH 14/42] btrfs: reset max_extent_size properly
If we use up our block group before allocating a new one we'll easily get a max_extent_size that's set really really low, which will result in a lot of fragmentation. We need to make sure we're resetting the max_extent_size when we add a new chunk or add new space. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index cd2280962c8c..f84537a1d7eb 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4573,6 +4573,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags, goto out; } else { ret = 1; + space_info->max_extent_size = 0; } space_info->force_alloc = CHUNK_ALLOC_NO_FORCE; @@ -6671,6 +6672,7 @@ static int btrfs_free_reserved_bytes(struct btrfs_block_group_cache *cache, space_info->bytes_readonly += num_bytes; cache->reserved -= num_bytes; space_info->bytes_reserved -= num_bytes; + space_info->max_extent_size = 0; if (delalloc) cache->delalloc_bytes -= num_bytes; -- 2.14.3
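The invariant behind both hunks: max_extent_size is a cached upper bound on the largest allocation worth attempting, so any event that hands space back must invalidate it, or allocations keep being clamped to a stale, too-small bound. A tiny sketch of the invariant with an illustrative structure (not the btrfs one):

#include <stdint.h>
#include <stdio.h>

struct space_info {
	uint64_t free_bytes;
	uint64_t max_extent_size;	/* 0 means unknown, recompute on miss */
};

/* Returning space can only grow the largest contiguous range, so the
 * cached bound is stale the moment bytes come back. */
static void add_space(struct space_info *si, uint64_t bytes)
{
	si->free_bytes += bytes;
	si->max_extent_size = 0;
}

int main(void)
{
	struct space_info si = { .free_bytes = 0, .max_extent_size = 4096 };

	add_space(&si, 65536);
	printf("max_extent_size now %llu\n",
	       (unsigned long long)si.max_extent_size);
	return 0;
}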
[PATCH 21/42] btrfs: reset max_extent_size on clear in a bitmap
From: Josef Bacik We need to clear the max_extent_size when we clear bits from a bitmap since it could have been from the range that contains the max_extent_size. Reviewed-by: Liu Bo Signed-off-by: Josef Bacik --- fs/btrfs/free-space-cache.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 2e96ee7da3ec..d2a863a2ee24 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -1687,6 +1687,8 @@ static inline void __bitmap_clear_bits(struct btrfs_free_space_ctl *ctl, bitmap_clear(info->bitmap, start, count); info->bytes -= bytes; + if (info->max_extent_size > ctl->unit) + info->max_extent_size = 0; } static void bitmap_clear_bits(struct btrfs_free_space_ctl *ctl, -- 2.14.3
[PATCH 10/42] btrfs: protect space cache inode alloc with nofs
If we're allocating a new space cache inode it's likely going to be under a transaction handle, so we need to use memalloc_nofs_save() in order to avoid deadlocks, and more importantly lockdep messages that make xfstests fail. Reviewed-by: Omar Sandoval Signed-off-by: Josef Bacik --- fs/btrfs/free-space-cache.c | 8 1 file changed, 8 insertions(+) diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index c3888c113d81..e077ad3b4549 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -10,6 +10,7 @@ #include #include #include +#include #include "ctree.h" #include "free-space-cache.h" #include "transaction.h" @@ -47,6 +48,7 @@ static struct inode *__lookup_free_space_inode(struct btrfs_root *root, struct btrfs_free_space_header *header; struct extent_buffer *leaf; struct inode *inode = NULL; + unsigned nofs_flag; int ret; key.objectid = BTRFS_FREE_SPACE_OBJECTID; @@ -68,7 +70,13 @@ static struct inode *__lookup_free_space_inode(struct btrfs_root *root, btrfs_disk_key_to_cpu(&location, &disk_key); btrfs_release_path(path); + /* +* We are often under a trans handle at this point, so we need to make +* sure NOFS is set to keep us from deadlocking. +*/ + nofs_flag = memalloc_nofs_save(); inode = btrfs_iget(fs_info->sb, &location, root, NULL); + memalloc_nofs_restore(nofs_flag); if (IS_ERR(inode)) return inode; -- 2.14.3
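A design point worth calling out: memalloc_nofs_save() returns the previous flag word precisely so these sections can nest, with an inner restore unable to clear NOFS for a still-open outer section. Here is a userspace sketch of why returning the old value is the right interface; the flag and helpers are stand-ins, not the kernel implementation.

#include <stdio.h>

#define PF_MEMALLOC_NOFS 0x1

static unsigned int current_flags;

static unsigned int nofs_save(void)
{
	unsigned int old = current_flags;

	current_flags |= PF_MEMALLOC_NOFS;
	return old;		/* caller keeps this for the restore */
}

static void nofs_restore(unsigned int old)
{
	current_flags = old;	/* inner restore can't clobber outer scope */
}

int main(void)
{
	unsigned int outer = nofs_save();
	unsigned int inner = nofs_save();	/* nested section */

	nofs_restore(inner);
	printf("after inner restore: %#x\n", current_flags);	/* still NOFS */
	nofs_restore(outer);
	printf("after outer restore: %#x\n", current_flags);	/* cleared */
	return 0;
}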
[PATCH 16/42] btrfs: loop in inode_rsv_refill
With severe fragmentation we can end up with our inode rsv size being huge during writeout, which would cause us to need to make very large metadata reservations. However we may not actually need that much once writeout is complete. So instead try to make our reservation, and if we couldn't make it re-calculate our new reservation size and try again. If our reservation size doesn't change between tries then we know we are actually out of space and can error out. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 26 -- 1 file changed, 24 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 7a53f6a29ebc..461b8076928b 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5781,10 +5781,11 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode, { struct btrfs_root *root = inode->root; struct btrfs_block_rsv *block_rsv = &inode->block_rsv; - u64 num_bytes = 0; + u64 num_bytes = 0, last = 0; u64 qgroup_num_bytes = 0; int ret = -ENOSPC; +again: spin_lock(&block_rsv->lock); if (block_rsv->reserved < block_rsv->size) num_bytes = block_rsv->size - block_rsv->reserved; @@ -5796,6 +5797,13 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode, if (num_bytes == 0) return 0; + /* +* If our reservation size hasn't changed since the last time we tried +* to make an allocation we can just bail. +*/ + if (last && last == num_bytes) + return -ENOSPC; + ret = btrfs_qgroup_reserve_meta_prealloc(root, qgroup_num_bytes, true); if (ret) return ret; @@ -5809,8 +5817,22 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode *inode, spin_lock(&block_rsv->lock); block_rsv->qgroup_rsv_reserved += qgroup_num_bytes; spin_unlock(&block_rsv->lock); - } else + } else { btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes); + + /* +* If we are fragmented we can end up with a lot of outstanding +* extents which will make our size be much larger than our +* reserved amount. If we happen to try to do a reservation +* here that may result in us trying to do a pretty hefty +* reservation, which we may not need once delalloc flushing +* happens. If this is the case try and do the reserve again. +*/ + if (flush == BTRFS_RESERVE_FLUSH_ALL) { + last = num_bytes; + goto again; + } + } return ret; } -- 2.14.3
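Stripped of the btrfs specifics, this is a retry-until-stable loop: keep retrying while the computed requirement shrinks, and treat two identical consecutive requirements as proof of a real ENOSPC. A runnable sketch under that framing; compute_need() and try_reserve() are invented stand-ins that simulate writeout shrinking the target.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t rsv_size = 3 * 4096, rsv_reserved;

/* stand-in: in btrfs this is size - reserved, sampled under a lock */
static uint64_t compute_need(void)
{
	return rsv_size - rsv_reserved;
}

/* stand-in that always fails but lets "flushing" shrink the target */
static int try_reserve(uint64_t bytes)
{
	(void)bytes;
	if (rsv_size > 4096)
		rsv_size -= 4096;	/* pretend writeout completed */
	return -1;
}

static int rsv_refill(void)
{
	uint64_t last = 0;

	for (;;) {
		uint64_t need = compute_need();

		if (need == 0)
			return 0;	/* writeout shrank the rsv for us */
		if (last == need)
			return -ENOSPC;	/* two stable passes: really full */
		if (try_reserve(need) == 0)
			return 0;
		last = need;	/* flushing may lower the next request */
	}
}

int main(void)
{
	printf("rsv_refill() = %d\n", rsv_refill());
	return 0;
}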
[PATCH 15/42] btrfs: don't enospc all tickets on flush failure
With the introduction of the per-inode block_rsv it became possible to have really really large reservation requests made because of data fragmentation. Since the ticket stuff assumed that we'd always have relatively small reservation requests it just killed all tickets if we were unable to satisfy the current request. However this is generally not the case anymore. So fix this logic to instead see if we had a ticket that we were able to give some reservation to, and if we were continue the flushing loop again. Likewise we make the tickets use the space_info_add_old_bytes() method of returning what reservation they did receive in hopes that it could satisfy reservations down the line. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 45 + 1 file changed, 25 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index f84537a1d7eb..7a53f6a29ebc 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4779,6 +4779,7 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim, } struct reserve_ticket { + u64 orig_bytes; u64 bytes; int error; struct list_head list; @@ -5000,7 +5001,7 @@ static inline int need_do_async_reclaim(struct btrfs_fs_info *fs_info, !test_bit(BTRFS_FS_STATE_REMOUNTING, &fs_info->fs_state)); } -static void wake_all_tickets(struct list_head *head) +static bool wake_all_tickets(struct list_head *head) { struct reserve_ticket *ticket; @@ -5009,7 +5010,10 @@ static void wake_all_tickets(struct list_head *head) list_del_init(&ticket->list); ticket->error = -ENOSPC; wake_up(&ticket->wait); + if (ticket->bytes != ticket->orig_bytes) + return true; } + return false; } /* @@ -5077,8 +5081,12 @@ static void btrfs_async_reclaim_metadata_space(struct work_struct *work) if (flush_state > COMMIT_TRANS) { commit_cycles++; if (commit_cycles > 2) { - wake_all_tickets(&space_info->tickets); - space_info->flush = 0; + if (wake_all_tickets(&space_info->tickets)) { + flush_state = FLUSH_DELAYED_ITEMS_NR; + commit_cycles--; + } else { + space_info->flush = 0; + } } else { flush_state = FLUSH_DELAYED_ITEMS_NR; } @@ -5130,10 +5138,11 @@ static void priority_reclaim_metadata_space(struct btrfs_fs_info *fs_info, static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, struct btrfs_space_info *space_info, - struct reserve_ticket *ticket, u64 orig_bytes) + struct reserve_ticket *ticket) { DEFINE_WAIT(wait); + u64 reclaim_bytes = 0; int ret = 0; spin_lock(&space_info->lock); @@ -5154,14 +5163,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info *fs_info, ret = ticket->error; if (!list_empty(&ticket->list)) list_del_init(&ticket->list); - if (ticket->bytes && ticket->bytes < orig_bytes) { - u64 num_bytes = orig_bytes - ticket->bytes; - space_info->bytes_may_use -= num_bytes; - trace_btrfs_space_reservation(fs_info, "space_info", - space_info->flags, num_bytes, 0); - } + if (ticket->bytes && ticket->bytes < ticket->orig_bytes) + reclaim_bytes = ticket->orig_bytes - ticket->bytes; spin_unlock(&space_info->lock); + if (reclaim_bytes) + space_info_add_old_bytes(fs_info, space_info, reclaim_bytes); return ret; } @@ -5187,6 +5194,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, { struct reserve_ticket ticket; u64 used; + u64 reclaim_bytes = 0; int ret = 0; ASSERT(orig_bytes); @@ -5222,6 +5230,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, * the list and we will do our own flushing further down. 
*/ if (ret && flush != BTRFS_RESERVE_NO_FLUSH) { + ticket.orig_bytes = orig_bytes; ticket.bytes = orig_bytes; ticket.error = 0; init_waitqueue_head(&ticket.wait); @@ -5262,25 +5271,21 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info *fs_info, return ret; if (flush == BTRFS_RESERVE_FLUSH_ALL) - return wait_reserve_ticket(fs_info, space_info, &ticket, -
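The behavioural core of the patch is the new return value: instead of unconditionally failing every waiter, the flusher asks whether any ticket received partial space this cycle, and if one did, it restarts the flush state machine rather than declaring ENOSPC. A sketch of just that decision with illustrative types; the kernel version also dequeues and wakes each ticket as it goes.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ticket {
	uint64_t orig_bytes;	/* what the reservation asked for */
	uint64_t bytes;		/* what is still missing */
	struct ticket *next;
};

/* Fail the queue, but report whether anyone made progress: a ticket
 * whose remaining bytes shrank proves flushing is still producing
 * space, so the caller should loop instead of giving up. */
static bool fail_tickets(struct ticket *head)
{
	bool progress = false;

	for (struct ticket *t = head; t; t = t->next)
		if (t->bytes != t->orig_bytes)
			progress = true;

	return progress;
}

int main(void)
{
	struct ticket b = { 4096, 4096, NULL };	/* untouched */
	struct ticket a = { 8192, 1024, &b };	/* partially filled */

	return fail_tickets(&a) ? 0 : 1;	/* 0: retry flushing */
}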
[PATCH 11/42] btrfs: fix truncate throttling
We have a bunch of magic to make sure we're throttling delayed refs when truncating a file. Now that we have a delayed refs rsv and a mechanism for refilling that reserve simply use that instead of all of this magic. Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 79 1 file changed, 17 insertions(+), 62 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index cd00ec869c96..5a91055a13b2 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4493,31 +4493,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry *dentry) return err; } -static int truncate_space_check(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - u64 bytes_deleted) -{ - struct btrfs_fs_info *fs_info = root->fs_info; - int ret; - - /* -* This is only used to apply pressure to the enospc system, we don't -* intend to use this reservation at all. -*/ - bytes_deleted = btrfs_csum_bytes_to_leaves(fs_info, bytes_deleted); - bytes_deleted *= fs_info->nodesize; - ret = btrfs_block_rsv_add(root, &fs_info->trans_block_rsv, - bytes_deleted, BTRFS_RESERVE_NO_FLUSH); - if (!ret) { - trace_btrfs_space_reservation(fs_info, "transaction", - trans->transid, - bytes_deleted, 1); - trans->bytes_reserved += bytes_deleted; - } - return ret; - -} - /* * Return this if we need to call truncate_block for the last bit of the * truncate. @@ -4562,7 +4537,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, u64 bytes_deleted = 0; bool be_nice = false; bool should_throttle = false; - bool should_end = false; BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY); @@ -4775,15 +4749,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, btrfs_abort_transaction(trans, ret); break; } - if (btrfs_should_throttle_delayed_refs(trans, fs_info)) - btrfs_async_run_delayed_refs(fs_info, - trans->delayed_ref_updates * 2, - trans->transid, 0); if (be_nice) { - if (truncate_space_check(trans, root, -extent_num_bytes)) { - should_end = true; - } if (btrfs_should_throttle_delayed_refs(trans, fs_info)) should_throttle = true; @@ -4795,7 +4761,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, if (path->slots[0] == 0 || path->slots[0] != pending_del_slot || - should_throttle || should_end) { + should_throttle) { if (pending_del_nr) { ret = btrfs_del_items(trans, root, path, pending_del_slot, @@ -4807,23 +4773,24 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans, pending_del_nr = 0; } btrfs_release_path(path); - if (should_throttle) { - unsigned long updates = trans->delayed_ref_updates; - if (updates) { - trans->delayed_ref_updates = 0; - ret = btrfs_run_delayed_refs(trans, - updates * 2); - if (ret) - break; - } - } + /* -* if we failed to refill our space rsv, bail out -* and let the transaction restart +* We can generate a lot of delayed refs, so we need to +* throttle every once and a while and make sure we're +* adding enough space to keep up with the work we are +* generating. Since we hold a transaction here we +* can't flush, and we don't want to FLUSH_LIMIT because +* we could have generated too many delayed refs to +* actually allocate, so just bail if we're short and +* let th
[PATCH 04/42] btrfs: only track ref_heads in delayed_ref_updates
From: Josef Bacik We use this number to figure out how many delayed refs to run, but __btrfs_run_delayed_refs really only checks every time we need a new delayed ref head, so we always run at least one ref head completely no matter what the number of items on it. Fix the accounting to only be adjusted when we add/remove a ref head. Signed-off-by: Josef Bacik --- fs/btrfs/delayed-ref.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 3a9e4ac21794..27f7dd4e3d52 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -234,8 +234,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans, ref->in_tree = 0; btrfs_put_delayed_ref(ref); atomic_dec(&delayed_refs->num_entries); - if (trans->delayed_ref_updates) - trans->delayed_ref_updates--; } static bool merge_ref(struct btrfs_trans_handle *trans, @@ -460,7 +458,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans, if (ref->action == BTRFS_ADD_DELAYED_REF) list_add_tail(&ref->add_list, &href->ref_add_list); atomic_inc(&root->num_entries); - trans->delayed_ref_updates++; spin_unlock(&href->lock); return ret; } -- 2.14.3
[PATCH 23/42] btrfs: make sure we create all new bgs
Allocating new chunks modifies both the extent and chunk tree, which can trigger new chunk allocations. So instead of doing list_for_each_safe, just do while (!list_empty()) so we make sure we don't exit with other pending bg's still on our list. Reviewed-by: Omar Sandoval Reviewed-by: Liu Bo Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 40503438ef6c..01bfb02101c1 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -10357,7 +10357,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info) void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans) { struct btrfs_fs_info *fs_info = trans->fs_info; - struct btrfs_block_group_cache *block_group, *tmp; + struct btrfs_block_group_cache *block_group; struct btrfs_root *extent_root = fs_info->extent_root; struct btrfs_block_group_item item; struct btrfs_key key; @@ -10365,7 +10365,10 @@ void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans) bool can_flush_pending_bgs = trans->can_flush_pending_bgs; trans->can_flush_pending_bgs = false; - list_for_each_entry_safe(block_group, tmp, &trans->new_bgs, bg_list) { + while (!list_empty(&trans->new_bgs)) { + block_group = list_first_entry(&trans->new_bgs, + struct btrfs_block_group_cache, + bg_list); if (ret) goto next; -- 2.14.3
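The subtlety is that list_for_each_entry_safe() caches the next pointer before the loop body runs, so if the body appends a new tail entry after the cursor has already been fetched as the list head, the walk terminates with that entry unprocessed. Re-testing emptiness on every pass cannot miss late additions. A runnable userspace sketch of the draining pattern, using a plain singly linked list in place of the kernel list_head:

#include <stdio.h>
#include <stdlib.h>

struct bg {
	int id;
	struct bg *next;
};

static struct bg *pending;	/* stands in for trans->new_bgs */

static void add_bg(int id)
{
	struct bg *b = malloc(sizeof(*b));

	b->id = id;
	b->next = pending;
	pending = b;
}

/* creating one block group may allocate a chunk, which queues another */
static void create_bg(struct bg *b)
{
	printf("created bg %d\n", b->id);
	if (b->id == 1)
		add_bg(2);	/* queued mid-walk */
}

int main(void)
{
	add_bg(1);

	/* re-check emptiness every pass: entries queued while we work
	 * get drained too, unlike a cursor walk that can step past the
	 * end before a late addition shows up */
	while (pending) {
		struct bg *b = pending;

		pending = b->next;
		create_bg(b);
		free(b);
	}
	return 0;
}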
[PATCH 20/42] btrfs: don't use ctl->free_space for max_extent_size
From: Josef Bacik

max_extent_size is supposed to be the largest contiguous range for the space info, and ctl->free_space is the total free space in the block group. We need to keep track of these separately and _only_ use the max_free_space if we don't have a max_extent_size, as that means our original request was too large to search any of the block groups for and therefore wouldn't have a max_extent_size set.

Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index eb64b28196b8..40503438ef6c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7492,6 +7492,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 	struct btrfs_block_group_cache *block_group = NULL;
 	u64 search_start = 0;
 	u64 max_extent_size = 0;
+	u64 max_free_space = 0;
 	u64 empty_cluster = 0;
 	struct btrfs_space_info *space_info;
 	int loop = 0;
@@ -7787,8 +7788,8 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 			spin_lock(&ctl->tree_lock);
 			if (ctl->free_space <
 			    num_bytes + empty_cluster + empty_size) {
-				if (ctl->free_space > max_extent_size)
-					max_extent_size = ctl->free_space;
+				max_free_space = max(max_free_space,
						     ctl->free_space);
 				spin_unlock(&ctl->tree_lock);
 				goto loop;
 			}
@@ -7955,6 +7956,8 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 	}
 out:
 	if (ret == -ENOSPC) {
+		if (!max_extent_size)
+			max_extent_size = max_free_space;
 		spin_lock(&space_info->lock);
 		space_info->max_extent_size = max_extent_size;
 		spin_unlock(&space_info->lock);
-- 
2.14.3
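A reduced model of the two statistics (structure and numbers invented for illustration): the contiguous maximum is only recorded for block groups that were actually searched, and the total-free-space figure is used as a weaker fallback hint only when none was.

#include <stdint.h>
#include <stdio.h>

struct bg_stats {
    uint64_t free_space;        /* total free bytes in the group */
    uint64_t largest_extent;    /* largest contiguous chunk */
};

/* Returns 0 if some group could satisfy "want"; otherwise fills
 * *max_extent_size with the hint to report back. */
static int find_extent(const struct bg_stats *bgs, int nr,
                       uint64_t want, uint64_t *max_extent_size)
{
    uint64_t max_free_space = 0;

    for (int i = 0; i < nr; i++) {
        /* Cheap pre-check: not enough space overall, so this group is
         * skipped without looking at any individual extent. */
        if (bgs[i].free_space < want) {
            if (bgs[i].free_space > max_free_space)
                max_free_space = bgs[i].free_space;
            continue;
        }
        if (bgs[i].largest_extent >= want)
            return 0;   /* the allocation would succeed here */
        /* This group was really searched: record the true maximum. */
        if (bgs[i].largest_extent > *max_extent_size)
            *max_extent_size = bgs[i].largest_extent;
    }
    /* Fall back to total free space only when no group was searched,
     * since it overstates what a single allocation can get. */
    if (!*max_extent_size)
        *max_extent_size = max_free_space;
    return -1;
}

int main(void)
{
    struct bg_stats bgs[] = { { 4096, 4096 }, { 8192, 4096 } };
    uint64_t hint = 0;

    if (find_extent(bgs, 2, 1 << 20, &hint))
        printf("ENOSPC, retry hint: %llu bytes\n",
               (unsigned long long)hint);
    return 0;
}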
[PATCH 07/42] btrfs: check if free bgs for commit
may_commit_transaction will skip committing the transaction if we don't have enough pinned space or if we're trying to find space for a SYSTEM chunk. However if we have pending free block groups in this transaction we still want to commit as we may be able to allocate a chunk to make our reservation. So instead of just returning ENOSPC, check if we have free block groups pending, and if so commit the transaction to allow us to use that free space.

Signed-off-by: Josef Bacik
Reviewed-by: Omar Sandoval
---
 fs/btrfs/extent-tree.c | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1213f573eea2..da73b3e5bc39 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4830,10 +4830,18 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	if (!bytes)
 		return 0;
 
-	/* See if there is enough pinned space to make this reservation */
-	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
-				   bytes,
-				   BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
+	trans = btrfs_join_transaction(fs_info->extent_root);
+	if (IS_ERR(trans))
+		return -ENOSPC;
+
+	/*
+	 * See if there is enough pinned space to make this reservation, or if
+	 * we have bg's that are going to be freed, allowing us to possibly do a
+	 * chunk allocation the next loop through.
+	 */
+	if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &trans->transaction->flags) ||
+	    __percpu_counter_compare(&space_info->total_bytes_pinned, bytes,
+				     BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
 		goto commit;
 
 	/*
@@ -4841,7 +4849,7 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	 * this reservation.
 	 */
 	if (space_info != delayed_rsv->space_info)
-		return -ENOSPC;
+		goto enospc;
 
 	spin_lock(&delayed_rsv->lock);
 	reclaim_bytes += delayed_rsv->reserved;
@@ -4855,17 +4863,14 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	bytes -= reclaim_bytes;
 
 	if (__percpu_counter_compare(&space_info->total_bytes_pinned,
-				   bytes,
-				   BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
-		return -ENOSPC;
-	}
-
+				     bytes,
+				     BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
+		goto enospc;
 commit:
-	trans = btrfs_join_transaction(fs_info->extent_root);
-	if (IS_ERR(trans))
-		return -ENOSPC;
-
 	return btrfs_commit_transaction(trans);
+enospc:
+	btrfs_end_transaction(trans);
+	return -ENOSPC;
 }
 
 /*
-- 
2.14.3
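The reshaped control flow, reduced to a compilable toy: the two boolean parameters stand in for the HAVE_FREE_BGS bit and the pinned-bytes comparison, and every failure path funnels through one label that ends the joined transaction.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct trans { int id; };

static int commit_transaction(struct trans *t)
{
    printf("committing transaction %d\n", t->id);
    return 0;
}

static void end_transaction(struct trans *t)
{
    printf("ending transaction %d, -ENOSPC\n", t->id);
}

/* pinned_enough / have_free_bgs model the two commit triggers. */
static int may_commit(struct trans *t, bool pinned_enough,
                      bool have_free_bgs)
{
    /* Joined up front, so both outcomes have a handle to act on. */
    if (have_free_bgs || pinned_enough)
        goto commit;
    /* ... further reclaim checks would sit here, each doing
     * "goto enospc" on failure instead of a bare return ... */
    goto enospc;
commit:
    return commit_transaction(t);
enospc:
    end_transaction(t);
    return -ENOSPC;
}

int main(void)
{
    struct trans t = { 42 };
    may_commit(&t, false, true);    /* free bgs pending: commit */
    may_commit(&t, false, false);   /* nothing reclaimable: ENOSPC */
    return 0;
}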
[PATCH 17/42] btrfs: run delayed iputs before committing
Delayed iputs means we can have final iputs of deleted inodes in the queue, which could potentially generate a lot of pinned space that could be free'd. So before we decide to commit the transaction for ENOSPC reasons, run the delayed iputs so that any potential space is free'd up. If there is and we freed enough we can then commit the transaction and potentially be able to make our reservation.

Signed-off-by: Josef Bacik
Reviewed-by: Omar Sandoval
---
 fs/btrfs/extent-tree.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 461b8076928b..eb64b28196b8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4823,6 +4823,15 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
 	if (!bytes)
 		return 0;
 
+	/*
+	 * If we have pending delayed iputs then we could free up a bunch of
+	 * pinned space, so make sure we run the iputs before we do our pinned
+	 * bytes check below.
+	 */
+	mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
+	btrfs_run_delayed_iputs(fs_info);
+	mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
+
 	trans = btrfs_join_transaction(fs_info->extent_root);
 	if (IS_ERR(trans))
 		return -ENOSPC;
-- 
2.14.3
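In miniature (toy queue, invented byte counts): deferred final-iput work is flushed into the pinned pool before that pool is sampled, so the commit decision sees the space those iputs can release.

#include <stdio.h>

#define QUEUED_IPUTS 3

static long pinned;     /* bytes a transaction commit would free */
static long iput_bytes[QUEUED_IPUTS] = { 4096, 8192, 16384 };
static int nqueued = QUEUED_IPUTS;

/* Each deferred final iput releases its inode's space into "pinned". */
static void run_delayed_iputs(void)
{
    while (nqueued > 0)
        pinned += iput_bytes[--nqueued];
}

int main(void)
{
    long need = 20000;

    /* Sampling before flushing would wrongly report ENOSPC. */
    printf("before iputs: pinned=%ld (need %ld)\n", pinned, need);
    run_delayed_iputs();
    printf("after iputs:  pinned=%ld -> %s\n", pinned,
           pinned >= need ? "commit" : "ENOSPC");
    return 0;
}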
[PATCH 12/42] btrfs: don't use global rsv for chunk allocation
We've done this forever because of the voodoo around knowing how much space we have. However, we have better ways of doing this now, and on normal file systems we'll easily have a global reserve of 512MiB, and since metadata chunks are usually 1GiB that means we'll allocate metadata chunks more readily. Instead use the actual used amount when determining if we need to allocate a chunk or not.

Signed-off-by: Josef Bacik
---
 fs/btrfs/extent-tree.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c9913c59686b..c0f6110419b2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4374,21 +4374,12 @@ static inline u64 calc_global_rsv_need_space(struct btrfs_block_rsv *global)
 static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
 			      struct btrfs_space_info *sinfo, int force)
 {
-	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
 	u64 bytes_used = btrfs_space_info_used(sinfo, false);
 	u64 thresh;
 
 	if (force == CHUNK_ALLOC_FORCE)
 		return 1;
 
-	/*
-	 * We need to take into account the global rsv because for all intents
-	 * and purposes it's used space.  Don't worry about locking the
-	 * global_rsv, it doesn't change except when the transaction commits.
-	 */
-	if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA)
-		bytes_used += calc_global_rsv_need_space(global_rsv);
-
 	/*
 	 * in limited mode, we want to have some free space up to
 	 * about 1% of the FS size.
-- 
2.14.3
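A sketch of the policy difference, with an invented flat 80% trigger in place of the kernel's real threshold math: counting a 512MiB global reserve as used space trips chunk allocation far earlier than counting only the bytes actually consumed.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy threshold: allocate once the space_info is ~80% consumed. */
static bool should_alloc_chunk(uint64_t total, uint64_t bytes_used)
{
    return bytes_used >= total * 4 / 5;
}

int main(void)
{
    uint64_t total = 1ULL << 30;            /* 1GiB metadata chunk */
    uint64_t used = 350ULL << 20;           /* 350MiB actually used */
    uint64_t global_rsv = 512ULL << 20;     /* 512MiB reserve */

    /* Old behaviour: reserve counted as used -> premature chunk. */
    printf("padded: %s\n",
           should_alloc_chunk(total, used + global_rsv) ? "alloc" : "skip");
    /* New behaviour: only real usage counts. */
    printf("actual: %s\n",
           should_alloc_chunk(total, used) ? "alloc" : "skip");
    return 0;
}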
[PATCH 09/42] btrfs: release metadata before running delayed refs
We want to release the unused reservation we have since it refills the delayed refs reserve, which will make everything go smoother when running the delayed refs if we're short on our reservation. Reviewed-by: Omar Sandoval Reviewed-by: Liu Bo Reviewed-by: Nikolay Borisov Signed-off-by: Josef Bacik --- fs/btrfs/transaction.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 117e0c4a914a..a0f19ca0bd6c 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -1922,6 +1922,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) return ret; } + btrfs_trans_release_metadata(trans); + trans->block_rsv = NULL; + /* make a pass through all the delayed refs we have so far * any runnings procs may add more while we are here */ @@ -1931,9 +1934,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) return ret; } - btrfs_trans_release_metadata(trans); - trans->block_rsv = NULL; - cur_trans = trans->transaction; /* -- 2.14.3
[PATCH 19/42] btrfs: set max_extent_size properly
From: Josef Bacik We can't use entry->bytes if our entry is a bitmap entry, we need to use entry->max_extent_size in that case. Fix up all the logic to make this consistent. Signed-off-by: Josef Bacik --- fs/btrfs/free-space-cache.c | 29 +++-- 1 file changed, 19 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index e077ad3b4549..2e96ee7da3ec 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -1770,6 +1770,18 @@ static int search_bitmap(struct btrfs_free_space_ctl *ctl, return -1; } +static void set_max_extent_size(struct btrfs_free_space *entry, + u64 *max_extent_size) +{ + if (entry->bitmap) { + if (entry->max_extent_size > *max_extent_size) + *max_extent_size = entry->max_extent_size; + } else { + if (entry->bytes > *max_extent_size) + *max_extent_size = entry->bytes; + } +} + /* Cache the size of the max extent in bytes */ static struct btrfs_free_space * find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes, @@ -1791,8 +1803,7 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes, for (node = &entry->offset_index; node; node = rb_next(node)) { entry = rb_entry(node, struct btrfs_free_space, offset_index); if (entry->bytes < *bytes) { - if (entry->bytes > *max_extent_size) - *max_extent_size = entry->bytes; + set_max_extent_size(entry, max_extent_size); continue; } @@ -1810,8 +1821,7 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes, } if (entry->bytes < *bytes + align_off) { - if (entry->bytes > *max_extent_size) - *max_extent_size = entry->bytes; + set_max_extent_size(entry, max_extent_size); continue; } @@ -1823,8 +1833,8 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes, *offset = tmp; *bytes = size; return entry; - } else if (size > *max_extent_size) { - *max_extent_size = size; + } else { + set_max_extent_size(entry, max_extent_size); } continue; } @@ -2684,8 +2694,7 @@ static u64 btrfs_alloc_from_bitmap(struct btrfs_block_group_cache *block_group, err = search_bitmap(ctl, entry, &search_start, &search_bytes, true); if (err) { - if (search_bytes > *max_extent_size) - *max_extent_size = search_bytes; + set_max_extent_size(entry, max_extent_size); return 0; } @@ -2722,8 +2731,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group_cache *block_group, entry = rb_entry(node, struct btrfs_free_space, offset_index); while (1) { - if (entry->bytes < bytes && entry->bytes > *max_extent_size) - *max_extent_size = entry->bytes; + if (entry->bytes < bytes) + set_max_extent_size(entry, max_extent_size); if (entry->bytes < bytes || (!entry->bitmap && entry->offset < min_start)) { -- 2.14.3
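Why entry->bytes misleads for bitmap entries, shown standalone (the 4KiB block size and the bit pattern are arbitrary): the byte count sums scattered free blocks, while allocation cares about the longest contiguous run, which is what max_extent_size records.

#include <stdint.h>
#include <stdio.h>

/* A bitmap entry's byte count includes every free block it tracks, but
 * those blocks need not be contiguous; the largest run can be smaller. */
static uint64_t max_run_bytes(const uint8_t *bitmap, int nbits,
                              uint64_t block_size)
{
    int best = 0, run = 0;

    for (int i = 0; i < nbits; i++) {
        if (bitmap[i / 8] & (1 << (i % 8)))
            run++;
        else
            run = 0;
        if (run > best)
            best = run;
    }
    return (uint64_t)best * block_size;
}

int main(void)
{
    uint8_t map[2] = { 0x55, 0x05 };    /* alternating free blocks */

    /* 6 free blocks -> 24 KiB total, but no run longer than one block */
    printf("total free: %d KiB, largest run: %llu KiB\n", 6 * 4,
           (unsigned long long)(max_run_bytes(map, 12, 4096) / 1024));
    return 0;
}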
[PATCH 05/42] btrfs: only count ref heads run in __btrfs_run_delayed_refs
We pick the number of ref's to run based on the number of ref heads, and only make the decision to stop once we've processed entire ref heads, so only count the ref heads we've run and bail once we've hit the number of ref heads we wanted to process. Signed-off-by: Josef Bacik --- fs/btrfs/extent-tree.c | 5 + 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 98f36dfeccb0..b32bd38390dd 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2592,6 +2592,7 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, spin_unlock(&delayed_refs->lock); break; } + count++; /* grab the lock that says we are going to process * all the refs for this head */ @@ -2605,7 +2606,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, */ if (ret == -EAGAIN) { locked_ref = NULL; - count++; continue; } } @@ -2633,7 +2633,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, unselect_delayed_ref_head(delayed_refs, locked_ref); locked_ref = NULL; cond_resched(); - count++; continue; } @@ -2651,7 +2650,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, return ret; } locked_ref = NULL; - count++; continue; } @@ -2702,7 +2700,6 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, } btrfs_put_delayed_ref(ref); - count++; cond_resched(); } -- 2.14.3
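The loop's unit of work, modeled in isolation (head and ref counts arbitrary): once a head is locked all of its refs run, so progress toward the target is counted per head; bumping the counter per ref would hit the target early while still finishing the current head anyway.

#include <stdio.h>

/* Each ref head carries several individual refs; once a head is locked
 * it is always processed completely. */
struct head { int nrefs; };

static int run_delayed_refs(struct head *heads, int nheads, int target)
{
    int count = 0, refs_run = 0;

    for (int i = 0; i < nheads; i++) {
        if (count >= target)
            break;
        count++;                        /* one unit of work per head */
        for (int r = 0; r < heads[i].nrefs; r++)
            refs_run++;                 /* the head is always finished */
    }
    printf("ran %d heads (%d refs) toward target %d\n",
           count, refs_run, target);
    return count;
}

int main(void)
{
    struct head heads[] = { { 3 }, { 1 }, { 7 }, { 2 } };

    run_delayed_refs(heads, 4, 2);      /* stops after 2 heads, 4 refs */
    return 0;
}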
Re: [PATCH v3 06/10] writeback: introduce super_operations->write_metadata
On Monday, January 29, 2018 2:36:15 PM IST Chandan Rajendra wrote:
> On Wednesday, January 3, 2018 9:59:24 PM IST Josef Bacik wrote:
> > On Wed, Jan 03, 2018 at 05:26:03PM +0100, Jan Kara wrote:
> > >
> > Oh ok well if that's the case then I'll fix this up to be a ratio,
> > test everything, and send it along probably early next week. Thanks,
> >
> Hi Josef,
>
> Did you get a chance to work on the next version of this patchset?
>

Josef, any updates on this and the "Kill Btree inode" patchset?

--
chandan