On 2019/4/25 下午9:25, Josef Bacik wrote: [snip] >>> >>> What if the commit is reverted, if the problem is otherwise hard to fix? >>> This seems to break the semantics of fallocate so the performance should >>> not the main concern here. >> > > Are we sure the ENOSPC is coming from the data reservation? That change makes > us fall back on the old behavior, which means we should still succeed at > making > the data reservation. > > However it fallocate() _does not_ guarantee you won't fail the metadata > reservation, I suspect that may be what you are running into.
For this script, we only needs 4 file extents at most. Even the initial 8M metadata should be pretty enough, thus I don't think it's metadata causing the problem. --- #!/bin/bash dev=/dev/test/test mnt=/mnt/btrfs mkfs.btrfs -f $dev -b 512M mount $dev $mnt fallocate -l 384M $mnt/file1 echo "fallocate success" sync dd if=/dev/zero bs=512K oflag=direct conv=notrunc count=768 of=$mnt/file2 umount $mnt --- > >> My blur memory of the underflow case is something like below: (failed to >> locate the old thread) >> >> - fallocate >> - pwrite in to the reallocated range >> At this timing, we can do nocow, thus no data space is reserved. >> >> - Something happened to make that preallocated extent shared, without >> writing back dirty pages. >> Some possible causes are snapshot and reflink. >> However nowadays, snapshots will write all dirty inodes, and reflink >> will write the source range to disk. >> >> Maybe it's a small window inside create_snapshot() between >> btrfs_start_delalloc_snapshot() and btrfs_commit_transaction() calls? >> >> - dirty pages get written back >> We created ordered extent, but at this timing, we can't do nocow any >> more, we need to fallback to cow. >> However at the buffered write timing, we didn't reserved data space. >> Now we will underflow data space reservation. >> >> However nowadays there are some new mechanism to handle this case more >> gracefully, like btrfs_root::will_be_snapshotted. >> >> I'll double check if reverting that patch in latest kernel still cause >> problem. >> But any idea on the possible problem is welcomed. >> > > Reading the code there's two scenarios that happen. All of our down stream > stuff assumes that we've updated ->bytes_may_use for our data write. So if we > fail our reservation and do the nocow thing of skipping our reservation we can > overflow if we > > 1) Need to allocate an extent anyway because of reflink/snapshot. > btrfs_add_reserved_space() expects that space_info->bytes_may_use has our > region > in it, so in this case it doesn't and we underflow here. I think you are > right > in that we do all dirty writeback nowadays so this is less of an issue, buuuut > > 2) In run_delalloc_nocow we do EXTENT_CLEAR_DATA_RESV unconditionally if we > did > manage to do a nocow. If we fell back on the no reserve case then this would > underflow our ->bytes_may_use counter here. Right, I missed this case. Thanks for pointing this out. > > Off the top of my head I say we just add our write_bytes to ->bytes_may_use if > we use the nocow path. If we're already failing to reserve data space as it > is > then there's no harm in making it appear like we have less space by inflating > ->bytes_may_use. This is the straightforward fix for the underflow, and we > could come up with something more crafty later, like setting the range with > EXTENT_NO_DATA_RESERVE and doing magic later with ->bytes_may_use. Thanks, Sounds pretty valid to me. Thanks for the idea, Qu > > Josef >
signature.asc
Description: OpenPGP digital signature