On Thu, Apr 25, 2019 at 01:49:23PM +0800, Qu Wenruo wrote:
> 
> 
> > On 2019/4/23 7:33 PM, David Sterba wrote:
> > On Tue, Apr 23, 2019 at 10:16:32AM +0800, Qu Wenruo wrote:
> >> On 2019/4/23 5:09 AM, Jakob Unterwurzacher wrote:
> >>> I have a user who is reporting ENOSPC errors when running gocryptfs on
> >>> top of btrfs (ticket: https://github.com/rfjakob/gocryptfs/issues/395 ).
> >>>
> >>> What is interesting is that the error gets thrown at write time. This
> >>> is not supposed to happen, because gocryptfs does
> >>>
> >>>     fallocate(..., FALLOC_FL_KEEP_SIZE, ...)
> >>>
> >>> before writing.
> >>>
> >>> I wrote a minimal reproducer in C: 
> >>> https://github.com/rfjakob/fallocate_write
> >>> This is what it looks like on ext4:
> >>>
> >>>     $ ../fallocate_write/fallocate_write
> >>>     reading from /dev/urandom
> >>>     writing to ./blob.379Q8P
> >>>     writing blocks of 132096 bytes each
> >>>     [...]
> >>>     fallocate failed: No space left on device
> >>>
> >>> On btrfs, it will instead look like this:
> >>>
> >>>     [...]
> >>>     pwrite failed: No space left on device
> >>>
> >>> Is this a bug in btrfs' fallocate implementation, or am I misreading the
> >>> guarantees that fallocate gives me?
> >>
> >> Since v4.7, this commit changed how btrfs does the NODATACOW check:
> >> c6887cd11149 ("Btrfs: don't do nocow check unless we have to").
> >>
> >> Before that commit, btrfs always checked whether it needs to reserve space
> >> for COW, while after that patch, btrfs only does the check once we have run
> >> out of space.
> >>
> >> However this screws up the other nodatacow space checks.
> >> And due to its age and the depth of changes built on top of it, it's pretty
> >> hard to fix.  I have tried several times, but each attempt only caused more
> >> problems.
> > 
> > What if the commit is reverted, if the problem is otherwise hard to fix?
> > This seems to break the semantics of fallocate, so performance should
> > not be the main concern here.
> 

Are we sure the ENOSPC is coming from the data reservation?  That change makes
us fall back on the old behavior, which means we should still succeed at making
the data reservation.

However, fallocate() _does not_ guarantee that you won't fail the metadata
reservation; I suspect that may be what you are running into.
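
For reference, the failing pattern from the reproducer boils down to roughly
the below.  This is only a minimal sketch of the fallocate-then-pwrite
sequence, not the actual fallocate_write source; the file name, block size,
and error handling are placeholders:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[132096];
    off_t off = 0;
    int fd = open("blob", O_CREAT | O_WRONLY | O_TRUNC, 0600);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    memset(buf, 0xaa, sizeof(buf));

    for (;;) {
        /* Reserve the range up front, without changing i_size. */
        if (fallocate(fd, FALLOC_FL_KEEP_SIZE, off, sizeof(buf))) {
            perror("fallocate failed");    /* ext4 hits ENOSPC here */
            break;
        }
        /* A write into the preallocated range should not ENOSPC... */
        if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf)) {
            perror("pwrite failed");       /* ...but does on btrfs */
            break;
        }
        off += sizeof(buf);
    }
    close(fd);
    return 0;
}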

> My blurry memory of the underflow case is something like the below (I failed
> to locate the old thread):
> 
> - fallocate
> - pwrite into the preallocated range
>   At this point, we can do nocow, thus no data space is reserved.
> 
> - Something happens to make that preallocated extent shared, without the
>   dirty pages being written back.
>   Possible causes are snapshots and reflink.
>   However, nowadays snapshots will write back all dirty inodes, and reflink
>   will write the source range to disk.
> 
>   Maybe it's a small window inside create_snapshot(), between the
>   btrfs_start_delalloc_snapshot() and btrfs_commit_transaction() calls?
> 
> - dirty pages get written back
>   We create the ordered extent, but at this point we can't do nocow any
>   more, so we need to fall back to cow.
>   However, at buffered write time we didn't reserve data space, so now we
>   will underflow the data space reservation.
> 
> However, nowadays there are some new mechanisms to handle this case more
> gracefully, like btrfs_root::will_be_snapshotted.
> 
> I'll double check whether reverting that patch on the latest kernel still
> causes problems.
> But any ideas on the possible problem are welcome.
> 

Reading the code, there are two scenarios that can happen.  All of our
downstream stuff assumes that we've updated ->bytes_may_use for our data write.
So if we fail our reservation and do the nocow thing of skipping our
reservation, we can underflow if we

1) Need to allocate an extent anyway because of reflink/snapshot.
btrfs_add_reserved_space() expects that space_info->bytes_may_use has our
region in it, but in this case it doesn't, so we underflow here.  I think you
are right in that we do all the dirty writeback nowadays, so this is less of an
issue, buuuut

2) In run_delalloc_nocow we do EXTENT_CLEAR_DATA_RESV unconditionally if we did
manage to do a nocow.  If we fell back on the no reserve case then this would
underflow our ->bytes_may_use counter here.
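
To make the accounting mismatch concrete, here is a toy userspace model of the
->bytes_may_use bookkeeping.  It is purely illustrative, not btrfs code: the
reservation that would normally bump the counter is skipped on the nocow path,
but the release at writeback time happens unconditionally, so the counter ends
up lower than it started (and since the real counter is a u64, it actually
wraps around to a huge value):

#include <stdio.h>

/* Stands in for space_info->bytes_may_use; signed here so the underflow
 * shows up as a negative number instead of wrapping. */
static long long bytes_may_use;

static void buffered_write(long long len, int nocow)
{
    if (!nocow)
        bytes_may_use += len;   /* the normal COW path reserves here */
    /* the nocow fallback path skips the reservation entirely */
}

static void writeback(long long len)
{
    /* the EXTENT_CLEAR_DATA_RESV-style release happens unconditionally */
    bytes_may_use -= len;
}

int main(void)
{
    buffered_write(132096, 1);  /* nocow write, nothing reserved */
    writeback(132096);          /* ...but the release still happens */
    printf("bytes_may_use = %lld\n", bytes_may_use);    /* -132096 */
    return 0;
}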

Off the top of my head I say we just add our write_bytes to ->bytes_may_use if
we use the nocow path.  If we're already failing to reserve data space as it is
then there's no harm in making it appear like we have less space by inflating
->bytes_may_use.  This is the straightforward fix for the underflow, and we
could come up with something more crafty later, like setting the range with
EXTENT_NO_DATA_RESERVE and doing magic later with ->bytes_may_use.
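
In terms of the toy model above, that straightforward fix amounts to something
like the below.  Again this only sketches the idea, it is not an actual patch;
in the kernel it would mean bumping space_info->bytes_may_use (under the
space_info lock) whenever we take the no-reservation nocow path:

/* Replacement for buffered_write() in the toy model above: account the
 * nocow write as if it had been reserved, so the unconditional release
 * at writeback no longer underflows. */
static void buffered_write_fixed(long long len, int nocow)
{
    /* Bump the counter even when we take the nocow path... */
    bytes_may_use += len;
    (void)nocow;
    /* ...so writeback()'s unconditional release balances out.  Worst
     * case we temporarily look like we have a bit less free space. */
}

Thanks,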

Josef
