On 2019/4/23 7:33 PM, David Sterba wrote:
> On Tue, Apr 23, 2019 at 10:16:32AM +0800, Qu Wenruo wrote:
>> On 2019/4/23 5:09 AM, Jakob Unterwurzacher wrote:
>>> I have a user who is reporting ENOSPC errors when running gocryptfs on
>>> top of btrfs (ticket: https://github.com/rfjakob/gocryptfs/issues/395 ).
>>>
>>> What is interesting is that the error gets thrown at write time. This
>>> is not supposed to happen, because gocryptfs does
>>>
>>>     fallocate(..., FALLOC_FL_KEEP_SIZE, ...)
>>>
>>> before writing.
>>>
>>> I wrote a minimal reproducer in C:
>>> https://github.com/rfjakob/fallocate_write
>>> This is what it looks like on ext4:
>>>
>>>     $ ../fallocate_write/fallocate_write
>>>     reading from /dev/urandom
>>>     writing to ./blob.379Q8P
>>>     writing blocks of 132096 bytes each
>>>     [...]
>>>     fallocate failed: No space left on device
>>>
>>> On btrfs, it will instead look like this:
>>>
>>>     [...]
>>>     pwrite failed: No space left on device
>>>
>>> Is this a bug in btrfs' fallocate implementation or am I reading the
>>> guarantees that fallocate gives me wrong?
>>
>> Since v4.7, this commit changed how btrfs does the nodatacow check:
>> c6887cd11149 ("Btrfs: don't do nocow check unless we have to").
>>
>> Before that commit, btrfs always checked whether it needed to reserve
>> space for COW, while after that patch, btrfs never checks unless we
>> have no space.
>>
>> However, this screws up other nodatacow space checks.
>> And due to its age and deep changeset, it's pretty hard to fix.
>> I have tried several times, but it only causes more problems.
>
> What if the commit is reverted, if the problem is otherwise hard to fix?
> This seems to break the semantics of fallocate so the performance should
> not be the main concern here.

My vague memory of the underflow case is something like the following
(I failed to locate the old thread):

- fallocate
- pwrite into the preallocated range
  At this point we can do nocow, so no data space is reserved.
- Something happens that makes the preallocated extent shared, without
  writing back the dirty pages.
  Possible causes are snapshot and reflink.

  However, nowadays snapshot writes back all dirty inodes, and reflink
  writes the source range to disk.
  Maybe there is a small window inside create_snapshot(), between the
  btrfs_start_delalloc_snapshot() and btrfs_commit_transaction() calls?
- The dirty pages get written back
  We create an ordered extent, but at this point we can no longer do
  nocow and need to fall back to COW.
  However, at buffered write time we didn't reserve any data space,
  so now we underflow the data space reservation.

However, nowadays there are some new mechanisms to handle this case more
gracefully, like btrfs_root::will_be_snapshotted.

I'll double check whether reverting that patch on the latest kernel still
causes problems. But any ideas on the possible problem are welcome.

Thanks,
Qu
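
For reference, the first two steps of the sequence above (fallocate with
FALLOC_FL_KEEP_SIZE, then pwrite into the same range) can be sketched in C
roughly as below. This is only an illustration of the pattern, not the
reproducer from the fallocate_write repository; the helper name
prealloc_then_write and its return codes are made up for this sketch.

```c
#define _GNU_SOURCE             /* needed for fallocate() and FALLOC_FL_KEEP_SIZE */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original thread): preallocate `len`
 * bytes at offset `off` without changing the file size, then write `len`
 * bytes from `buf` into that same range.
 *
 * Returns 0 on success, -1 if fallocate() fails, -2 if pwrite() fails.
 * The caller would expect ENOSPC to show up as -1, never as -2.
 */
int prealloc_then_write(int fd, off_t off, const char *buf, size_t len)
{
    assert(fd >= 0 && buf != NULL);

    /* Step 1: reserve the space up front; the file size stays unchanged. */
    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, off, (off_t)len) != 0) {
        fprintf(stderr, "fallocate failed: %s\n", strerror(errno));
        return -1;
    }

    /* Step 2: write into the preallocated range, handling short writes. */
    size_t done = 0;
    while (done < len) {
        ssize_t n = pwrite(fd, buf + done, len - done, off + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the same chunk */
            fprintf(stderr, "pwrite failed: %s\n", strerror(errno));
            return -2;
        }
        done += (size_t)n;
    }
    return 0;
}
```

Calling this in a loop on a nearly full filesystem and checking which
return code carries the ENOSPC would distinguish the ext4 behaviour
(failure in fallocate) from the btrfs behaviour reported above (failure
in pwrite).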