On Thu, Apr 25, 2019 at 9:17 AM Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>
> On 2019/4/23 7:33 PM, David Sterba wrote:
> > On Tue, Apr 23, 2019 at 10:16:32AM +0800, Qu Wenruo wrote:
> >> On 2019/4/23 5:09 AM, Jakob Unterwurzacher wrote:
> >>> I have a user who is reporting ENOSPC errors when running gocryptfs on
> >>> top of btrfs (ticket: https://github.com/rfjakob/gocryptfs/issues/395 ).
> >>>
> >>> What is interesting is that the error gets thrown at write time. This
> >>> is not supposed to happen, because gocryptfs does
> >>>
> >>>     fallocate(..., FALLOC_FL_KEEP_SIZE, ...)
> >>>
> >>> before writing.
> >>>
> >>> I wrote a minimal reproducer in C:
> >>> https://github.com/rfjakob/fallocate_write
> >>> This is what it looks like on ext4:
> >>>
> >>>     $ ../fallocate_write/fallocate_write
> >>>     reading from /dev/urandom
> >>>     writing to ./blob.379Q8P
> >>>     writing blocks of 132096 bytes each
> >>>     [...]
> >>>     fallocate failed: No space left on device
> >>>
> >>> On btrfs, it will instead look like this:
> >>>
> >>>     [...]
> >>>     pwrite failed: No space left on device
> >>>
> >>> Is this a bug in btrfs' fallocate implementation or am I reading the
> >>> guarantees that fallocate gives me wrong?
> >>
> >> Since v4.7, this commit changed how btrfs does the nodatacow check:
> >> c6887cd11149 ("Btrfs: don't do nocow check unless we have to").
> >>
> >> Before that commit, btrfs always checked whether it needed to reserve
> >> space for COW, while after that patch, btrfs never checks unless we
> >> have no space.
> >>
> >> However, this screws up other nodatacow space checks.
> >> And due to its age and deep changeset, it's pretty hard to fix.
> >> I have tried several times, but it only causes more problems.
> >
> > What if the commit is reverted, if the problem is otherwise hard to fix?
> > This seems to break the semantics of fallocate, so performance should
> > not be the main concern here.
>
> My blurry memory of the underflow case is something like below
> (failed to locate the old thread):
>
> - fallocate
> - pwrite into the preallocated range
>   At this point, we can do nocow, thus no data space is reserved.
>
> - Something happens that makes that preallocated extent shared, without
>   writing back dirty pages.
>   Some possible causes are snapshot and reflink.
>   However, nowadays snapshots will write all dirty inodes, and reflink
>   will write the source range to disk.

Nowadays? It's been like that for 11 years now.
It's been like that in clone since it was introduced (2008, commit
f2eb0a241f0e5c135d93243b0236cb1f14c305e0) and in the snapshot creation
ioctl since 2008 as well (commit dc17ff8f11d129db9e83ab7244769e4eae05e14d).

>
>   Maybe it's a small window inside create_snapshot() between the
>   btrfs_start_delalloc_snapshot() and btrfs_commit_transaction() calls?
>
> - dirty pages get written back
>   We create an ordered extent, but at this point we can't do nocow any
>   more, so we need to fall back to cow.
>   However, at buffered write time, we didn't reserve data space.
>   Now we will underflow the data space reservation.
>
> However, nowadays there are some new mechanisms to handle this case more
> gracefully, like btrfs_root::will_be_snapshotted.
>
> I'll double check whether reverting that patch in the latest kernel
> still causes problems.
> But any idea on the possible problem is welcome.

To me it seems the problem is not yet well formulated, therefore it's
hard to give ideas/suggestions.

The case you pointed out isn't related to the issue reported by Jakob,
since his issue involves only a single file (I couldn't reproduce it
anyway).

So what's your explanation for Jakob's test case, which happens for him
on a fresh filesystem with a single file? I could only see the potential
bytes_may_use counter leak issue I mentioned previously.

Perhaps creating a test case for fstests will make it clear and avoid so
many replies back and forth in this thread and others.

Thanks

>
> Thanks,
> Qu
>

--
Filipe David Manana,

“Whether you think you can, or you think you can't - you're right.”
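
For readers following the thread, the pattern under discussion (preallocate with FALLOC_FL_KEEP_SIZE, then pwrite into the same range) can be sketched roughly as below. This is only an illustration and not the code from the fallocate_write reproducer; the helper name prealloc_then_write and its failed_step parameter are made up for this sketch. The expectation Jakob describes is that ENOSPC, if any, surfaces at the fallocate step rather than at the pwrite step.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE           /* exposes fallocate() and FALLOC_FL_KEEP_SIZE */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the reproducer): preallocate
 * [off, off + len) without changing the file size, then write len bytes
 * from buf into that range.
 *
 * Returns 0 on success or a negative errno value. When failed_step is
 * non-NULL, it is set to "fallocate" or "pwrite" to name the failing call.
 */
static int prealloc_then_write(int fd, off_t off, const void *buf,
                               size_t len, const char **failed_step)
{
    const char *p = buf;
    size_t done = 0;

    /* Step 1: reserve space up front; ENOSPC is expected to show up here. */
    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, off, (off_t)len) != 0) {
        if (failed_step)
            *failed_step = "fallocate";
        return -errno;
    }

    /* Step 2: write into the preallocated range, handling short writes. */
    while (done < len) {
        ssize_t n = pwrite(fd, p + done, len - done, off + (off_t)done);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, retry the same chunk */
            if (failed_step)
                *failed_step = "pwrite";
            return -errno;
        }
        if (n == 0) {           /* no progress, treat as an I/O error */
            if (failed_step)
                *failed_step = "pwrite";
            return -EIO;
        }
        done += (size_t)n;
    }
    return 0;
}
```

A caller would open a file on the filesystem under test, call the helper in a loop with increasing offsets until it returns -ENOSPC, and then look at failed_step: "fallocate" matches the ext4 output quoted above, "pwrite" matches the btrfs output.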