On 2019/4/23 10:50 PM, Filipe Manana wrote:
> On Tue, Apr 23, 2019 at 1:14 PM Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>>
>>
>>
>> On 2019/4/23 7:33 PM, David Sterba wrote:
>>> On Tue, Apr 23, 2019 at 10:16:32AM +0800, Qu Wenruo wrote:
>>>> On 2019/4/23 5:09 AM, Jakob Unterwurzacher wrote:
>>>>> I have a user who is reporting ENOSPC errors when running gocryptfs on
>>>>> top of btrfs (ticket: https://github.com/rfjakob/gocryptfs/issues/395 ).
>>>>>
>>>>> What is interesting is that the error gets thrown at write time. This
>>>>> is not supposed to happen, because gocryptfs does
>>>>>
>>>>> fallocate(..., FALLOC_FL_KEEP_SIZE, ...)
>>>>>
>>>>> before writing.
>>>>>
>>>>> I wrote a minimal reproducer in C:
>>>>> https://github.com/rfjakob/fallocate_write
>>>>> This is what it looks like on ext4:
>>>>>
>>>>> $ ../fallocate_write/fallocate_write
>>>>> reading from /dev/urandom
>>>>> writing to ./blob.379Q8P
>>>>> writing blocks of 132096 bytes each
>>>>> [...]
>>>>> fallocate failed: No space left on device
>>>>>
>>>>> On btrfs, it will instead look like this:
>>>>>
>>>>> [...]
>>>>> pwrite failed: No space left on device
>>>>>
>>>>> Is this a bug in btrfs' fallocate implementation, or am I reading the
>>>>> guarantees that fallocate gives me wrong?
>>>>
>>>> Since v4.7, this commit changed how btrfs does the nodatacow check:
>>>> c6887cd11149 ("Btrfs: don't do nocow check unless we have to").
>>>>
>>>> Before that commit, btrfs always checked whether it needed to reserve
>>>> space for COW, while after that patch, btrfs never checks unless we
>>>> have no space.
>>>>
>>>> However, this screws up other nodatacow space checks.
>>>> And due to its age and the depth of the changeset, it's pretty hard to
>>>> fix. I have tried several times, but every attempt only caused more
>>>> problems.
>>>
>>> What if the commit is reverted, if the problem is otherwise hard to fix?
>>
>> Tried reverting it, but all the other problems came up.
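The contract the reproducer relies on can be sketched as follows. First reserve a range with `fallocate(FALLOC_FL_KEEP_SIZE)`, which allocates space without changing the file size, then fill the range with `pwrite()`. Which of the two calls reports ENOSPC is exactly the ext4/btrfs difference reported above. This is a minimal sketch under stated assumptions, not the code from the fallocate_write repository. The function name `write_block_preallocated` and the `PREALLOC_*` result codes are hypothetical.

```c
#define _GNU_SOURCE            /* needed for the fallocate() declaration in <fcntl.h> */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>      /* FALLOC_FL_KEEP_SIZE */
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes: which call reported the error. */
enum prealloc_result {
    PREALLOC_OK = 0,
    PREALLOC_FALLOCATE_FAILED = 1,  /* "fallocate failed: ..." in the reproducer output */
    PREALLOC_PWRITE_FAILED = 2      /* "pwrite failed: ..." in the reproducer output */
};

/*
 * Reserve [off, off + len) without changing the file size, then write into
 * that range. The expectation discussed in this thread is that once
 * fallocate() has succeeded, the pwrite() into the same range should not fail
 * with ENOSPC. errno is left as set by the failing call.
 */
static int write_block_preallocated(int fd, const void *buf, size_t len, off_t off)
{
    if (fallocate(fd, FALLOC_FL_KEEP_SIZE, off, (off_t)len) != 0)
        return PREALLOC_FALLOCATE_FAILED;

    size_t done = 0;
    while (done < len) {
        ssize_t n = pwrite(fd, (const char *)buf + done, len - done, off + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before writing anything; retry */
            return PREALLOC_PWRITE_FAILED;
        }
        done += (size_t)n;         /* handle short writes */
    }
    return PREALLOC_OK;
}
```

On an affected btrfs, the report above corresponds to this helper returning `PREALLOC_PWRITE_FAILED` with `errno == ENOSPC` instead of `PREALLOC_FALLOCATE_FAILED`.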
>
> I haven't seen an explanation of why that patch causes ENOSPC or what
> nodatacow space check screw-ups it causes.
>
> It seems fine to me, and what we currently do is:
>
> 1) For any buffered write, check if there's enough free data space;
> 2) If not, try to allocate a new data chunk;
> 3) If that fails, check if the file has the "have prealloc extents"
> flag or has the nodatacow flag set;
> 4) If either of those conditions is true, check if we can write to the
> existing extent: if it's not shared and no checksums exist in its
> range, meaning it's an unwritten (prealloc) extent, return success to
> userspace.
>
> So what's wrong with it? And how does it cause the ENOSPC?
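The four steps above can be modeled as a small userspace decision function. This is a toy model for reasoning about the thread, not btrfs kernel code. Every struct, field, and function name is hypothetical, and space is tracked as plain byte counters. It covers the scenario Qu describes below (no free data space and no room for a new chunk): the step 3/4 fallback still lets the write into an unshared, checksum-free prealloc extent succeed. A shared extent (the snapshot/reflink cases Filipe raises) ends in ENOSPC.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified state; none of these names are btrfs internals. */
struct toy_fs {
    uint64_t free_data;    /* unreserved data space, in bytes */
    uint64_t unallocated;  /* raw space still available for new data chunks */
    uint64_t chunk_size;   /* size of one newly allocated data chunk */
};

struct toy_inode {
    bool has_prealloc_extents; /* stands in for the "have prealloc extents" flag */
    bool nodatacow;            /* stands in for the nodatacow inode flag */
};

struct toy_extent {
    bool shared;     /* referenced elsewhere, e.g. by a snapshot or reflink */
    bool has_csums;  /* checksums exist in the range being written */
};

/* Returns 0 if the buffered write may proceed, -ENOSPC otherwise. */
static int toy_reserve_for_buffered_write(struct toy_fs *fs, const struct toy_inode *ino,
                                          const struct toy_extent *ext, uint64_t len)
{
    /* 1) Enough free data space: reserve it for a regular COW write. */
    if (fs->free_data >= len) {
        fs->free_data -= len;
        return 0;
    }
    /* 2) Otherwise try to allocate a new data chunk and retry the reservation. */
    if (fs->unallocated >= fs->chunk_size) {
        fs->unallocated -= fs->chunk_size;
        fs->free_data += fs->chunk_size;
        if (fs->free_data >= len) {
            fs->free_data -= len;
            return 0;
        }
    }
    /* 3) Only files with prealloc extents or nodatacow may write in place. */
    if (!ino->has_prealloc_extents && !ino->nodatacow)
        return -ENOSPC;
    /* 4) Write in place (NOCOW) only if the extent is unshared and has no checksums. */
    if (ext != NULL && !ext->shared && !ext->has_csums)
        return 0;
    return -ENOSPC;
}
```

In this model, ENOSPC on a pwrite into a preallocated range can only come from step 4 rejecting the extent. That is why the shared-extent cases matter.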
For example, say we have a 128M preallocated file extent, and assume the fs
only has 128M of free data space, meaning 0 remaining space at all.

Then we try a buffered write, which will just fail, as it needs data space.

The problem is always there for fallocate/pwrite; the only difference is
the timing of when the ENOSPC happens.

btrfs/153 has been failing for a long time for the same reason. Although
that failure comes from quota, the cause is exactly the same.

Thanks,
Qu

>
> Trying the reproducer, at least on a 5.0 kernel, it never fails on a
> pwrite for me, but always on fallocate:
>
> $ mkfs.btrfs -f -b $((4 * 1024 * 1024 * 1024)) /dev/sdi
> $ mount /dev/sdi /mnt/sdi
> $ cd /mnt/sdi
> $ /path/to/reproducer
> reading from /dev/urandom
> writing to ./blob.IIa6tH
> writing blocks of 132096 bytes each
> total 125 MiB, 65.52 MiB/s
> total 251 MiB, 44.59 MiB/s
> total 377 MiB, 55.23 MiB/s
> total 503 MiB, 66.21 MiB/s
> total 629 MiB, 59.97 MiB/s
> total 755 MiB, 3.70 MiB/s
> total 881 MiB, 50.24 MiB/s
> total 1007 MiB, 64.51 MiB/s
> total 1133 MiB, 50.70 MiB/s
> total 1259 MiB, 49.29 MiB/s
> total 1385 MiB, 47.93 MiB/s
> total 1511 MiB, 4.00 MiB/s
> total 1637 MiB, 49.85 MiB/s
> total 1763 MiB, 48.11 MiB/s
> total 1889 MiB, 66.62 MiB/s
> total 2015 MiB, 5.60 MiB/s
> total 2141 MiB, 19.58 MiB/s
> total 2267 MiB, 64.80 MiB/s
> total 2393 MiB, 13.23 MiB/s
> total 2519 MiB, 14.95 MiB/s
> fallocate failed: No space left on device
>
> So either that was tested on a rather old kernel or:
>
> 1) we had snapshotting happening between a fallocate and a pwrite (or
> at the same time as the pwrite);
> 2) the unwritten/prealloc extent was reflinked (cp --reflink, clone or
> dedupe ioctls) before (or during) the pwrite.
>
> What did I miss here?
>
> Thanks.
>
>>
>> E.g. reserved space underflow.
>>
>> I'll find the old thread and try again.
>>
>> Thanks,
>> Qu
>>
>>> This seems to break the semantics of fallocate, so performance should
>>> not be the main concern here.
>>>
>>
>
>
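To test Filipe's two alternative explanations (a snapshot or a reflink making the prealloc extent shared), one option is to inspect the file's extents from userspace with the FIEMAP ioctl. Its per-extent flags include `FIEMAP_EXTENT_UNWRITTEN` (preallocated, not yet written) and `FIEMAP_EXTENT_SHARED`. The sketch below is a hypothetical diagnostic helper, not part of the reproducer. The names `summarize_extents`, `summarize_file`, and `struct extent_summary`, plus the 64-extent cap per call, are assumptions. Not every filesystem implements FIEMAP, so failure is reported through errno.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>   /* struct fiemap, FIEMAP_* flags */

struct extent_summary {
    unsigned total;     /* extents returned by FIEMAP */
    unsigned unwritten; /* FIEMAP_EXTENT_UNWRITTEN: preallocated, not yet written */
    unsigned shared;    /* FIEMAP_EXTENT_SHARED: also referenced elsewhere */
};

/* Returns 0 on success, -1 with errno set if FIEMAP fails or is unsupported. */
static int summarize_extents(int fd, uint64_t start, uint64_t len, struct extent_summary *out)
{
    enum { MAX_EXTENTS = 64 };  /* assumption: only the first 64 extents are inspected */
    size_t size = sizeof(struct fiemap) + MAX_EXTENTS * sizeof(struct fiemap_extent);
    struct fiemap *fm = calloc(1, size);
    if (fm == NULL)
        return -1;
    fm->fm_start = start;
    fm->fm_length = len;
    fm->fm_flags = FIEMAP_FLAG_SYNC;  /* flush dirty data so written ranges are reported */
    fm->fm_extent_count = MAX_EXTENTS;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) != 0) {
        int saved = errno;
        free(fm);
        errno = saved;
        return -1;
    }
    memset(out, 0, sizeof *out);
    for (uint32_t i = 0; i < fm->fm_mapped_extents; i++) {
        uint32_t flags = fm->fm_extents[i].fe_flags;
        out->total++;
        if (flags & FIEMAP_EXTENT_UNWRITTEN)
            out->unwritten++;
        if (flags & FIEMAP_EXTENT_SHARED)
            out->shared++;
    }
    free(fm);
    return 0;
}

/* Convenience wrapper: summarize the extents of the whole file at `path`. */
static int summarize_file(const char *path, struct extent_summary *out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = summarize_extents(fd, 0, FIEMAP_MAX_OFFSET, out);
    int saved = errno;
    close(fd);
    errno = saved;
    return rc;
}
```

Running `summarize_file()` on the reproducer's `blob.*` file right after a failed pwrite would show whether any extent in it is reported as shared.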