On Tue, Apr 23, 2019 at 09:21:09PM +0200, Jakob Unterwurzacher wrote:

> > Trying the reproducer, at least on a 5.0 kernel, does never fail on a
> > pwrite for me, but always on fallocate:
> [...]
> > So either that was tested on a rather old kernel or:
> >
> > 1) we had snapshotting happening between a fallocate and a pwrite (or
> > at the same time as the pwrite)
> > 2) before the pwrite (or during) the unwritten/prealloc extent was
> > reflinked (cp --reflink, clone or dedupe ioctls)
> 
> I am at Linux 5.0.4-200.fc29.x86_64, the user in the github ticket is
> at Linux 5.0.7-arch1-1-ARCH, so pretty recent.
> There should be no snapshot or reflink or really any other activity on
> the test filesystem.
> 
> Maybe the difference is that I am testing on a file and you on a raw
> block device?
> This is how things look at 4GB size:
> 
> $ dd if=/dev/zero of=img bs=1M count=5000
> $ mkfs.btrfs -f -b $((4 * 1024 * 1024 * 1024)) img
> $ mkdir mnt
> $ sudo mount img mnt
> $ sudo chmod 777 mnt
> $ cd mnt
> $ ../fallocate_write/fallocate_write
> reading from /dev/urandom
> writing to ./blob.qEaSZl
> writing blocks of 132096 bytes each

132096 is 129 * 1024, which is not a multiple of 4K, so consecutive
pwrites share a 4K block at their boundary.  Whenever that shared block
is written twice in separate transactions (or with an fsync in between),
the second write is a CoW operation.
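
The alignment arithmetic is easy to check in the shell.  This is only a
sketch: it assumes a 4K block size and that write number i starts at
offset i * 132096 (I have not checked how the reproducer actually lays
out its writes), and the variable names are made up for illustration.

```shell
# Assumed values: write size from the reproducer output, 4K block size.
write_size=132096
block_size=4096

# Bytes of each write that spill past the last full 4K block.
remainder=$((write_size % block_size))
echo "remainder: $remainder bytes"

# Count how many of the first 8 write boundaries land inside a 4K block,
# assuming write i starts at offset i * write_size (hypothetical layout).
unaligned=0
i=1
while [ "$i" -le 8 ]; do
    offset=$((i * write_size))
    if [ $((offset % block_size)) -ne 0 ]; then
        unaligned=$((unaligned + 1))
    fi
    i=$((i + 1))
done
echo "unaligned boundaries out of 8: $unaligned"
```

Under those assumptions each write leaves a 1024-byte tail, so three out
of every four boundaries fall in the middle of a 4K block, and that block
is touched by two different pwrites.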

Also, fallocate only works _once_ on btrfs.  After the first write,
prealloc extents are replaced with ordinary CoW extent (ref)s, and the
fallocate no-ENOSPC guarantee is gone:

        # fallocate -l 1m foo
        # sync
        # fiewalk foo 
        File: foo
        Extent { begin = 0x0, end = 0x100000, physical = 0x4aedc01000, flags = Extent::PREALLOC|FIEMAP_EXTENT_LAST, physical_len = 0x100000, logical_len = 0x100000 }
        # head -c 128k /dev/urandom | dd conv=notrunc of=foo 
        256+0 records in
        256+0 records out
        131072 bytes (131 kB, 128 KiB) copied, 0.00201152 s, 65.2 MB/s
        # sync
        # fiewalk foo 
        File: foo
        Extent { begin = 0x0, end = 0x20000, physical = 0x4aedc01000, flags = 0, physical_len = 0x100000, logical_len = 0x20000 }
        Extent { begin = 0x20000, end = 0x100000, physical = 0x4aedc21000, flags = Extent::PREALLOC|FIEMAP_EXTENT_LAST, physical_len = 0x100000, logical_len = 0xe0000, offset = 0x20000 }

Here we see the first 128K is written in place at the same physical
address, but the extent loses the PREALLOC flag.  A second write to the
same range will trigger CoW, and a new data extent will be allocated:

        # head -c 128k /dev/urandom | dd conv=notrunc of=foo 
        256+0 records in
        256+0 records out
        131072 bytes (131 kB, 128 KiB) copied, 0.00187461 s, 69.9 MB/s
        # sync
        # fiewalk foo 
        File: foo
        Extent { begin = 0x0, end = 0x20000, physical = 0x4ae5f00000, flags = 0, physical_len = 0x20000, logical_len = 0x20000 }
        Extent { begin = 0x20000, end = 0x100000, physical = 0x4aedc21000, flags = Extent::PREALLOC|FIEMAP_EXTENT_LAST, physical_len = 0x100000, logical_len = 0xe0000, offset = 0x20000 }

Note that the physical address of the first extent changed, indicating
CoW.  Also, all of the space allocated to the PREALLOC extent remains
allocated until the entire PREALLOC extent is overwritten (i.e. this
uses 128K of _additional_ space, the partial overwrite doesn't free the
first 128K of prealloc space).
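
To put numbers on that, here is a rough tally using the physical_len
values from the fiewalk output above.  It is a sketch of the accounting
only, with variable names invented for the example, and it ignores
metadata.

```shell
# Sizes in bytes, taken from the fiewalk output above.
prealloc_len=$((0x100000))   # physical_len of the original PREALLOC extent
cow_len=$((0x20000))         # physical_len of the new CoW data extent
file_len=$((0x100000))       # logical size of the file

# The whole prealloc extent stays pinned, and the CoW extent is added on top.
allocated=$((prealloc_len + cow_len))
overhead=$((allocated - file_len))

echo "allocated: $((allocated / 1024)) KiB for a $((file_len / 1024)) KiB file"
echo "extra:     $((overhead / 1024)) KiB"
```

So after the second write the 1 MiB file holds 1152 KiB of data space,
128 KiB more than fallocate reserved.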

> total    125 MiB, 162.06 MiB/s
> total    251 MiB, 162.92 MiB/s
> pwrite failed: No space left on device
> 
> Is your /dev/sdi an SSD? I noticed that mkfs.btrfs does NOT think that
> the disk image file is an SSD,
> despite the file residing on an SSD.

fallocate is only going to behave the way posix_fallocate specifies on
files with datacow turned off.
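
One common way to get a file with datacow turned off is the C attribute
from chattr(1), set before any data is written.  The helper below is a
hypothetical sketch, not something from the reproducer: the function name
and its arguments are made up, and it assumes chattr and fallocate(1) are
installed and that the path is on a btrfs mount.

```shell
# Hypothetical helper: create an empty NOCOW file, then preallocate it.
# Refuses to touch an existing path, because the C attribute is only
# reliable on btrfs when it is set while the file is still empty.
make_nocow_prealloc() {
    path=$1
    size=$2
    if [ -e "$path" ]; then
        echo "refusing: $path already exists" >&2
        return 1
    fi
    : > "$path" || return 1          # create the empty file
    chattr +C "$path" || return 1    # turn off data CoW for this file
    fallocate -l "$size" "$path"     # preallocate after the flag is set
}

# Example, on a btrfs mount:
#   make_nocow_prealloc ./blob 1G
```

Setting the attribute on a directory makes new files created inside it
inherit it, which avoids the ordering problem.  Snapshots and reflinks
still force CoW on such files, as in cases 1) and 2) quoted above, and
nodatacow files have no data checksums.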

> Thanks,
> Jakob
