On Tue, Apr 23, 2019 at 09:21:09PM +0200, Jakob Unterwurzacher wrote:
> > Trying the reproducer, at least on a 5.0 kernel, does never fail on a
> > pwrite for me, but always on fallocate:
> [...]
> > So either that was tested on a rather old kernel or:
> >
> > 1) we had snapshotting happening between a fallocate and a pwrite (or
> >    at the same time as the pwrite)
> > 2) before the pwrite (or during) the unwritten/prealloc extent was
> >    reflinked (cp --reflink, clone or dedupe ioctls)
>
> I am at Linux 5.0.4-200.fc29.x86_64, the user in the github ticket is
> at Linux 5.0.7-arch1-1-ARCH, so pretty recent.
> There should be no snapshot or reflink or really any other activity on
> the test filesystem.
>
> Maybe the difference is that I am testing on a file and you on a raw
> block device?
> This is how things look at 4GB size:
>
> $ dd if=/dev/zero of=img bs=1M count=5000
> $ mkfs.btrfs -f -b $((4 * 1024 * 1024 * 1024)) img
> $ mkdir mnt
> $ sudo mount img mnt
> $ sudo chmod 777 mnt
> $ cd mnt
> $ ../fallocate_write/fallocate_write
> reading from /dev/urandom
> writing to ./blob.qEaSZl
> writing blocks of 132096 bytes each

132096 is 129 * 1024, which is not a multiple of 4K.  There will be a
CoW operation in cases where one 4K block from each pwrite is written
twice in separate transactions (or with fsync between).

Also, fallocate only works _once_ on btrfs.  After the first write,
prealloc extents are replaced with ordinary CoW extent (ref)s, and the
fallocate no-ENOSPC guarantee is gone:

	# fallocate -l 1m foo
	# sync
	# fiewalk foo
	File: foo
	Extent { begin = 0x0, end = 0x100000, physical = 0x4aedc01000, flags = Extent::PREALLOC|FIEMAP_EXTENT_LAST, physical_len = 0x100000, logical_len = 0x100000 }
	# head -c 128k /dev/urandom | dd conv=notrunc of=foo
	256+0 records in
	256+0 records out
	131072 bytes (131 kB, 128 KiB) copied, 0.00201152 s, 65.2 MB/s
	# sync
	# fiewalk foo
	File: foo
	Extent { begin = 0x0, end = 0x20000, physical = 0x4aedc01000, flags = 0, physical_len = 0x100000, logical_len = 0x20000 }
	Extent { begin = 0x20000, end = 0x100000, physical = 0x4aedc21000, flags = Extent::PREALLOC|FIEMAP_EXTENT_LAST, physical_len = 0x100000, logical_len = 0xe0000, offset = 0x20000 }

Here we see the first block is overwriting the same physical address,
but it loses the PREALLOC attribute.  A second write will trigger CoW,
and a new data extent will be allocated:

	# head -c 128k /dev/urandom | dd conv=notrunc of=foo
	256+0 records in
	256+0 records out
	131072 bytes (131 kB, 128 KiB) copied, 0.00187461 s, 69.9 MB/s
	# sync
	# fiewalk foo
	File: foo
	Extent { begin = 0x0, end = 0x20000, physical = 0x4ae5f00000, flags = 0, physical_len = 0x20000, logical_len = 0x20000 }
	Extent { begin = 0x20000, end = 0x100000, physical = 0x4aedc21000, flags = Extent::PREALLOC|FIEMAP_EXTENT_LAST, physical_len = 0x100000, logical_len = 0xe0000, offset = 0x20000 }

Note that the physical address of the first extent changed, indicating
CoW.  Also, all of the space allocated to the PREALLOC extent remains
allocated until the entire PREALLOC extent is overwritten (i.e. this
uses 128K of _additional_ space, the partial overwrite doesn't free the
first 128K of prealloc space).

> total 125 MiB, 162.06 MiB/s
> total 251 MiB, 162.92 MiB/s
> pwrite failed: No space left on device
>
> Is your /dev/sdi an SSD? I noticed that mkfs.btrfs does NOT think that
> the disk image file is an SSD,
> despite the file residing on an SSD.

fallocate is only going to behave the way posix_fallocate specifies on
files with datacow turned off.

> Thanks,
> Jakob
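
The alignment arithmetic behind the 132096-byte observation can be
checked with a short shell sketch.  It assumes a 4096-byte filesystem
block size and that the reproducer issues its pwrites back to back
starting at offset 0; neither is stated in the reproducer output, and
the variable names are illustrative only:

```shell
# Sketch only: assumes 4 KiB blocks and contiguous writes from offset 0.
write_size=132096   # 129 * 1024, from the reproducer output
block_size=4096     # assumed filesystem block size

# Bytes left over after the last full 4 KiB block of one write
remainder=$((write_size % block_size))
echo "remainder=$remainder"

# Of the first 4 write boundaries, count those that land mid-block,
# i.e. where two consecutive pwrites touch the same 4 KiB block
shared=0
for i in 1 2 3 4; do
    offset=$((i * write_size))
    if [ $((offset % block_size)) -ne 0 ]; then
        shared=$((shared + 1))
    fi
done
echo "shared_blocks=$shared"
```

Under those assumptions each write ends 1024 bytes into a block, so
three of every four boundaries fall inside a block that the next pwrite
also touches.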