On 2018-07-18 09:07, Chris Murphy wrote:
> On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn
> <ahferro...@gmail.com> wrote:
>
>> If you're doing a training presentation, it may be worth mentioning that
>> preallocation with fallocate() does not behave the same on BTRFS as it does
>> on other filesystems.  For example, the following sequence of commands:
>>
>>      fallocate -l X ./tmp
>>      dd if=/dev/zero of=./tmp bs=1 count=X
>>
>> Will always work on ext4, XFS, and most other filesystems, for any value of
>> X between zero and just below the total amount of free space on the
>> filesystem.  On BTRFS though, it will reliably fail with ENOSPC for values
>> of X that are greater than _half_ of the total amount of free space on the
>> filesystem (actually, greater than just short of half).  In essence,
>> preallocating space does not prevent COW semantics for the first write
>> unless the file is marked NOCOW.
>
> Is this a bug, or is it suboptimal behavior, or is it intentional?

It's been discussed before, though I can't find the email thread right now. Strictly speaking, this is _technically_ not incorrect behavior, since the documentation for fallocate doesn't say that subsequent writes can't fail due to lack of space. I personally consider it a bug though, because it breaks from existing behavior in a way that is avoidable and defies user expectations.

There are two issues here:

1. Regions preallocated with fallocate still do COW on the first write to any given block in that region. This can be handled by either treating the first write to each block as NOCOW, or by allocating a bit of extra space and doing a rotating approach like this for writes:
    - Write goes into the extra space.
    - Once the write is done, convert the region covered by the write
      into a new block of extra space.
    - When the final block of the preallocated region is written,
      deallocate the extra space.
2. Preallocation does not completely account for the metadata space that will be needed to store the data there. Fixing this separately may not be necessary if the first issue is addressed properly.
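
For anyone who needs the preallocation to actually hold on BTRFS today, the NOCOW route mentioned above looks roughly like the following. This is only a sketch: prealloc_nocow is a made-up helper name, it assumes the fallocate and chattr utilities are installed, and the +C flag should be set while the file is still empty (on filesystems that don't support the flag, chattr will typically just fail, which the sketch tolerates with a warning). Note the conv=notrunc: without it, dd truncates the file, and with it the preallocated extents, before writing.

```shell
#!/bin/sh
# Sketch only: prealloc_nocow is a hypothetical helper, not part of any tool.
# Usage: prealloc_nocow FILE SIZE_IN_BYTES   (size should be a multiple of 4096)
prealloc_nocow() {
    file=$1
    size=$2

    # Create the file empty; +C (No_COW) should be set before any data exists.
    : > "$file" || return 1

    # Mark it NOCOW. This is expected to fail on filesystems without the flag,
    # so only warn instead of aborting.
    chattr +C "$file" 2>/dev/null ||
        echo "warning: could not set NOCOW on $file" >&2

    # Reserve the space up front.
    fallocate -l "$size" "$file" || return 1

    # Overwrite the reserved range in place. conv=notrunc stops dd from
    # truncating the file (and dropping the preallocated extents) first.
    dd if=/dev/zero of="$file" bs=4096 count=$(( size / 4096 )) \
        conv=notrunc 2>/dev/null
}
```

Something like `prealloc_nocow ./tmp 1073741824` would then preallocate and fill a 1 GiB file.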

> And then I wonder what happens with XFS COW:
>
>       fallocate -l X ./tmp
>       cp --reflink ./tmp ./tmp2
>       dd if=/dev/zero of=./tmp bs=1 count=X

I'm not sure. In this particular case, this will fail on BTRFS for any X larger than just short of one third of the total free space. I would expect it to fail for any X larger than just short of half instead.
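
To put rough numbers on those BTRFS limits, here's a back-of-the-envelope helper. It just encodes the "just short of half" and "just short of one third" figures from this thread as integer division. The function names are made up, the argument is the free space in bytes before the fallocate call, and the real limits sit slightly below what it prints.

```shell
#!/bin/sh
# Hypothetical helpers; the divisors come from the figures in this thread,
# not from measuring or reading the BTRFS allocator.

# Approximate largest X (bytes) for: fallocate -l X f; dd ... of=f count=X
max_x_plain() {
    echo $(( $1 / 2 ))
}

# Approximate largest X (bytes) when the preallocated file is reflinked
# before the write.
max_x_reflink() {
    echo $(( $1 / 3 ))
}
```

With 90 GiB free (96636764160 bytes), max_x_plain prints 48318382080 (45 GiB) and max_x_reflink prints 32212254720 (30 GiB).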

ZFS gets around this by not supporting fallocate. Well, kind of: if you're using glibc and call posix_fallocate, that _will_ work, but it will take forever because it works by writing out each block of the space being allocated, which, ironically, means it potentially still suffers from the same issue we have.