On 2017-08-03 12:37, Goffredo Baroncelli wrote:
On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
On 2017-08-02 17:05, Goffredo Baroncelli wrote:
On 2017-08-02 21:10, Austin S. Hemmelgarn wrote:
On 2017-08-02 13:52, Goffredo Baroncelli wrote:
Hi,
[...]
consider the following scenario:
a) create a 2GB file
b) fallocate -o 1GB -l 2GB
c) write from 1GB to 3GB
after b), the expectation is that c) always succeeds [1]: i.e. there is enough
space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the
already allocated space, because there could be a small time window where both
the old and the new data exist on the disk.
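The scenario above can be sketched as follows (scaled down from GiB to MiB so it runs quickly; the sizes, the temp-file path, and the use of posix_fallocate() are illustrative choices, not from the original mail):

```python
import os
import tempfile

MIB = 1024 * 1024

with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
    # a) create a 2 MiB file
    f.write(b"\x00" * (2 * MIB))
    f.flush()
    # b) fallocate -o 1MiB -l 2MiB: reserve the range [1 MiB, 3 MiB)
    os.posix_fallocate(f.fileno(), 1 * MIB, 2 * MIB)
    # c) write from 1 MiB to 3 MiB; after b) the expectation is that
    #    this cannot fail with ENOSPC, because the space is reserved
    f.seek(1 * MIB)
    f.write(b"\xff" * (2 * MIB))

final_size = os.path.getsize(path)  # 3 MiB after step c)
os.unlink(path)
```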
There is also an expectation, based on pretty much every other FS in existence,
that calling fallocate() on a range that is already in use is a (possibly
expensive) no-op, and by extension that using fallocate() with an offset of 0,
like an ftruncate() call, will succeed as long as the new size fits.
The man page of fallocate doesn't guarantee that.
Unfortunately, in a COW filesystem the assumption that an already allocated
area may simply be overwritten does not hold.
Let me say it in other words: as a general rule, if you want to _write_
something on a COW filesystem, you need space. It doesn't matter whether you
are *overwriting* existing data or *appending* to a file.
Yes, you need space, but you don't need _all_ the space. For a file that
already has data in it, you only _need_ as much space as the largest chunk of
data that can be written at once at a low level, because the moment that first
write finishes, the space that was used in the file for that region is freed,
and the next write can go there. Put a bit differently, you only need to
allocate what isn't allocated in the region, and then a bit more to handle the
initial write to the file.
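That incremental argument can be sketched with a hypothetical helper (the chunk size and the per-chunk fsync() are illustrative choices of mine, and fsync() alone may not immediately make the superseded extents reclaimable on every filesystem):

```python
import os

def overwrite_in_chunks(path, data, chunk_size=1024 * 1024):
    """Overwrite an existing file one chunk at a time.  The idea from
    the mail: on a CoW filesystem, once a chunk's write is durable the
    old extent for that region can be freed, so the peak *extra* space
    needed stays near one chunk rather than the whole file size."""
    with open(path, "r+b") as f:
        for off in range(0, len(data), chunk_size):
            f.seek(off)
            f.write(data[off:off + chunk_size])
            f.flush()
            os.fsync(f.fileno())  # make this chunk durable before moving on
```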
Also, as I said below, _THIS WORKS ON ZFS_. That immediately means that a CoW
filesystem _does not_ need to behave the way BTRFS does.
It seems that ZFS on Linux doesn't support fallocate;
see https://github.com/zfsonlinux/zfs/issues/326
So I think that you are referring to posix_fallocate() and ZFS on Solaris,
which I can't test, so I can't comment.
Both Solaris and FreeBSD (I've got a FreeNAS system at work that I checked on).
That said, I'm starting to wonder if just failing fallocate() calls to
allocate space is actually the right thing to do here after all. Aside
from this, we don't reserve metadata space for checksums and similar
things for the eventual writes (so it's possible to get -ENOSPC on a
write to an fallocate'ed region anyway because of metadata exhaustion),
and splitting extents can also cause it to fail, so it's perfectly
possible for the fallocate assumption not to hold on BTRFS. The irony
of this is that if you're in a situation where you actually need to
reserve space, you're more likely to fail (because if you actually
_need_ to reserve the space, your filesystem may already be mostly full,
and therefore any of the above issues may occur).
On the specific note of splitting extents, the following will probably
fail on BTRFS as well when done with a large enough FS (the turn-over
point ends up being the point at which 256MiB isn't enough space to
account for all the extents), but will succeed on ZFS:
1. Create the filesystem and mount it. On BTRFS, make sure autodefrag is
off (this makes it fail more reliably, but is not essential for it to fail).
2. Use fallocate to allocate as large a file as possible (in the BTRFS
case, try for the size of the filesystem minus 544MiB: 512MiB for the
metadata chunk, 32MiB for the system chunk).
3. Write half the file using 1MB blocks, skipping 1MB of space
between each block (so every other 1MB of space is actually written to).
4. Write the other half of the file by filling in the holes.
The net effect of this is to split the single large fallocate'd extent
into a very large number of 1MB extents, which in turn eats up lots of
metadata space and will eventually exhaust it. While this specific
exercise requires a large filesystem, more general real-world situations
exist where this can happen (and I have had it happen before).
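A scaled-down sketch of that write pattern (block size and count shrunk from the 1MB blocks above so it runs anywhere; actually exhausting metadata requires a near-full BTRFS filesystem, which this sketch does not reproduce):

```python
import os
import tempfile

BLOCK = 64 * 1024   # scaled down from the 1MB blocks in the exercise
NBLOCKS = 32        # scaled down from "as large a file as possible"

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    # Step 2: fallocate one large extent up front.
    os.posix_fallocate(f.fileno(), 0, NBLOCKS * BLOCK)
    # Step 3: write every other block, leaving one-block gaps.
    for i in range(0, NBLOCKS, 2):
        f.seek(i * BLOCK)
        f.write(b"\xaa" * BLOCK)
    # Step 4: fill in the gaps; on BTRFS this splits the single
    # fallocate'd extent into many small extents, eating metadata.
    for i in range(1, NBLOCKS, 2):
        f.seek(i * BLOCK)
        f.write(b"\xbb" * BLOCK)

final_size = os.path.getsize(path)  # NBLOCKS * BLOCK, fully written
os.unlink(path)
```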
[...]
In terms of a COW filesystem, you need the space of a) + the space of b)
No, that is only required if the entire file needs to be written atomically.
There is some maximal size atomic write that BTRFS can perform as a single
operation at a low level (I'm not sure if this is equal to the block size, or
larger, but it doesn't matter much, either way, I'm talking the largest chunk
of data it will write to a disk in a single operation before updating metadata
to point to that new data).
To the best of my knowledge there is only a time limit: IIRC, every 30 seconds
a transaction is closed. If you are able to fill the filesystem in this time
window, you are in trouble.
Even with that, it's still possible to implement the method I outlined
by defining such a limit and forcing a transaction commit when that
limit is hit. I'm also not entirely convinced that the transaction is
the limiting factor here (I was under the impression that the
transaction just updates the top level metadata to point to the new tree
of metadata).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html