On 2017-08-04 10:45, Goffredo Baroncelli wrote:
On 2017-08-03 19:23, Austin S. Hemmelgarn wrote:
On 2017-08-03 12:37, Goffredo Baroncelli wrote:
On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
[...]

Also, as I said below, _THIS WORKS ON ZFS_.  That immediately means that a CoW 
filesystem _does not_ need to behave the way BTRFS does.

It seems that ZFS on Linux doesn't support fallocate:

see https://github.com/zfsonlinux/zfs/issues/326

So I think that you are referring to posix_fallocate and ZFS on Solaris, 
which I can't test, so I can't comment.
Both Solaris and FreeBSD (I've got a FreeNAS system at work that I checked on).

For fun I checked the FreeBSD source and the ZFS source. To me it seems that ZFS on 
FreeBSD doesn't implement posix_fallocate() (VOP_ALLOCATE in FreeBSD jargon), 
but instead relies on the FreeBSD default implementation.

        http://fxr.watson.org/fxr/source/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L7212

Following the chain of function pointers

        http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=10#L110

it seems that the FreeBSD vop_allocate() is implemented by vop_stdallocate()

        http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=excerpts#L912

which simply calls read() and write() on each block of the range [offset, offset+len), 
which for a "conventional" filesystem ensures the block allocation. Of course 
this is an expensive solution.
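
To make that concrete, here is a rough userspace sketch of the same strategy (my 
illustration only, not the actual vop_stdallocate(), which operates on vnodes and 
handles locking and short I/O differently; the 4096-byte block size is an assumption):

/* Sketch: force block allocation by reading each block in the range
 * and writing it back, in the spirit of FreeBSD's vop_stdallocate()
 * emulation of VOP_ALLOCATE.  Illustration only, not the kernel code. */
#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>

#define BLKSZ 4096  /* assumption; the kernel uses the vnode's block size */

static int alloc_by_rewrite(int fd, off_t offset, off_t len)
{
        char buf[BLKSZ];
        off_t end = offset + len;

        for (off_t pos = offset; pos < end; pos += BLKSZ) {
                size_t n = (size_t)(end - pos < BLKSZ ? end - pos : BLKSZ);
                ssize_t r = pread(fd, buf, n, pos);

                if (r < 0)
                        return -1;
                if ((size_t)r < n)       /* past EOF: zero-fill the rest */
                        memset(buf + r, 0, n - r);
                if (pwrite(fd, buf, n, pos) != (ssize_t)n)
                        return -1;       /* ENOSPC surfaces here */
        }
        return 0;
}

Note that on a conventional filesystem rewriting every block really does pin down 
the allocation, but on a CoW filesystem the rewritten blocks presumably land in new 
locations, so the guarantee this emulation provides there is much weaker.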

So I think (but I am not familiar with FreeBSD) that ZFS doesn't implement a 
real posix_fallocate() but tries to simulate it. Of course this doesn't [...]

From a practical perspective though, posix_fallocate() doesn't matter, because almost everything uses the native fallocate call if at all possible. As you mention, FreeBSD is emulating it, but that 'emulation' provides behavior that is close enough to what is required that it doesn't matter. As a matter of perspective, posix_fallocate() is emulated on Linux too, see my reply below to your later comment about posix_fallocate() on BTRFS.

Internally ZFS also keeps _some_ space reserved so it doesn't get wedged like BTRFS does when near full, and they don't do the whole data versus metadata segregation crap, so from a practical perspective, what FreeBSD's ZFS implementation does is sufficient because of the internal structure and handling of writes in ZFS.

That said, I'm starting to wonder if just failing fallocate() calls that allocate 
space is actually the right thing to do here after all.  Aside from this, we 
don't reserve metadata space for checksums and similar things for the eventual 
writes (so it's possible to get -ENOSPC on a write to an fallocate'ed region 
anyway because of metadata exhaustion), and splitting extents can also cause it 
to fail, so it's perfectly possible for the fallocate assumption to not hold on 
BTRFS.
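
In other words, a caller can't treat a successful fallocate() on BTRFS as a hard 
guarantee; something like the following (a made-up illustration, with a hypothetical 
file name) can still hit ENOSPC on the write:

/* Illustration: a write into a successfully fallocate()d range can
 * still fail with ENOSPC on BTRFS, because checksum/extent metadata
 * for the eventual write is not reserved up front. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        int fd = open("testfile", O_CREAT | O_RDWR, 0644); /* hypothetical path */
        if (fd < 0)
                return 1;

        if (fallocate(fd, 0, 0, 1 << 20)) {     /* reserve 1 MiB of data space */
                perror("fallocate");
                return 1;
        }

        char buf[4096];
        memset(buf, 0xaa, sizeof(buf));
        if (pwrite(fd, buf, sizeof(buf), 0) < 0) {
                perror("pwrite");               /* can be ENOSPC anyway */
                return 1;
        }
        close(fd);
        return 0;
}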

posix_fallocate in BTRFS is not reliable for another reason. This syscall 
guarantees that a BG (block group) is allocated, but I think that the allocated 
BG is available to all processes, so a parallel process may exhaust all the 
available space before the first process uses it.
As mentioned above, posix_fallocate() is emulated in libc on Linux by calling the regular fallocate() if the FS supports it (which BTRFS does), or by writing out data like FreeBSD does in the kernel if the FS doesn't support fallocate(). IOW, posix_fallocate() has the exact same issues on BTRFS as Linux's fallocate() syscall does.
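
Roughly, the libc dispatch looks like this sketch (simplified; glibc's real fallback 
has more checks, and the race caveats are documented in posix_fallocate(3); the 
fixed 4096-byte block size is an assumption, real code uses the file's st_blksize):

/* Sketch of a libc-style posix_fallocate(): use the native syscall
 * when the filesystem supports it, otherwise fall back to touching
 * one byte per block.  Simplified; not glibc's actual code. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define BLKSZ 4096  /* assumption; real code queries st_blksize */

int my_posix_fallocate(int fd, off_t offset, off_t len)
{
        /* Fast path: the filesystem implements fallocate() (BTRFS does). */
        if (fallocate(fd, 0, offset, len) == 0)
                return 0;
        if (errno != EOPNOTSUPP)
                return errno;

        /* Slow path, same spirit as FreeBSD's vop_stdallocate(): force
         * allocation by writing a byte into every block of the range. */
        for (off_t pos = offset; pos < offset + len; pos += BLKSZ) {
                unsigned char c = 0;
                ssize_t r = pread(fd, &c, 1, pos);

                if (r < 0)
                        return errno;
                if (r == 1 && c != 0)   /* non-zero byte: block is in use */
                        continue;
                if (pwrite(fd, &c, 1, pos) != 1)
                        return errno;   /* note: racy vs. concurrent writers */
        }
        return 0;
}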

My opinion is that BTRFS is not reliable when space is exhausted, so it 
needs to operate with some amount of disk space kept free. The size of this 
free space should be O(2*size_of_biggest_write), and for an operation like 
fallocate this means O(2*length).
Again, this arises from how we handle writes. If we were to track blocks that have had fallocate called on them and only use those (for the first write at least) for writes to the file that had fallocate called on them (as well as breaking reflinks on them when fallocate is called), then we can get away with just using the size of the biggest write plus a little bit more space for _data_, but even then we need space for metadata (which we don't appear to track right now).

I think it is no coincidence that the fallocate implemented by ZFS on Linux 
only works in FALLOC_FL_PUNCH_HOLE mode.

https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zpl_file.c#L662
[...]
/*
 * The only flag combination which matches the behavior of zfs_space()
 * is FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE.  The FALLOC_FL_PUNCH_HOLE
 * flag was introduced in the 2.6.38 kernel.
 */
#if defined(HAVE_FILE_FALLOCATE) || defined(HAVE_INODE_FALLOCATE)
long
zpl_fallocate_common(struct inode *ip, int mode, loff_t offset, loff_t len)
{
        int error = -EOPNOTSUPP;

#if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
        cred_t *cr = CRED();
        flock64_t bf;
        loff_t olen;
        fstrans_cookie_t cookie;

        if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
                return (error);

[...]
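
For completeness, the only call that excerpt accepts from userspace is hole 
punching with the file size kept, e.g. (a minimal usage sketch, with a 
hypothetical file path):

/* Minimal usage sketch for the one mode zpl_fallocate_common() accepts:
 * deallocate a range without changing the file size. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("file-on-zfs", O_RDWR);   /* hypothetical path */
        if (fd < 0)
                return 1;

        /* Punch a 1 MiB hole at offset 0; any other mode combination
         * gets -EOPNOTSUPP from the excerpt above. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      0, 1 << 20))
                perror("fallocate");

        close(fd);
        return 0;
}

Which fits the pattern: punching a hole only releases space, the easy direction 
for a CoW filesystem, while reserving space up front is the hard one.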

