Re: Massive loss of disk space

Goffredo Baroncelli Fri, 04 Aug 2017 07:46:15 -0700

On 2017-08-03 19:23, Austin S. Hemmelgarn wrote:
> On 2017-08-03 12:37, Goffredo Baroncelli wrote:
>> On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
[...]


>>> Also, as I said below, _THIS WORKS ON ZFS_.  That immediately means that a 
>>> CoW filesystem _does not_ need to behave like BTRFS is.
>>
>> It seems that ZFS on linux doesn't support fallocate
>>
>> see https://github.com/zfsonlinux/zfs/issues/326
>>
>> So I think that you are referring to a posix_fallocate and ZFS on solaris, 
>> which I can't test so I can't comment.
> Both Solaris and FreeBSD (I've got a FreeNAS system at work I checked on).

For fun I checked the FreeBSD source and the ZFS source. To me it seems that
ZFS on FreeBSD doesn't implement posix_fallocate() (VOP_ALLOCATE in FreeBSD
jargon), but instead relies on the FreeBSD default one.

        
        http://fxr.watson.org/fxr/source/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L7212

Following the chain of function pointers

        http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=10#L110

it seems that the FreeBSD default vop_allocate() is implemented by
vop_stdallocate()

        http://fxr.watson.org/fxr/source/kern/vfs_default.c?im=excerpts#L912

which simply calls read() and write() on the range [offset...offset+len), which
for a "conventional" filesystem ensures the block allocation. Of course it is
an expensive solution.
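
To make the idea concrete, here is a rough userspace sketch of "allocate by
read and write". It is not the FreeBSD kernel code: the name
emulate_allocate() and the 4096-byte chunk size are my own assumptions, and it
only illustrates the technique of reading existing data back and writing it
(or zeros past EOF) over the requested range.

```c
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 4096 /* assumed chunk size, not taken from any real filesystem */

/*
 * Hypothetical userspace emulation of "allocate by read and write".
 * Returns 0 on success or an errno value on failure.
 */
int emulate_allocate(int fd, off_t offset, off_t len)
{
    char buf[CHUNK];
    struct stat st;

    if (offset < 0 || len <= 0)
        return EINVAL;
    if (fstat(fd, &st) != 0)
        return errno;

    off_t end = offset + len;
    off_t pos = offset;
    while (pos < end) {
        size_t want = (size_t)((end - pos < CHUNK) ? end - pos : CHUNK);

        /* Start from zeros: anything past EOF is written as zeros. */
        memset(buf, 0, want);

        if (pos < st.st_size) {
            /* Inside the file: read the data so rewriting changes nothing. */
            if ((off_t)want > st.st_size - pos)
                want = (size_t)(st.st_size - pos);
            ssize_t got = pread(fd, buf, want, pos);
            if (got < 0)
                return errno;
            if (got > 0)
                want = (size_t)got; /* write back only what was read */
        }

        ssize_t put = pwrite(fd, buf, want, pos);
        if (put < 0)
            return errno;
        if (put == 0)
            return EIO;
        pos += put;
    }
    return 0;
}
```

Every byte in the range goes through a write, which is why this is expensive.
On a CoW filesystem the rewritten blocks may also be placed somewhere new when
the real data arrives later.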

So I think (but I am not familiar with FreeBSD) that ZFS doesn't implement a
real posix_fallocate but tries to simulate it. Of course this doesn't


> 
> That said, I'm starting to wonder if just failing fallocate() calls to 
> allocate space is actually the right thing to do here after all.  Aside from 
> this, we don't reserve metadata space for checksums and similar things for 
> the eventual writes (so it's possible to get -ENOSPC on a write to an 
> fallocate'ed region anyway because of metadata exhaustion), and splitting 
> extents can also cause it to fail, so it's perfectly possible for the 
> fallocate assumption to not hold on BTRFS.  

posix_fallocate in BTRFS is not reliable for another reason. This call
guarantees that a block group (BG) is allocated, but I think that the
allocated BG is available to all processes, so a parallel process may exhaust
all the available space before the first process uses it.
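
From the application side, the practical consequence is that a successful
preallocation should not be treated as a promise. A minimal sketch, assuming a
hypothetical helper reserve_then_write() of my own: it preallocates with
posix_fallocate() and still handles an error such as ENOSPC on the write that
follows.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: preallocate, then write, treating ENOSPC on the
 * write as a possible outcome rather than an impossible one.
 * Returns 0 on success or an errno value on failure.
 */
int reserve_then_write(int fd, const void *data, size_t len, off_t offset)
{
    /* posix_fallocate() returns the error number directly, not via errno. */
    int err = posix_fallocate(fd, offset, (off_t)len);
    if (err != 0)
        return err;

    const char *p = data;
    size_t done = 0;
    while (done < len) {
        ssize_t n = pwrite(fd, p + done, len - done, offset + (off_t)done);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return errno; /* may be ENOSPC despite the preallocation */
        }
        if (n == 0)
            return EIO;
        done += (size_t)n;
    }
    return 0;
}
```

Whether the write can really fail after a successful preallocation depends on
the filesystem. The point of the sketch is only that the caller checks.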

My opinion is that BTRFS is not reliable when the space is exhausted, so it
needs to work with some amount of disk space free. The size of this free
space should be O(2*size_of_biggest_write), and for operations like fallocate
this means O(2*length).
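
As an illustration of this rule of thumb, a program could check the headroom
before calling fallocate. This is only a sketch: the helper names
enough_headroom() and fs_has_headroom() are mine, the factor 2 is just the
estimate above, and the free space reported by statvfs() should be taken as
approximate.

```c
#include <stdint.h>
#include <sys/statvfs.h>

/*
 * Pure check: is free_bytes >= 2 * len?
 * Written as a division so that 2 * len cannot overflow.
 */
int enough_headroom(uint64_t free_bytes, uint64_t len)
{
    return len <= free_bytes / 2;
}

/*
 * Hypothetical wrapper: returns 1 if the filesystem holding `path` reports
 * at least 2 * len bytes available to unprivileged users, 0 if not, and
 * -1 if statvfs() fails.
 */
int fs_has_headroom(const char *path, uint64_t len)
{
    struct statvfs vfs;

    if (statvfs(path, &vfs) != 0)
        return -1;

    uint64_t free_bytes = (uint64_t)vfs.f_bavail * (uint64_t)vfs.f_frsize;
    return enough_headroom(free_bytes, len);
}
```

For example, fs_has_headroom("/mnt/data", length) could be called before
fallocate(fd, 0, offset, length), where "/mnt/data" stands for whatever mount
point holds the file. The check is racy for the same reason as above: another
process can consume the space between the check and the allocation.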

I think it is no coincidence that the fallocate implemented by ZFSONLINUX
works only with the FALLOC_FL_PUNCH_HOLE mode (combined with
FALLOC_FL_KEEP_SIZE).

https://github.com/zfsonlinux/zfs/blob/master/module/zfs/zpl_file.c#L662
[...]
/*
 * The only flag combination which matches the behavior of zfs_space()
 * is FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE.  The FALLOC_FL_PUNCH_HOLE
 * flag was introduced in the 2.6.38 kernel.
 */
#if defined(HAVE_FILE_FALLOCATE) || defined(HAVE_INODE_FALLOCATE)
long
zpl_fallocate_common(struct inode *ip, int mode, loff_t offset, loff_t len)
{
        int error = -EOPNOTSUPP;

#if defined(FALLOC_FL_PUNCH_HOLE) && defined(FALLOC_FL_KEEP_SIZE)
        cred_t *cr = CRED();
        flock64_t bf;
        loff_t olen;
        fstrans_cookie_t cookie;

        if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
                return (error);

[...]

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
