On 2019/5/4 上午5:56, Zygo Blaxell wrote:
> On Fri, May 03, 2019 at 09:08:52AM +0800, Qu Wenruo wrote:
>> [BUG]
>> The following command can lead to unexpected data COW:
>>
>>   #!/bin/bash
>>
>>   dev=/dev/test/test
>>   mnt=/mnt/btrfs
>>
>>   mkfs.btrfs -f $dev -b 1G > /dev/null
>>   mount $dev $mnt -o nospace_cache
>>
>>   xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1
>>   xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
>>   umount $dev
>>
>> The result extent will be
>>
>>      item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
>>              generation 6 type 2 (prealloc)
>>              prealloc data disk byte 13631488 nr 28672
>>      item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53
>>              generation 6 type 1 (regular)
>>              extent data disk byte 13660160 nr 12288 <<< COW
>>      item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53
>>              generation 6 type 2 (prealloc)
>>              prealloc data disk byte 13631488 nr 28672
>>
>> Currently we always reserve space even for NOCOW buffered write, thus
>> under most case it shouldn't cause anything wrong even we fall back to
>> COW.
>>
>> However when we're out of data space, we fall back to skip data space if
>> we can do NOCOW write.
>>
>> If such behavior happens under that case, we could hit the following
>> problems:
>> - data space bytes_may_use underflow
>>   This will cause kernel warning.
>>
>> - ENOSPC at delalloc time
>>   This will lead to transaction abort and fs forced to RO.
>>
>> [CAUSE]
>> This is due to the fact that btrfs can only do extent level share check.
>>
>> Btrfs can only tell if an extent is shared, no matter if only part of the
>> extent is shared or not.
>>
>> So for above script we have:
>> - fallocate
>> - buffered write
>>   If we don't have enough data space, we fall back to NOCOW check.
>>   At this timming, the extent is not shared, we can skip data
>>   reservation.
>> - reflink
>>   Now part of the large preallocated extent is shared.
>> - delalloc kicks in
>>   For the NOCOW range, as the preallocated extent is shared, we need
>>   to fall back to COW.
>>
>> [WORKAROUND]
>> The workaround is to ensure any buffered write in the related extents
>> (not the reflink source range) get flushed before reflink.
>>
>> However it's pretty expensive to do a comprehensive check.
>> In the reproducer, the reflink source is just a part of a larger
>> preallocated extent, we need to flush all buffered write of that extent
>> before reflink.
>> Such backward search can be complex and we may not get much benefit from
>> it.
>>
>> So this patch will just try to flush the whole inode before reflink.
> 
> Does that mean that if a large file is being written and deduped
> simultaneously, that the dedupes would now trigger flushes over the
> entire file?  That seems like it could be slow.

Yes, also my reason for RFC.

But it shouldn't be that heavy, as after the first dedupe/reflink, most
IO should be flushed back, later dedupe has much less work to do.


> 
> e.g. if the file is a big VM image file and it is used src and for dedupe
> (which is quite common in VM image files), we'd be hammering the disk
> with writes similar to hitting it with fsync() in a tight loop?

The original behavior also flush the target and source range, so we're
not completely creating some new overhead.

Thanks,
Qu

> 
>> Signed-off-by: Qu Wenruo <w...@suse.com>
>> ---
>> Reason for RFC:
>> Flushing an inode just because it's a reflink source is definitely
>> overkilling, but I don't have any better way to handle it.
>>
>> Any comment on this is welcomed.
>> ---
>>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>>  1 file changed, 22 insertions(+)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 7755b503b348..8caa0edb6fbf 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file 
>> *file, struct file *file_src,
>>                      return ret;
>>      }
>>  
>> +    /*
>> +     * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
>> +     *
>> +     * Due to the limit of btrfs extent tree design, we can only have
>> +     * extent level share view. Any part of an extent is shared then the
>> +     * whole extent is shared and any write into that extent needs to fall
>> +     * back to COW.
>> +     *
>> +     * NOCOW buffered write without data space reserved could to lead to
>> +     * either data space bytes_may_use underflow (kernel warning) or ENOSPC
>> +     * at delalloc time (transaction abort).
>> +     *
>> +     * Here we take a shortcut by flush the whole inode. We could do better
>> +     * by finding all extents in that range and flush the space referring
>> +     * all those extents.
>> +     * But that's too complex for such corner case.
>> +     */
>> +    filemap_flush(src->i_mapping);
>> +    if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>> +                 &BTRFS_I(src)->runtime_flags))
>> +            filemap_flush(src->i_mapping);
>> +
>>      /*
>>       * Lock destination range to serialize with concurrent readpages() and
>>       * source range to serialize with relocation.
>> -- 
>> 2.21.0
>>

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to