On 2019/5/3 下午5:21, Filipe Manana wrote:
> On Fri, May 3, 2019 at 2:46 AM Qu Wenruo <w...@suse.com> wrote:
> 
> What a great subject. The "reflink:" part is unnecessary, since the
> rest of the subject already mentions it, that makes it a bit shorter.
> 
>>
>> [BUG]
>> The following command can lead to unexpected data COW:
>>
>>   #!/bin/bash
>>
>>   dev=/dev/test/test
>>   mnt=/mnt/btrfs
>>
>>   mkfs.btrfs -f $dev -b 1G > /dev/null
>>   mount $dev $mnt -o nospace_cache
>>
>>   xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1
>>   xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
>>   umount $dev
>>
>> The result extent will be
>>
>>         item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
>>                 generation 6 type 2 (prealloc)
>>                 prealloc data disk byte 13631488 nr 28672
>>         item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53
>>                 generation 6 type 1 (regular)
>>                 extent data disk byte 13660160 nr 12288 <<< COW
>>         item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53
>>                 generation 6 type 2 (prealloc)
>>                 prealloc data disk byte 13631488 nr 28672
>>
>> Currently we always reserve space even for NOCOW buffered write, thus
> 
> I would add 'data' between 'reserve' and 'space', to be clear.
> 
>> under most case it shouldn't cause anything wrong even we fall back to
>> COW.
> 
> even we ... -> even if we fallback to COW when running delalloc /
> starting writeback.
> 
>>
>> However when we're out of data space, we fall back to skip data space if
>> we can do NOCOW write.
> 
> we fall back to skip data space ... -> we fallback to NOCOW write
> without reserving data space.
> 
>>
>> If such behavior happens under that case, we could hit the following
>> problems:
> 
>> - data space bytes_may_use underflow
>>   This will cause kernel warning.
> 
> Ok.
> 
>>
>> - ENOSPC at delalloc time
> 
> at delalloc time - that is an ambiguous term you use through the change log.

In fact, I have a lot of uncertain terminology through kernel.

Things like flush get referred multiple times in different context (e.g.
filemap flush, flushoncommit, super block flush).

If we have a terminology list, we can't be more happy to follow.

> You mean when running/starting delalloc, which happens when starting 
> writeback,
> but that could be confused with creating delalloc, which happens
> during the buffered write path.

Another confusion due to different terminology.

My understanding of the write path is:
buffered write -> delalloc (start delalloc) -> ordered extent -> finish
ordered io.

Thus I missed the delalloc creating part.

> 
> So I would always replace 'dealloc time' with 'when running delalloc'
> (or when starting writeback).

I will change use running delalloc, with extra comment reference to the
function name (btrfs_run_delalloc_range()).

> 
>>   This will lead to transaction abort and fs forced to RO.
> 
> Where does that happen exactly?
My bad, I got confused with metadata writeback path.

For data writeback, it should only cause sync related failure.

> I don't recall starting transactions when running dealloc, and failed
> to see where after a quick glance to cow_file_range()
> and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means
> when starting writeback.
> 
>>
>> [CAUSE]
>> This is due to the fact that btrfs can only do extent level share check.
>>
>> Btrfs can only tell if an extent is shared, no matter if only part of the
>> extent is shared or not.
>>
>> So for above script we have:
>> - fallocate
>> - buffered write
>>   If we don't have enough data space, we fall back to NOCOW check.
>>   At this timming, the extent is not shared, we can skip data
>>   reservation.
> 
> But in the above example we don't fall to nocow mode when doing the
> buffered write, as there's plenty of data space available (1Gb -
> 24Kb).
> You need to update the example.
I have to admit that the core part is mostly based on the worst case
*assumption*.

I'll try to make the case convincing by making it fail directly.

> 
> 
>> - reflink
>>   Now part of the large preallocated extent is shared.
>> - delalloc kicks in
> 
> writeback kicks in
> 
>>   For the NOCOW range, as the preallocated extent is shared, we need
>>   to fall back to COW.
>>
>> [WORKAROUND]
>> The workaround is to ensure any buffered write in the related extents
>> (not the reflink source range) get flushed before reflink.
> 
> not the reflink source range -> not just the reflink source range
> 
>>
>> However it's pretty expensive to do a comprehensive check.
>> In the reproducer, the reflink source is just a part of a larger
> 
> Again, the reproducer needs to be fixed (yes, I tested it even if it's
> clear by looking at it that it doesn't trigger the nocow case).
> 
>> preallocated extent, we need to flush all buffered write of that extent
>> before reflink.
>> Such backward search can be complex and we may not get much benefit from
>> it.
>>
>> So this patch will just try to flush the whole inode before reflink.
> 
> 
>>
>> Signed-off-by: Qu Wenruo <w...@suse.com>
>> ---
>> Reason for RFC:
>> Flushing an inode just because it's a reflink source is definitely
>> overkilling, but I don't have any better way to handle it.
>>
>> Any comment on this is welcomed.
>> ---
>>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
>>  1 file changed, 22 insertions(+)
>>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 7755b503b348..8caa0edb6fbf 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file 
>> *file, struct file *file_src,
>>                         return ret;
>>         }
>>
>> +       /*
>> +        * Workaround to make sure NOCOW buffered write reach disk as NOCOW.
>> +        *
>> +        * Due to the limit of btrfs extent tree design, we can only have
>> +        * extent level share view. Any part of an extent is shared then the
> 
> Any -> If any
> 
>> +        * whole extent is shared and any write into that extent needs to 
>> fall
> 
> is -> is considered
> 
>> +        * back to COW.
> 
> I would add, something like:
> 
> That is, btrfs' back references do not have a block level granularity,
> they work at the whole extent level.
> 
>> +        *
>> +        * NOCOW buffered write without data space reserved could to lead to
>> +        * either data space bytes_may_use underflow (kernel warning) or 
>> ENOSPC
>> +        * at delalloc time (transaction abort).
> 
> I would omit the warning and transaction abort parts, that can change
> any time. And we have that information in the changelog, so it's not
> lost.
> 
>> +        *
>> +        * Here we take a shortcut by flush the whole inode. We could do 
>> better
>> +        * by finding all extents in that range and flush the space referring
>> +        * all those extents.
>> +        * But that's too complex for such corner case.
>> +        */
>> +       filemap_flush(src->i_mapping);
>> +       if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
>> +                    &BTRFS_I(src)->runtime_flags))
>> +               filemap_flush(src->i_mapping);
> 
> So a few comments here:
> 
> - why just in the clone part? The dedupe side has the same problem, doesn't 
> it?

Right.

> 
> - I would move such flushing to btrfs_remap_file_range_prep - this is
> where we do the source and target range flush and wait.
> 
> Can you turn the reproducer into an fstests case?

Sure.

Thanks for the info and all the comment,
Qu

> 
> Thanks.
> 
>> +
>>         /*
>>          * Lock destination range to serialize with concurrent readpages() 
>> and
>>          * source range to serialize with relocation.
>> --
>> 2.21.0
>>
> 
> 

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to