On 2019/5/9 下午10:49, David Sterba wrote:
> On Wed, May 08, 2019 at 06:49:58PM +0800, Qu Wenruo wrote:
>> [BUG]
>> The following script can cause unexpected fsync failure:
>>
>>   #!/bin/bash
>>
>>   dev=/dev/test/test
>>   mnt=/mnt/btrfs
>>
>>   mkfs.btrfs -f $dev -b 512M > /dev/null
>>   mount $dev $mnt -o nospace_cache
>>
>>   # Prealloc one extent
>>   xfs_io -f -c "falloc 8k 64m" $mnt/file1
>>   # Fill the remaining data space
>>   xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
>>   sync
>>
>>   # Write into the prealloc extent
>>   xfs_io -c "pwrite 1m 16m" $mnt/file1
>>
>>   # Reflink then fsync, fsync would fail due to ENOSPC
>>   xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
>>   umount $dev
>>
>> The fsync fails with ENOSPC, and the last page of the buffered write is
>> lost.
>>
>> [CAUSE]
>> This is caused by:
>> - Btrfs' back reference only has extent level granularity
>>   So write into shared extent must be CoWed even only part of the extent
>>   is shared.
>>
>> So for above script we have:
>> - fallocate
>>   Create a preallocated extent where we can do NOCOW write.
>>
>> - fill all the remaining data and unallocated space
>>
>> - buffered write into preallocated space
>>   As we have not enough space available for data and the extent is not
>>   shared (yet) we fall into NOCOW mode.
>>
>> - reflink
>>   Now part of the large preallocated extent is shared, later write
>>   into that extent must be CoWed.
>>
>> - fsync triggers writeback
>>   But now the extent is shared and therefore we must fallback into COW
>>   mode, which fails with ENOSPC since there's not enough space to
>>   allocate data extents.
>>
>> [WORKAROUND]
>> The workaround is to ensure any buffered write in the related extents
>> (not just the reflink source range) get flushed before reflink/dedupe,
>> so that NOCOW writes succeed that happened before reflinking succeed.
>>
>> The workaround is expensive
> 
> Can you please quantify that, how big the performance drop is going to
> be?

Depends on how many dirty pages there are at the timing of reflink/dedupe.

If there are a lot, then it would be a delay for reflink/dedupe.

> 
> If the fsync comes soon after reflink, then it's effectively no change.
> In case the buffered writes happen on a different range than reflink and
> fsync comes later, the buffered writes will stall reflink, right?

Fsync doesn't make much difference, it mostly depends on how many dirty
pages are.

Thus the most impacted use case is concurrent buffered write with
reflink/dedupe.

> 
> If there are other similar corner cases we'd better know them in advance
> and estimate the impact, that'll be something to look for when we get
> complaints that reflink is suddenly slow.
> 
>> NOCOW range, but that needs extra accounting for NOCOW range.
>> For now, fix the possible data loss first.
> 
> filemap_flush says
> 
>  437 /**
>  438  * filemap_flush - mostly a non-blocking flush
>  439  * @mapping:    target address_space
>  440  *
>  441  * This is a mostly non-blocking flush.  Not suitable for data-integrity
>  442  * purposes - I/O may not be started against all dirty pages.
>  443  *
>  444  * Return: %0 on success, negative error code otherwise.
>  445  */
> 
> so how does this work together with the statement about preventing data
> loss?

The data loss is caused by the fact that we can start buffered write
without reserving data space, but after reflink/dedupe we have to do CoW.
Without enough space, CoW will fail due to ENOSPC.

The fix here is, we ensure all dirties pages start their writeback
(start btrfs_run_delalloc_range()) before reflink.

At btrfs_run_delalloc_range() we determine whether a range goes through
NOCOW or COW, and submit ordered extent to do real write back/csum
calculation/etc.

As long as the whole inode goes through btrfs_run_delalloc_range(), any
NOCOW write will go NOCOW on-disk.
We don't need to wait for the ordered extent to finish, just ensure all
pages goes through delalloc is enough.
Waiting for ordered extent will cause even more latency for reflink.

Thus the filemap_flush() is enough, as the point is to ensure delalloc
is started before reflink.

Thanks,
Qu

Attachment: signature.asc
Description: OpenPGP digital signature

Reply via email to