Re: [PATCH v3] btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to CoW without data reservation

Qu Wenruo Fri, 31 May 2019 03:57:15 -0700

Gentle ping?

I didn't see this patch included in misc-next.
Anything went wrong?


Thanks,
Qu

On 2019/5/15 下午10:56, Filipe Manana wrote:
> On Fri, May 10, 2019 at 12:30 AM Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>>
>>
>>
>> On 2019/5/9 下午10:49, David Sterba wrote:
>>> On Wed, May 08, 2019 at 06:49:58PM +0800, Qu Wenruo wrote:
>>>> [BUG]
>>>> The following script can cause unexpected fsync failure:
>>>>
>>>>   #!/bin/bash
>>>>
>>>>   dev=/dev/test/test
>>>>   mnt=/mnt/btrfs
>>>>
>>>>   mkfs.btrfs -f $dev -b 512M > /dev/null
>>>>   mount $dev $mnt -o nospace_cache
>>>>
>>>>   # Prealloc one extent
>>>>   xfs_io -f -c "falloc 8k 64m" $mnt/file1
>>>>   # Fill the remaining data space
>>>>   xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
>>>>   sync
>>>>
>>>>   # Write into the prealloc extent
>>>>   xfs_io -c "pwrite 1m 16m" $mnt/file1
>>>>
>>>>   # Reflink then fsync, fsync would fail due to ENOSPC
>>>>   xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
>>>>   umount $dev
>>>>
>>>> The fsync fails with ENOSPC, and the last page of the buffered write is
>>>> lost.
>>>>
>>>> [CAUSE]
>>>> This is caused by:
>>>> - Btrfs' back reference only has extent level granularity
>>>>   So write into shared extent must be CoWed even only part of the extent
>>>>   is shared.
>>>>
>>>> So for above script we have:
>>>> - fallocate
>>>>   Create a preallocated extent where we can do NOCOW write.
>>>>
>>>> - fill all the remaining data and unallocated space
>>>>
>>>> - buffered write into preallocated space
>>>>   As we have not enough space available for data and the extent is not
>>>>   shared (yet) we fall into NOCOW mode.
>>>>
>>>> - reflink
>>>>   Now part of the large preallocated extent is shared, later write
>>>>   into that extent must be CoWed.
>>>>
>>>> - fsync triggers writeback
>>>>   But now the extent is shared and therefore we must fallback into COW
>>>>   mode, which fails with ENOSPC since there's not enough space to
>>>>   allocate data extents.
>>>>
>>>> [WORKAROUND]
>>>> The workaround is to ensure any buffered write in the related extents
>>>> (not just the reflink source range) get flushed before reflink/dedupe,
>>>> so that NOCOW writes succeed that happened before reflinking succeed.
>>>>
>>>> The workaround is expensive
>>>
>>> Can you please quantify that, how big the performance drop is going to
>>> be?
>>
>> Depends on how many dirty pages there are at the timing of reflink/dedupe.
>>
>> If there are a lot, then it would be a delay for reflink/dedupe.
>>
>>>
>>> If the fsync comes soon after reflink, then it's effectively no change.
>>> In case the buffered writes happen on a different range than reflink and
>>> fsync comes later, the buffered writes will stall reflink, right?
>>
>> Fsync doesn't make much difference, it mostly depends on how many dirty
>> pages are.
>>
>> Thus the most impacted use case is concurrent buffered write with
>> reflink/dedupe.
>>
>>>
>>> If there are other similar corner cases we'd better know them in advance
>>> and estimate the impact, that'll be something to look for when we get
>>> complaints that reflink is suddenly slow.
>>>
>>>> NOCOW range, but that needs extra accounting for NOCOW range.
>>>> For now, fix the possible data loss first.
>>>
>>> filemap_flush says
>>>
>>>  437 /**
>>>  438  * filemap_flush - mostly a non-blocking flush
>>>  439  * @mapping:    target address_space
>>>  440  *
>>>  441  * This is a mostly non-blocking flush.  Not suitable for 
>>> data-integrity
>>>  442  * purposes - I/O may not be started against all dirty pages.
>>>  443  *
>>>  444  * Return: %0 on success, negative error code otherwise.
>>>  445  */
>>>
>>> so how does this work together with the statement about preventing data
>>> loss?
>>
>> The data loss is caused by the fact that we can start buffered write
>> without reserving data space, but after reflink/dedupe we have to do CoW.
>> Without enough space, CoW will fail due to ENOSPC.
>>
>> The fix here is, we ensure all dirties pages start their writeback
>> (start btrfs_run_delalloc_range()) before reflink.
>>
>> At btrfs_run_delalloc_range() we determine whether a range goes through
>> NOCOW or COW, and submit ordered extent to do real write back/csum
>> calculation/etc.
>>
>> As long as the whole inode goes through btrfs_run_delalloc_range(), any
>> NOCOW write will go NOCOW on-disk.
>> We don't need to wait for the ordered extent to finish, just ensure all
>> pages goes through delalloc is enough.
>> Waiting for ordered extent will cause even more latency for reflink.
>>
>> Thus the filemap_flush() is enough, as the point is to ensure delalloc
>> is started before reflink.
> 
> I believe that David's comment is related to this part of the comment
> on filemap_flush():
> 
> "I/O may not be started against all dirty pages."
> 
> I.e., his concern being that writeback may not be started and
> therefore we end up with the data loss due to ENOSPC later, and not to
> the technical details of why the ENOSPC failure happens, which is
> already described in the changelog and discussed during the review of
> previous versions of the patch.
> 
> However btrfs has its own writepages() implementation, which even for
> WB_SYNC_NONE (used by filemap_flush) starts writeback (and doesn't
> wait for it to finish if it started before, which is fine for this use
> case).
> Anyway, just my interpretation of the doubt/comment.
> 
> Thanks.
> 
>>
>> Thanks,
>> Qu
>>
> 
>

signature.asc
Description: OpenPGP digital signature

Re: [PATCH v3] btrfs: Flush before reflinking any extent to prevent NOCOW write falling back to CoW without data reservation

Reply via email to