On Wed, May 08, 2019 at 06:49:58PM +0800, Qu Wenruo wrote:
> [BUG]
> The following script can cause unexpected fsync failure:
> 
>   #!/bin/bash
> 
>   dev=/dev/test/test
>   mnt=/mnt/btrfs
> 
>   mkfs.btrfs -f $dev -b 512M > /dev/null
>   mount $dev $mnt -o nospace_cache
> 
>   # Prealloc one extent
>   xfs_io -f -c "falloc 8k 64m" $mnt/file1
>   # Fill the remaining data space
>   xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding
>   sync
> 
>   # Write into the prealloc extent
>   xfs_io -c "pwrite 1m 16m" $mnt/file1
> 
>   # Reflink then fsync, fsync would fail due to ENOSPC
>   xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
>   umount $dev
> 
> The fsync fails with ENOSPC, and the last page of the buffered write is
> lost.
> 
> [CAUSE]
> This is caused by:
> - Btrfs' back reference only has extent level granularity
>   So write into shared extent must be CoWed even only part of the extent
>   is shared.
> 
> So for above script we have:
> - fallocate
>   Create a preallocated extent where we can do NOCOW write.
> 
> - fill all the remaining data and unallocated space
> 
> - buffered write into preallocated space
>   As we have not enough space available for data and the extent is not
>   shared (yet) we fall into NOCOW mode.
> 
> - reflink
>   Now part of the large preallocated extent is shared, later write
>   into that extent must be CoWed.
> 
> - fsync triggers writeback
>   But now the extent is shared and therefore we must fallback into COW
>   mode, which fails with ENOSPC since there's not enough space to
>   allocate data extents.
> 
> [WORKAROUND]
> The workaround is to ensure any buffered write in the related extents
> (not just the reflink source range) get flushed before reflink/dedupe,
> so that NOCOW writes succeed that happened before reflinking succeed.
> 
> The workaround is expensive

Can you please quantify that, how big the performance drop is going to
be?

If the fsync comes soon after reflink, then it's effectively no change.
In case the buffered writes happen on a different range than reflink and
fsync comes later, the buffered writes will stall reflink, right?

If there are other similar corner cases we'd better know them in advance
and estimate the impact, that'll be something to look for when we get
complaints that reflink is suddenly slow.

> NOCOW range, but that needs extra accounting for NOCOW range.
> For now, fix the possible data loss first.

filemap_flush says

 437 /**
 438  * filemap_flush - mostly a non-blocking flush
 439  * @mapping:    target address_space
 440  *
 441  * This is a mostly non-blocking flush.  Not suitable for data-integrity
 442  * purposes - I/O may not be started against all dirty pages.
 443  *
 444  * Return: %0 on success, negative error code otherwise.
 445  */

so how does this work together with the statement about preventing data
loss?

Reply via email to