Gentle ping? I didn't see this patch included in misc-next. Anything went wrong?
Thanks, Qu On 2019/5/15 下午10:56, Filipe Manana wrote: > On Fri, May 10, 2019 at 12:30 AM Qu Wenruo <quwenruo.bt...@gmx.com> wrote: >> >> >> >> On 2019/5/9 下午10:49, David Sterba wrote: >>> On Wed, May 08, 2019 at 06:49:58PM +0800, Qu Wenruo wrote: >>>> [BUG] >>>> The following script can cause unexpected fsync failure: >>>> >>>> #!/bin/bash >>>> >>>> dev=/dev/test/test >>>> mnt=/mnt/btrfs >>>> >>>> mkfs.btrfs -f $dev -b 512M > /dev/null >>>> mount $dev $mnt -o nospace_cache >>>> >>>> # Prealloc one extent >>>> xfs_io -f -c "falloc 8k 64m" $mnt/file1 >>>> # Fill the remaining data space >>>> xfs_io -f -c "pwrite 0 -b 4k 512M" $mnt/padding >>>> sync >>>> >>>> # Write into the prealloc extent >>>> xfs_io -c "pwrite 1m 16m" $mnt/file1 >>>> >>>> # Reflink then fsync, fsync would fail due to ENOSPC >>>> xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1 >>>> umount $dev >>>> >>>> The fsync fails with ENOSPC, and the last page of the buffered write is >>>> lost. >>>> >>>> [CAUSE] >>>> This is caused by: >>>> - Btrfs' back reference only has extent level granularity >>>> So write into shared extent must be CoWed even only part of the extent >>>> is shared. >>>> >>>> So for above script we have: >>>> - fallocate >>>> Create a preallocated extent where we can do NOCOW write. >>>> >>>> - fill all the remaining data and unallocated space >>>> >>>> - buffered write into preallocated space >>>> As we have not enough space available for data and the extent is not >>>> shared (yet) we fall into NOCOW mode. >>>> >>>> - reflink >>>> Now part of the large preallocated extent is shared, later write >>>> into that extent must be CoWed. >>>> >>>> - fsync triggers writeback >>>> But now the extent is shared and therefore we must fallback into COW >>>> mode, which fails with ENOSPC since there's not enough space to >>>> allocate data extents. >>>> >>>> [WORKAROUND] >>>> The workaround is to ensure any buffered write in the related extents >>>> (not just the reflink source range) get flushed before reflink/dedupe, >>>> so that NOCOW writes succeed that happened before reflinking succeed. >>>> >>>> The workaround is expensive >>> >>> Can you please quantify that, how big the performance drop is going to >>> be? >> >> Depends on how many dirty pages there are at the timing of reflink/dedupe. >> >> If there are a lot, then it would be a delay for reflink/dedupe. >> >>> >>> If the fsync comes soon after reflink, then it's effectively no change. >>> In case the buffered writes happen on a different range than reflink and >>> fsync comes later, the buffered writes will stall reflink, right? >> >> Fsync doesn't make much difference, it mostly depends on how many dirty >> pages are. >> >> Thus the most impacted use case is concurrent buffered write with >> reflink/dedupe. >> >>> >>> If there are other similar corner cases we'd better know them in advance >>> and estimate the impact, that'll be something to look for when we get >>> complaints that reflink is suddenly slow. >>> >>>> NOCOW range, but that needs extra accounting for NOCOW range. >>>> For now, fix the possible data loss first. >>> >>> filemap_flush says >>> >>> 437 /** >>> 438 * filemap_flush - mostly a non-blocking flush >>> 439 * @mapping: target address_space >>> 440 * >>> 441 * This is a mostly non-blocking flush. Not suitable for >>> data-integrity >>> 442 * purposes - I/O may not be started against all dirty pages. >>> 443 * >>> 444 * Return: %0 on success, negative error code otherwise. >>> 445 */ >>> >>> so how does this work together with the statement about preventing data >>> loss? >> >> The data loss is caused by the fact that we can start buffered write >> without reserving data space, but after reflink/dedupe we have to do CoW. >> Without enough space, CoW will fail due to ENOSPC. >> >> The fix here is, we ensure all dirties pages start their writeback >> (start btrfs_run_delalloc_range()) before reflink. >> >> At btrfs_run_delalloc_range() we determine whether a range goes through >> NOCOW or COW, and submit ordered extent to do real write back/csum >> calculation/etc. >> >> As long as the whole inode goes through btrfs_run_delalloc_range(), any >> NOCOW write will go NOCOW on-disk. >> We don't need to wait for the ordered extent to finish, just ensure all >> pages goes through delalloc is enough. >> Waiting for ordered extent will cause even more latency for reflink. >> >> Thus the filemap_flush() is enough, as the point is to ensure delalloc >> is started before reflink. > > I believe that David's comment is related to this part of the comment > on filemap_flush(): > > "I/O may not be started against all dirty pages." > > I.e., his concern being that writeback may not be started and > therefore we end up with the data loss due to ENOSPC later, and not to > the technical details of why the ENOSPC failure happens, which is > already described in the changelog and discussed during the review of > previous versions of the patch. > > However btrfs has its own writepages() implementation, which even for > WB_SYNC_NONE (used by filemap_flush) starts writeback (and doesn't > wait for it to finish if it started before, which is fine for this use > case). > Anyway, just my interpretation of the doubt/comment. > > Thanks. > >> >> Thanks, >> Qu >> > >
signature.asc
Description: OpenPGP digital signature