On Wed, May 15, 2019 at 03:56:31PM +0100, Filipe Manana wrote:
> On Fri, May 10, 2019 at 12:30 AM Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
> > On 2019/5/9 10:49 PM, David Sterba wrote:
> > > On Wed, May 08, 2019 at 06:49:58PM +0800, Qu Wenruo wrote:
> > >> [BUG]
> > >> The following script can cause unexpected fsync failure:
> > >>
> > >>   #!/bin/bash
> > >>
> > >>   dev=/dev/test/test
> > >>   mnt=/mnt/btrfs
> > >>
> > >>   mkfs.btrfs -f $dev -b 512M > /dev/null
> > >>   mount $dev $mnt -o nospace_cache
> > >>
> > >>   # Prealloc one extent
> > >>   xfs_io -f -c "falloc 8k 64m" $mnt/file1
> > >>   # Fill the remaining data space
> > >>   xfs_io -f -c "pwrite -b 4k 0 512M" $mnt/padding
> > >>   sync
> > >>
> > >>   # Write into the prealloc extent
> > >>   xfs_io -c "pwrite 1m 16m" $mnt/file1
> > >>
> > >>   # Reflink then fsync, fsync would fail due to ENOSPC
> > >>   xfs_io -c "reflink $mnt/file1 8k 0 4k" -c "fsync" $mnt/file1
> > >>   umount $dev
> > >>
> > >> The fsync fails with ENOSPC, and the last page of the buffered write is
> > >> lost.
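
For reference, xfs_io's "reflink" command wraps the FICLONERANGE ioctl. Below
is a minimal userspace sketch of just the reflink + fsync step; the path and
offsets are taken from the script above, the earlier falloc/pwrite/sync steps
are assumed to have already run, and this is an illustration rather than part
of the original report.

  /* reflink part of file1 back into itself, then fsync, as in the script */
  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <unistd.h>
  #include <linux/fs.h>           /* FICLONERANGE, struct file_clone_range */

  int main(void)
  {
          int fd = open("/mnt/btrfs/file1", O_RDWR);

          if (fd < 0) {
                  perror("open");
                  return 1;
          }

          /* clone 4k from offset 8k of file1 to offset 0 of the same file */
          struct file_clone_range fcr = {
                  .src_fd = fd,
                  .src_offset = 8 * 1024,
                  .src_length = 4 * 1024,
                  .dest_offset = 0,
          };
          if (ioctl(fd, FICLONERANGE, &fcr) < 0)
                  perror("FICLONERANGE");

          /* with the bug, this reports ENOSPC and the last page is lost */
          if (fsync(fd) < 0)
                  fprintf(stderr, "fsync: %s\n", strerror(errno));

          close(fd);
          return 0;
  }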
> > >>
> > >> [CAUSE]
> > >> This is caused by:
> > >> - Btrfs' back references only have extent-level granularity
> > >>   So a write into a shared extent must be CoWed even if only part of
> > >>   the extent is shared.
> > >>
> > >> So for above script we have:
> > >> - fallocate
> > >>   Create a preallocated extent where we can do NOCOW write.
> > >>
> > >> - fill all the remaining data and unallocated space
> > >>
> > >> - buffered write into preallocated space
> > >>   As there is not enough data space available and the extent is not
> > >>   shared (yet), we fall into NOCOW mode.
> > >>
> > >> - reflink
> > >>   Now part of the large preallocated extent is shared, so later writes
> > >>   into that extent must be CoWed.
> > >>
> > >> - fsync triggers writeback
> > >>   But now the extent is shared, and therefore we must fall back to COW
> > >>   mode, which fails with ENOSPC since there's not enough space to
> > >>   allocate data extents.
> > >>
> > >> [WORKAROUND]
> > >> The workaround is to ensure any buffered writes in the related extents
> > >> (not just the reflink source range) get flushed before reflink/dedupe,
> > >> so that NOCOW writes which happened before the reflink still succeed.
> > >>
> > >> The workaround is expensive
> > >
> > > Can you please quantify that, how big the performance drop is going to
> > > be?
> >
> > Depends on how many dirty pages there are at the time of the reflink/dedupe.
> >
> > If there are a lot, then the reflink/dedupe will be delayed.
> >
> > >
> > > If the fsync comes soon after reflink, then it's effectively no change.
> > > In case the buffered writes happen on a different range than reflink and
> > > fsync comes later, the buffered writes will stall reflink, right?
> >
> > Fsync doesn't make much difference; it mostly depends on how many dirty
> > pages there are.
> >
> > Thus the most impacted use case is concurrent buffered writes with
> > reflink/dedupe.
> >
> > >
> > > If there are other similar corner cases we'd better know them in advance
> > > and estimate the impact, that'll be something to look for when we get
> > > complaints that reflink is suddenly slow.
> > >
> > >> NOCOW range, but that needs extra accounting for the NOCOW range.
> > >> For now, fix the possible data loss first.
> > >
> > > filemap_flush says
> > >
> > > /**
> > >  * filemap_flush - mostly a non-blocking flush
> > >  * @mapping:    target address_space
> > >  *
> > >  * This is a mostly non-blocking flush.  Not suitable for data-integrity
> > >  * purposes - I/O may not be started against all dirty pages.
> > >  *
> > >  * Return: %0 on success, negative error code otherwise.
> > >  */
> > >
> > > so how does this work together with the statement about preventing data
> > > loss?
> >
> > The data loss is caused by the fact that we can start a buffered write
> > without reserving data space, but after reflink/dedupe we have to do CoW.
> > Without enough space, CoW will fail with ENOSPC.
> >
> > The fix here is to ensure all dirty pages start their writeback
> > (go through btrfs_run_delalloc_range()) before reflink.
> >
> > In btrfs_run_delalloc_range() we determine whether a range goes through
> > NOCOW or COW, and submit an ordered extent to do the real writeback/csum
> > calculation/etc.
> >
> > As long as the whole inode goes through btrfs_run_delalloc_range(), any
> > NOCOW write will go NOCOW on-disk.
> > We don't need to wait for the ordered extents to finish; ensuring all
> > pages go through delalloc is enough.
> > Waiting for the ordered extents would cause even more latency for reflink.
> >
> > Thus the filemap_flush() is enough, as the point is to ensure delalloc
> > is started before reflink.
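
To make this concrete, here is a minimal kernel-style sketch (not the actual
patch) of the workaround discussed in this thread: kick off, but do not wait
for, writeback on both inodes before the clone/dedupe range operation
proceeds, so every dirty range passes through btrfs_run_delalloc_range()
while the extents involved are still unshared. The helper name and its call
site are hypothetical; filemap_flush() is the real, mostly non-blocking API
mentioned above.

  #include <linux/fs.h>   /* struct inode, filemap_flush() */

  /* hypothetical helper: start writeback on both sides of a remap */
  static int flush_dirty_pages_before_remap(struct inode *inode_in,
                                            struct inode *inode_out)
  {
          int ret;

          /* starts delalloc for all dirty pages, does not wait for the I/O */
          ret = filemap_flush(inode_in->i_mapping);
          if (ret < 0)
                  return ret;

          return filemap_flush(inode_out->i_mapping);
  }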
> 
> I believe that David's comment is related to this part of the comment
> on filemap_flush():
> 
> "I/O may not be started against all dirty pages."
> 
> I.e., his concern is that writeback may not be started and that we
> therefore end up with data loss due to ENOSPC later, not the technical
> details of why the ENOSPC failure happens, which are already described
> in the changelog and were discussed during the review of previous
> versions of the patch.
> 
> However btrfs has its own writepages() implementation, which even for
> WB_SYNC_NONE (used by filemap_flush()) starts writeback (and doesn't
> wait for writeback that was already started to finish, which is fine
> for this use case).
> Anyway, just my interpretation of the doubt/comment.
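
For context, a simplified illustration (not the literal upstream code) of the
writepages behaviour described above, expressed as a hypothetical helper:
writeback is started even for WB_SYNC_NONE, and a page already under
writeback is only waited on by data-integrity callers (WB_SYNC_ALL); in the
WB_SYNC_NONE mode used by filemap_flush() such a page is simply skipped,
which is fine because it already went through delalloc.

  #include <linux/mm.h>
  #include <linux/pagemap.h>
  #include <linux/writeback.h>

  /* hypothetical condensation of the per-page decision in btrfs writepages */
  static bool should_write_page(struct page *page,
                                struct writeback_control *wbc)
  {
          if (wbc->sync_mode != WB_SYNC_NONE)
                  wait_on_page_writeback(page);   /* integrity callers wait */

          /* WB_SYNC_NONE: skip pages already under writeback, don't block */
          if (PageWriteback(page))
                  return false;

          return clear_page_dirty_for_io(page);
  }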

That's correct and answers my questions, thanks.
