On Fri, May 3, 2019 at 11:18 AM Qu Wenruo <quwenruo.bt...@gmx.com> wrote: > > > > On 2019/5/3 下午5:21, Filipe Manana wrote: > > On Fri, May 3, 2019 at 2:46 AM Qu Wenruo <w...@suse.com> wrote: > > > > What a great subject. The "reflink:" part is unnecessary, since the > > rest of the subject already mentions it, that makes it a bit shorter. > > > >> > >> [BUG] > >> The following command can lead to unexpected data COW: > >> > >> #!/bin/bash > >> > >> dev=/dev/test/test > >> mnt=/mnt/btrfs > >> > >> mkfs.btrfs -f $dev -b 1G > /dev/null > >> mount $dev $mnt -o nospace_cache > >> > >> xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1 > >> xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1 > >> umount $dev > >> > >> The result extent will be > >> > >> item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53 > >> generation 6 type 2 (prealloc) > >> prealloc data disk byte 13631488 nr 28672 > >> item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53 > >> generation 6 type 1 (regular) > >> extent data disk byte 13660160 nr 12288 <<< COW > >> item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53 > >> generation 6 type 2 (prealloc) > >> prealloc data disk byte 13631488 nr 28672 > >> > >> Currently we always reserve space even for NOCOW buffered write, thus > > > > I would add 'data' between 'reserve' and 'space', to be clear. > > > >> under most case it shouldn't cause anything wrong even we fall back to > >> COW. > > > > even we ... -> even if we fallback to COW when running delalloc / > > starting writeback. > > > >> > >> However when we're out of data space, we fall back to skip data space if > >> we can do NOCOW write. > > > > we fall back to skip data space ... -> we fallback to NOCOW write > > without reserving data space. > > > >> > >> If such behavior happens under that case, we could hit the following > >> problems: > > > >> - data space bytes_may_use underflow > >> This will cause kernel warning. > > > > Ok. > > > >> > >> - ENOSPC at delalloc time > > > > at delalloc time - that is an ambiguous term you use through the change log. > > In fact, I have a lot of uncertain terminology through kernel. > > Things like flush get referred multiple times in different context (e.g. > filemap flush, flushoncommit, super block flush). > > If we have a terminology list, we can't be more happy to follow.
So, some is kernel wide while others are btrfs specific. A buffered write creates dealloc - copies data to pages, marks the pages as dirty and tags the range in the extent io tree as dellaloc. Running delalloc, flushes writeback (starts IO for the dirty pages and tags them as under writeback) and does other necessary things (like reserving extents). When writeback finishes, we add a task to a work queue to run btrfs_finish_ordered_io - after that happens we say that the ordered extent completed. It can get ambiguous very often. > > > You mean when running/starting delalloc, which happens when starting > > writeback, > > but that could be confused with creating delalloc, which happens > > during the buffered write path. > > Another confusion due to different terminology. > > My understanding of the write path is: > buffered write -> delalloc (start delalloc) -> ordered extent -> finish > ordered io. > > Thus I missed the delalloc creating part. > > > > > So I would always replace 'dealloc time' with 'when running delalloc' > > (or when starting writeback). > > I will change use running delalloc, with extra comment reference to the > function name (btrfs_run_delalloc_range()). > > > > >> This will lead to transaction abort and fs forced to RO. > > > > Where does that happen exactly? > My bad, I got confused with metadata writeback path. > > For data writeback, it should only cause sync related failure. Ok, so please remove the transaction abort comments for next iteration. By sync related failure, you mean running dealloc fails with ENOSPC, since when it tries to reserve a data extent it fails as it can't find any free extent. > > > I don't recall starting transactions when running dealloc, and failed > > to see where after a quick glance to cow_file_range() > > and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means > > when starting writeback. > > > >> > >> [CAUSE] > >> This is due to the fact that btrfs can only do extent level share check. > >> > >> Btrfs can only tell if an extent is shared, no matter if only part of the > >> extent is shared or not. > >> > >> So for above script we have: > >> - fallocate > >> - buffered write > >> If we don't have enough data space, we fall back to NOCOW check. > >> At this timming, the extent is not shared, we can skip data > >> reservation. > > > > But in the above example we don't fall to nocow mode when doing the > > buffered write, as there's plenty of data space available (1Gb - > > 24Kb). > > You need to update the example. > I have to admit that the core part is mostly based on the worst case > *assumption*. > > I'll try to make the case convincing by making it fail directly. Great, thanks. > > > > > > >> - reflink > >> Now part of the large preallocated extent is shared. > >> - delalloc kicks in > > > > writeback kicks in > > > >> For the NOCOW range, as the preallocated extent is shared, we need > >> to fall back to COW. > >> > >> [WORKAROUND] > >> The workaround is to ensure any buffered write in the related extents > >> (not the reflink source range) get flushed before reflink. > > > > not the reflink source range -> not just the reflink source range > > > >> > >> However it's pretty expensive to do a comprehensive check. > >> In the reproducer, the reflink source is just a part of a larger > > > > Again, the reproducer needs to be fixed (yes, I tested it even if it's > > clear by looking at it that it doesn't trigger the nocow case). > > > >> preallocated extent, we need to flush all buffered write of that extent > >> before reflink. > >> Such backward search can be complex and we may not get much benefit from > >> it. > >> > >> So this patch will just try to flush the whole inode before reflink. > > > > > >> > >> Signed-off-by: Qu Wenruo <w...@suse.com> > >> --- > >> Reason for RFC: > >> Flushing an inode just because it's a reflink source is definitely > >> overkilling, but I don't have any better way to handle it. > >> > >> Any comment on this is welcomed. > >> --- > >> fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++ > >> 1 file changed, 22 insertions(+) > >> > >> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > >> index 7755b503b348..8caa0edb6fbf 100644 > >> --- a/fs/btrfs/ioctl.c > >> +++ b/fs/btrfs/ioctl.c > >> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file > >> *file, struct file *file_src, > >> return ret; > >> } > >> > >> + /* > >> + * Workaround to make sure NOCOW buffered write reach disk as > >> NOCOW. > >> + * > >> + * Due to the limit of btrfs extent tree design, we can only have > >> + * extent level share view. Any part of an extent is shared then > >> the > > > > Any -> If any > > > >> + * whole extent is shared and any write into that extent needs to > >> fall > > > > is -> is considered > > > >> + * back to COW. > > > > I would add, something like: > > > > That is, btrfs' back references do not have a block level granularity, > > they work at the whole extent level. > > > >> + * > >> + * NOCOW buffered write without data space reserved could to lead > >> to > >> + * either data space bytes_may_use underflow (kernel warning) or > >> ENOSPC > >> + * at delalloc time (transaction abort). > > > > I would omit the warning and transaction abort parts, that can change > > any time. And we have that information in the changelog, so it's not > > lost. > > > >> + * > >> + * Here we take a shortcut by flush the whole inode. We could do > >> better > >> + * by finding all extents in that range and flush the space > >> referring > >> + * all those extents. > >> + * But that's too complex for such corner case. > >> + */ > >> + filemap_flush(src->i_mapping); > >> + if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, > >> + &BTRFS_I(src)->runtime_flags)) > >> + filemap_flush(src->i_mapping); > > > > So a few comments here: > > > > - why just in the clone part? The dedupe side has the same problem, doesn't > > it? > > Right. > > > > > - I would move such flushing to btrfs_remap_file_range_prep - this is > > where we do the source and target range flush and wait. > > > > Can you turn the reproducer into an fstests case? > > Sure. > > Thanks for the info and all the comment, > Qu > > > > > Thanks. > > > >> + > >> /* > >> * Lock destination range to serialize with concurrent readpages() > >> and > >> * source range to serialize with relocation. > >> -- > >> 2.21.0 > >> > > > > > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”