On 2019/5/3 下午5:21, Filipe Manana wrote: > On Fri, May 3, 2019 at 2:46 AM Qu Wenruo <w...@suse.com> wrote: > > What a great subject. The "reflink:" part is unnecessary, since the > rest of the subject already mentions it, that makes it a bit shorter. > >> >> [BUG] >> The following command can lead to unexpected data COW: >> >> #!/bin/bash >> >> dev=/dev/test/test >> mnt=/mnt/btrfs >> >> mkfs.btrfs -f $dev -b 1G > /dev/null >> mount $dev $mnt -o nospace_cache >> >> xfs_io -f -c "falloc 8k 24k" -c "pwrite 12k 8k" $mnt/file1 >> xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1 >> umount $dev >> >> The result extent will be >> >> item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53 >> generation 6 type 2 (prealloc) >> prealloc data disk byte 13631488 nr 28672 >> item 8 key (257 EXTENT_DATA 12288) itemoff 15707 itemsize 53 >> generation 6 type 1 (regular) >> extent data disk byte 13660160 nr 12288 <<< COW >> item 9 key (257 EXTENT_DATA 24576) itemoff 15654 itemsize 53 >> generation 6 type 2 (prealloc) >> prealloc data disk byte 13631488 nr 28672 >> >> Currently we always reserve space even for NOCOW buffered write, thus > > I would add 'data' between 'reserve' and 'space', to be clear. > >> under most case it shouldn't cause anything wrong even we fall back to >> COW. > > even we ... -> even if we fallback to COW when running delalloc / > starting writeback. > >> >> However when we're out of data space, we fall back to skip data space if >> we can do NOCOW write. > > we fall back to skip data space ... -> we fallback to NOCOW write > without reserving data space. > >> >> If such behavior happens under that case, we could hit the following >> problems: > >> - data space bytes_may_use underflow >> This will cause kernel warning. > > Ok. > >> >> - ENOSPC at delalloc time > > at delalloc time - that is an ambiguous term you use through the change log.
In fact, I have a lot of uncertain terminology through kernel. Things like flush get referred multiple times in different context (e.g. filemap flush, flushoncommit, super block flush). If we have a terminology list, we can't be more happy to follow. > You mean when running/starting delalloc, which happens when starting > writeback, > but that could be confused with creating delalloc, which happens > during the buffered write path. Another confusion due to different terminology. My understanding of the write path is: buffered write -> delalloc (start delalloc) -> ordered extent -> finish ordered io. Thus I missed the delalloc creating part. > > So I would always replace 'dealloc time' with 'when running delalloc' > (or when starting writeback). I will change use running delalloc, with extra comment reference to the function name (btrfs_run_delalloc_range()). > >> This will lead to transaction abort and fs forced to RO. > > Where does that happen exactly? My bad, I got confused with metadata writeback path. For data writeback, it should only cause sync related failure. > I don't recall starting transactions when running dealloc, and failed > to see where after a quick glance to cow_file_range() > and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means > when starting writeback. > >> >> [CAUSE] >> This is due to the fact that btrfs can only do extent level share check. >> >> Btrfs can only tell if an extent is shared, no matter if only part of the >> extent is shared or not. >> >> So for above script we have: >> - fallocate >> - buffered write >> If we don't have enough data space, we fall back to NOCOW check. >> At this timming, the extent is not shared, we can skip data >> reservation. > > But in the above example we don't fall to nocow mode when doing the > buffered write, as there's plenty of data space available (1Gb - > 24Kb). > You need to update the example. I have to admit that the core part is mostly based on the worst case *assumption*. I'll try to make the case convincing by making it fail directly. > > >> - reflink >> Now part of the large preallocated extent is shared. >> - delalloc kicks in > > writeback kicks in > >> For the NOCOW range, as the preallocated extent is shared, we need >> to fall back to COW. >> >> [WORKAROUND] >> The workaround is to ensure any buffered write in the related extents >> (not the reflink source range) get flushed before reflink. > > not the reflink source range -> not just the reflink source range > >> >> However it's pretty expensive to do a comprehensive check. >> In the reproducer, the reflink source is just a part of a larger > > Again, the reproducer needs to be fixed (yes, I tested it even if it's > clear by looking at it that it doesn't trigger the nocow case). > >> preallocated extent, we need to flush all buffered write of that extent >> before reflink. >> Such backward search can be complex and we may not get much benefit from >> it. >> >> So this patch will just try to flush the whole inode before reflink. > > >> >> Signed-off-by: Qu Wenruo <w...@suse.com> >> --- >> Reason for RFC: >> Flushing an inode just because it's a reflink source is definitely >> overkilling, but I don't have any better way to handle it. >> >> Any comment on this is welcomed. >> --- >> fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++ >> 1 file changed, 22 insertions(+) >> >> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c >> index 7755b503b348..8caa0edb6fbf 100644 >> --- a/fs/btrfs/ioctl.c >> +++ b/fs/btrfs/ioctl.c >> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file >> *file, struct file *file_src, >> return ret; >> } >> >> + /* >> + * Workaround to make sure NOCOW buffered write reach disk as NOCOW. >> + * >> + * Due to the limit of btrfs extent tree design, we can only have >> + * extent level share view. Any part of an extent is shared then the > > Any -> If any > >> + * whole extent is shared and any write into that extent needs to >> fall > > is -> is considered > >> + * back to COW. > > I would add, something like: > > That is, btrfs' back references do not have a block level granularity, > they work at the whole extent level. > >> + * >> + * NOCOW buffered write without data space reserved could to lead to >> + * either data space bytes_may_use underflow (kernel warning) or >> ENOSPC >> + * at delalloc time (transaction abort). > > I would omit the warning and transaction abort parts, that can change > any time. And we have that information in the changelog, so it's not > lost. > >> + * >> + * Here we take a shortcut by flush the whole inode. We could do >> better >> + * by finding all extents in that range and flush the space referring >> + * all those extents. >> + * But that's too complex for such corner case. >> + */ >> + filemap_flush(src->i_mapping); >> + if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, >> + &BTRFS_I(src)->runtime_flags)) >> + filemap_flush(src->i_mapping); > > So a few comments here: > > - why just in the clone part? The dedupe side has the same problem, doesn't > it? Right. > > - I would move such flushing to btrfs_remap_file_range_prep - this is > where we do the source and target range flush and wait. > > Can you turn the reproducer into an fstests case? Sure. Thanks for the info and all the comment, Qu > > Thanks. > >> + >> /* >> * Lock destination range to serialize with concurrent readpages() >> and >> * source range to serialize with relocation. >> -- >> 2.21.0 >> > >
signature.asc
Description: OpenPGP digital signature