On 2019/5/6 上午12:24, Filipe Manana wrote: [snip] >>> >>> Yes, also my reason for RFC. >>> >>> But it shouldn't be that heavy, as after the first dedupe/reflink, most >>> IO should be flushed back, later dedupe has much less work to do. >> >> Sure, but if writes are continuously happening (e.g. writes at offset >> 10GB, dedupe at 1GB), these will get flushed out on the next dedupe. >> I'm thinking of scenarious where both writes and dedupes are happening >> continuously, e.g. a host with multiple VM images running out of raw >> image files that are deduped on the host filesystem. >> >> I'm not sure what all the conditions for this flush are. From the list >> above, it sounds like this only occurs _after_ the filesystem has found >> there is no free space. If that's true, then the extra overhead is >> negligible since it rarely happens (assuming that having no free space >> is a rare condition for filesystems). > > The problem is not that flush is done only when low on available space. > The flush would always happen on the entire source file before > reflinking, so that buffered writes that happened before the > clone/dedupe operation and "entered" nodatacow mode (because at the > time there was not enough available data space) will not fail when > their writeback starts - which would happen after the reflinking - > that's why the entire range is flushed. > > Even if btrfs' reference counts are tracked per extent and not per > block, here we could maybe do something like check each reference, > extract the inode number, root number and offset. Then use that to > find the respective file extent items, and using those extract their > length and determine exactly which parts (blocks) of an extent are > shared. That would be a lot of work to do, and would always be racy > for checks for inodes that are not the inode we have locked for the > reflink operation. Very impractical.
To add my idea on better backref (block level), it's more impractical than I thought. From extent double/triple split, to how to handle old extents in old snapshot, it's way too expensive from the developer's respect. > > So it's one more big inconvenience from the extent based back > references, other then the already known space wasting inconvenience > (even if only 1 block of an extent is really referenced, the rest of > the extent is unavailable for allocation, considered used space). Currently this patch is only to be a workaround. There is an idea of introducing new extent io tree bit, NODATACOW for this case. Buffered write with NODATACOW will set the bit, and for the specific problem described here, we only need to flush the range with NODATACOW bit set. And that bit can also be used to detect unexpected COW with no data space reserved. But that need extra work/testing (especially with my special sauce of doing NODATACOW at buffered write time). At least we have some idea on how to reduce the overhead. Thanks, Qu > > > >> >> >>>> e.g. if the file is a big VM image file and it is used src and for dedupe >>>> (which is quite common in VM image files), we'd be hammering the disk >>>> with writes similar to hitting it with fsync() in a tight loop? >>> >>> The original behavior also flush the target and source range, so we're >>> not completely creating some new overhead. >>> >>> Thanks, >>> Qu >>> >>>> >>>>> Signed-off-by: Qu Wenruo <w...@suse.com> >>>>> --- >>>>> Reason for RFC: >>>>> Flushing an inode just because it's a reflink source is definitely >>>>> overkilling, but I don't have any better way to handle it. >>>>> >>>>> Any comment on this is welcomed. >>>>> --- >>>>> fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++ >>>>> 1 file changed, 22 insertions(+) >>>>> >>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c >>>>> index 7755b503b348..8caa0edb6fbf 100644 >>>>> --- a/fs/btrfs/ioctl.c >>>>> +++ b/fs/btrfs/ioctl.c >>>>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct file >>>>> *file, struct file *file_src, >>>>> return ret; >>>>> } >>>>> >>>>> + /* >>>>> + * Workaround to make sure NOCOW buffered write reach disk as NOCOW. >>>>> + * >>>>> + * Due to the limit of btrfs extent tree design, we can only have >>>>> + * extent level share view. Any part of an extent is shared then the >>>>> + * whole extent is shared and any write into that extent needs to fall >>>>> + * back to COW. >>>>> + * >>>>> + * NOCOW buffered write without data space reserved could to lead to >>>>> + * either data space bytes_may_use underflow (kernel warning) or ENOSPC >>>>> + * at delalloc time (transaction abort). >>>>> + * >>>>> + * Here we take a shortcut by flush the whole inode. We could do better >>>>> + * by finding all extents in that range and flush the space referring >>>>> + * all those extents. >>>>> + * But that's too complex for such corner case. >>>>> + */ >>>>> + filemap_flush(src->i_mapping); >>>>> + if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, >>>>> + &BTRFS_I(src)->runtime_flags)) >>>>> + filemap_flush(src->i_mapping); >>>>> + >>>>> /* >>>>> * Lock destination range to serialize with concurrent readpages() and >>>>> * source range to serialize with relocation. >>>>> -- >>>>> 2.21.0 >>>>> >>> >> >> >> > >
signature.asc
Description: OpenPGP digital signature