On Tue, May 7, 2019 at 12:13 PM Qu Wenruo <quwenruo.bt...@gmx.com> wrote: > > > > On 2019/5/7 下午4:56, Filipe Manana wrote: > > On Mon, May 6, 2019 at 3:04 AM Qu Wenruo <quwenruo.bt...@gmx.com> wrote: > >> > >> [snip] > >>>> > >>>> For data writeback, it should only cause sync related failure. > >>> > >>> Ok, so please remove the transaction abort comments for next iteration. > >>> By sync related failure, you mean running dealloc fails with ENOSPC, > >>> since when it tries to reserve a data extent it fails as it can't find > >>> any free extent. > >> > >> Well, btrfs has some hidden way to fix such problem by itself already. > >> > >> At buffered write time, we have the following call chain: > >> btrfs_buffered_write() > >> |- btrfs_check_data_free_space() > >> |- btrfs_alloc_data_chunk_ondemand() > >> |- need_commit = 2; /* We have 2 chance to retry, 1 to flush */ > >> |- do_chunk_alloc() /* we have no data space available */ > >> |- if (ret < 0 && ret == -ENOSPC) > >> |- goto commit_trans; /* try to free some space */ > >> |- commit_trans: > >> |- need_commit--; > >> |- if (need_commit > 0) { > >> |- btrfs_start_delalloc_roots(); > >> |- btrfs_wait_ordered_roots(); > >> |- } > >> > >> This means, as long as we hit ENOSPC for data space, we will start write > >> back, thus NODATACOW data will still hit disk as NODATACOW. > >> > >> Such hidden behavior along with always-reserve-data-space solves the > >> problem pretty well. > > > > It doesn't solve the problem you reported in the rfc patch. > > You're right, it doesn't solve the problem at all. > In fact, another bug caused my test script to pass even with some dirty > pages unable to be flushed back. > > But it at least make sure all other pages reach disk as NODATACOW except > the last page. > > > > > What happens: > > > > 1) We have a file with a prealloc extent, that isn't shared > > > > 2) We have 0 bytes of available data space (or any amount less then > > that of the buffered write size) > > > > 3) A buffered write happens that falls within a subrange of the prealloc > > extent. > > We can't reserve space, we do all those things at > > btrfs_alloc_data_chunk_ondemand(), but we can't get any data space > > released, since it's all allocated. > > At that time, we're already flushing all previously buffered write data. > > E.g. if we're writing into one 1M preallocated extent. > The first 4K, we have no data space reserved, dirtied the page, prepare > all delalloc. > > Then the 2nd 4K, we call btrfs_check_data_free_space(), as we're at low > data free space already, we flush all inodes, including the previous 4K > we just dirtied. > Then the first 4K get written to disk NODATACOW, as expected. > > This loop happens until we reach the last page.
We fragment a buffered write into several chunks depending on its size, the size of a page, the size of a pointer, etc. We will hardly iterate for each page alone. If your point is that btrfs_check_data_free_space() can be called multiple times for a single buffered write, yes, that's true if it's large than a certain threshold, but that will hardly be for each individual page, typically it's much more than that. > > > Therefore we fall back to nodatacow mode. We dirty the pages, mark > > the range as dealloc, etc. > > > > 4) The reflink happens, for a subrange of the prealloc extent that > > does not overlap the range of the buffered write. > > Just before the reflink, we only have 1 dirty page (the last page of > that buffered write) doesn't reach disk yet. > > For the final page, we have no choice but do COW, and it fails with -ENOSPC. Yes, isn't that what I said? > > However due to some other problem, the -ENOSPC doesn't reach user space > at all. Yes, isn't that what I said? Yes... Same as the snapshotting case Robbie fixed. No news here. Same as many other silent data loss issues I fixed with fsync a few times sometime ago as well. > > > > > > 5) Some time after the reflink, writeback starts for the inode. > > During the writeback we fallback to COW mode, because the prealloc > > extent is shared, even if the subrange of the buffered write does not > > overlap the reflinked subrange. > > Now the write silently fails with -ENOSPC, and a user doesn't know > > about it unless it does an fsync after that writeback, which will > > report the error via filemap_check_wb_err(). > > > >> We either: > >> - reserve data space > >> Then no matter how it ends, we're OK, although it may end as CoW. > >> > >> - Failed to reserve data space > >> Writeback will be triggered anyway, no way to screw things around. > >> > >> Thus this workaround has nothing to fix, but only make certain NODATACOW > >> reach disk as NODATACOW. > >> > >> It makes some NODATACOW behaves more correctly but won't fix any obvious > >> bug. > >> > >> My personal take is to fix any strange behavior even it won't cause any > >> problem, but the full inode writeback can be performance heavy. > >> > >> So my question is, do we really need this anyway? > > > > Do we need what? Your patch, that logic at > > btrfs_alloc_data_chunk_ondemand(), something else? > > I meant the patch, but the deeper I dig into the problem, more problem I > found. > > The patch is still needed, but there is a more important bug, that > btrfs_run_delalloc_range() failure won't be reported in sync. How could it report a failure? sync calls sync(2), which returns void, meaning it always succeeds and has no way to report errors. See https://linux.die.net/man/2/sync Error tracking and reporting was precisely one of main goals of filemap_check_wb_err() and all the related work Jeff Layton did sometime ago. > > The script here I'm using is: > ------ > #!/bin/bash > > dev=/dev/test/test > mnt=/mnt/btrfs > > #mkfs.btrfs -f $dev -b 1G > /dev/null > #mount $dev $mnt -o nospace_cache > > umount $mnt &> /dev/null > umount $dev &> /dev/null > > dmesg -C > mkfs.btrfs -f $dev -b 512M > /dev/null > > mount $dev $mnt -o nospace_cache > > xfs_io -f -c "falloc 8k 64m" $mnt/file1 > xfs_io -f -c "pwrite 0 -b 4k 370M" $mnt/padding > > sync > btrfs fi df $mnt --raw > > xfs_io -c "pwrite 1m 16m" $mnt/file1 > echo "nodatacow write finished" >> /dev/kmsg > xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1 > echo "reflink finished" >> /dev/kmsg > sync > echo "sync finished ret=$?" >> /dev/kmsg > umount $dev > ------ > > As describe, the last write at 17821696 (17M - 4K) will fail due to ENOSPC. > But the sync succeeded without reporting any problem. > > Thanks, > Qu > > > > > Thanks. > > > >> > >> Thanks, > >> Qu > >> > >>> > >>>> > >>>>> I don't recall starting transactions when running dealloc, and failed > >>>>> to see where after a quick glance to cow_file_range() > >>>>> and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means > >>>>> when starting writeback. > >>>>> > >>>>>> > >>>>>> [CAUSE] > >>>>>> This is due to the fact that btrfs can only do extent level share > >>>>>> check. > >>>>>> > >>>>>> Btrfs can only tell if an extent is shared, no matter if only part of > >>>>>> the > >>>>>> extent is shared or not. > >>>>>> > >>>>>> So for above script we have: > >>>>>> - fallocate > >>>>>> - buffered write > >>>>>> If we don't have enough data space, we fall back to NOCOW check. > >>>>>> At this timming, the extent is not shared, we can skip data > >>>>>> reservation. > >>>>> > >>>>> But in the above example we don't fall to nocow mode when doing the > >>>>> buffered write, as there's plenty of data space available (1Gb - > >>>>> 24Kb). > >>>>> You need to update the example. > >>>> I have to admit that the core part is mostly based on the worst case > >>>> *assumption*. > >>>> > >>>> I'll try to make the case convincing by making it fail directly. > >>> > >>> Great, thanks. > >>> > >>>> > >>>>> > >>>>> > >>>>>> - reflink > >>>>>> Now part of the large preallocated extent is shared. > >>>>>> - delalloc kicks in > >>>>> > >>>>> writeback kicks in > >>>>> > >>>>>> For the NOCOW range, as the preallocated extent is shared, we need > >>>>>> to fall back to COW. > >>>>>> > >>>>>> [WORKAROUND] > >>>>>> The workaround is to ensure any buffered write in the related extents > >>>>>> (not the reflink source range) get flushed before reflink. > >>>>> > >>>>> not the reflink source range -> not just the reflink source range > >>>>> > >>>>>> > >>>>>> However it's pretty expensive to do a comprehensive check. > >>>>>> In the reproducer, the reflink source is just a part of a larger > >>>>> > >>>>> Again, the reproducer needs to be fixed (yes, I tested it even if it's > >>>>> clear by looking at it that it doesn't trigger the nocow case). > >>>>> > >>>>>> preallocated extent, we need to flush all buffered write of that extent > >>>>>> before reflink. > >>>>>> Such backward search can be complex and we may not get much benefit > >>>>>> from > >>>>>> it. > >>>>>> > >>>>>> So this patch will just try to flush the whole inode before reflink. > >>>>> > >>>>> > >>>>>> > >>>>>> Signed-off-by: Qu Wenruo <w...@suse.com> > >>>>>> --- > >>>>>> Reason for RFC: > >>>>>> Flushing an inode just because it's a reflink source is definitely > >>>>>> overkilling, but I don't have any better way to handle it. > >>>>>> > >>>>>> Any comment on this is welcomed. > >>>>>> --- > >>>>>> fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++ > >>>>>> 1 file changed, 22 insertions(+) > >>>>>> > >>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > >>>>>> index 7755b503b348..8caa0edb6fbf 100644 > >>>>>> --- a/fs/btrfs/ioctl.c > >>>>>> +++ b/fs/btrfs/ioctl.c > >>>>>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct > >>>>>> file *file, struct file *file_src, > >>>>>> return ret; > >>>>>> } > >>>>>> > >>>>>> + /* > >>>>>> + * Workaround to make sure NOCOW buffered write reach disk as > >>>>>> NOCOW. > >>>>>> + * > >>>>>> + * Due to the limit of btrfs extent tree design, we can only > >>>>>> have > >>>>>> + * extent level share view. Any part of an extent is shared > >>>>>> then the > >>>>> > >>>>> Any -> If any > >>>>> > >>>>>> + * whole extent is shared and any write into that extent needs > >>>>>> to fall > >>>>> > >>>>> is -> is considered > >>>>> > >>>>>> + * back to COW. > >>>>> > >>>>> I would add, something like: > >>>>> > >>>>> That is, btrfs' back references do not have a block level granularity, > >>>>> they work at the whole extent level. > >>>>> > >>>>>> + * > >>>>>> + * NOCOW buffered write without data space reserved could to > >>>>>> lead to > >>>>>> + * either data space bytes_may_use underflow (kernel warning) > >>>>>> or ENOSPC > >>>>>> + * at delalloc time (transaction abort). > >>>>> > >>>>> I would omit the warning and transaction abort parts, that can change > >>>>> any time. And we have that information in the changelog, so it's not > >>>>> lost. > >>>>> > >>>>>> + * > >>>>>> + * Here we take a shortcut by flush the whole inode. We could > >>>>>> do better > >>>>>> + * by finding all extents in that range and flush the space > >>>>>> referring > >>>>>> + * all those extents. > >>>>>> + * But that's too complex for such corner case. > >>>>>> + */ > >>>>>> + filemap_flush(src->i_mapping); > >>>>>> + if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, > >>>>>> + &BTRFS_I(src)->runtime_flags)) > >>>>>> + filemap_flush(src->i_mapping); > >>>>> > >>>>> So a few comments here: > >>>>> > >>>>> - why just in the clone part? The dedupe side has the same problem, > >>>>> doesn't it? > >>>> > >>>> Right. > >>>> > >>>>> > >>>>> - I would move such flushing to btrfs_remap_file_range_prep - this is > >>>>> where we do the source and target range flush and wait. > >>>>> > >>>>> Can you turn the reproducer into an fstests case? > >>>> > >>>> Sure. > >>>> > >>>> Thanks for the info and all the comment, > >>>> Qu > >>>> > >>>>> > >>>>> Thanks. > >>>>> > >>>>>> + > >>>>>> /* > >>>>>> * Lock destination range to serialize with concurrent > >>>>>> readpages() and > >>>>>> * source range to serialize with relocation. > >>>>>> -- > >>>>>> 2.21.0 > >>>>>> > >>>>> > >>>>> > >>>> > >>> > >>> > >> > > > > > -- Filipe David Manana, “Whether you think you can, or you think you can't — you're right.”