On Tue, May 7, 2019 at 12:13 PM Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
>
>
>
> On 2019/5/7 下午4:56, Filipe Manana wrote:
> > On Mon, May 6, 2019 at 3:04 AM Qu Wenruo <quwenruo.bt...@gmx.com> wrote:
> >>
> >> [snip]
> >>>>
> >>>> For data writeback, it should only cause sync related failure.
> >>>
> >>> Ok, so please remove the transaction abort comments for next iteration.
> >>> By sync related failure, you mean running dealloc fails with ENOSPC,
> >>> since when it tries to reserve a data extent it fails as it can't find
> >>> any free extent.
> >>
> >> Well, btrfs has some hidden way to fix such problem by itself already.
> >>
> >> At buffered write time, we have the following call chain:
> >> btrfs_buffered_write()
> >> |- btrfs_check_data_free_space()
> >>    |- btrfs_alloc_data_chunk_ondemand()
> >>       |- need_commit = 2; /* We have 2 chance to retry, 1 to flush */
> >>       |- do_chunk_alloc() /* we have no data space available */
> >>       |- if (ret < 0 && ret == -ENOSPC)
> >>       |-     goto commit_trans;  /* try to free some space */
> >>       |- commit_trans:
> >>       |-     need_commit--;
> >>       |-     if (need_commit > 0) {
> >>       |-         btrfs_start_delalloc_roots();
> >>       |-         btrfs_wait_ordered_roots();
> >>       |-     }
> >>
> >> This means, as long as we hit ENOSPC for data space, we will start write
> >> back, thus NODATACOW data will still hit disk as NODATACOW.
> >>
> >> Such hidden behavior along with always-reserve-data-space solves the
> >> problem pretty well.
> >
> > It doesn't solve the problem you reported in the rfc patch.
>
> You're right, it doesn't solve the problem at all.
> In fact, another bug caused my test script to pass even with some dirty
> pages unable to be flushed back.
>
> But it at least make sure all other pages reach disk as NODATACOW except
> the last page.
>
> >
> > What happens:
> >
> > 1) We have a file with a prealloc extent, that isn't shared
> >
> > 2) We have 0 bytes of available data space (or any amount less then
> > that of the buffered write size)
> >
> > 3) A buffered write happens that falls within a subrange of the prealloc 
> > extent.
> >     We can't reserve space, we do all those things at
> > btrfs_alloc_data_chunk_ondemand(), but we can't get any data space
> > released, since it's all allocated.
>
> At that time, we're already flushing all previously buffered write data.
>
> E.g. if we're writing into one 1M preallocated extent.
> The first 4K, we have no data space reserved, dirtied the page, prepare
> all delalloc.
>
> Then the 2nd 4K, we call btrfs_check_data_free_space(), as we're at low
> data free space already, we flush all inodes, including the previous 4K
> we just dirtied.
> Then the first 4K get written to disk NODATACOW, as expected.
>
> This loop happens until we reach the last page.

We fragment a buffered write into several chunks depending on its
size, the size of a page, the size of a pointer, etc.
We will hardly iterate for each page alone.
If your point is that btrfs_check_data_free_space() can be called
multiple times for a single buffered write, yes, that's true if it's
large than a certain threshold, but that will hardly be for each
individual page, typically it's much more than that.

>
> >     Therefore we fall back to nodatacow mode. We dirty the pages, mark
> > the range as dealloc, etc.
> >
> > 4) The reflink happens, for a subrange of the prealloc extent that
> > does not overlap the range of the buffered write.
>
> Just before the reflink, we only have 1 dirty page (the last page of
> that buffered write) doesn't reach disk yet.
>
> For the final page, we have no choice but do COW, and it fails with -ENOSPC.

Yes, isn't that what I said?

>
> However due to some other problem, the -ENOSPC doesn't reach user space
> at all.

Yes, isn't that what I said?

Yes... Same as the snapshotting case Robbie fixed. No news here.
Same as many other silent data loss issues I fixed with fsync a few
times sometime ago as well.

>
>
> >
> > 5) Some time after the reflink, writeback starts for the inode.
> >     During the writeback we fallback to COW mode, because the prealloc
> > extent is shared, even if the subrange of the buffered write does not
> > overlap the reflinked subrange.
> >     Now the write silently fails with -ENOSPC, and a user doesn't know
> > about it unless it does an fsync after that writeback, which will
> > report the error via filemap_check_wb_err().
> >
> >> We either:
> >> - reserve data space
> >>   Then no matter how it ends, we're OK, although it may end as CoW.
> >>
> >> - Failed to reserve data space
> >>   Writeback will be triggered anyway, no way to screw things around.
> >>
> >> Thus this workaround has nothing to fix, but only make certain NODATACOW
> >> reach disk as NODATACOW.
> >>
> >> It makes some NODATACOW behaves more correctly but won't fix any obvious
> >> bug.
> >>
> >> My personal take is to fix any strange behavior even it won't cause any
> >> problem, but the full inode writeback can be performance heavy.
> >>
> >> So my question is, do we really need this anyway?
> >
> > Do we need what? Your patch, that logic at
> > btrfs_alloc_data_chunk_ondemand(), something else?
>
> I meant the patch, but the deeper I dig into the problem, more problem I
> found.
>
> The patch is still needed, but there is a more important bug, that
> btrfs_run_delalloc_range() failure won't be reported in sync.

How could it report a failure?

sync calls sync(2), which returns void, meaning it always succeeds and
has no way to report errors.
See https://linux.die.net/man/2/sync

Error tracking and reporting was precisely one of main goals of
filemap_check_wb_err() and all the related work Jeff Layton did
sometime ago.

>
> The script here I'm using is:
> ------
> #!/bin/bash
>
> dev=/dev/test/test
> mnt=/mnt/btrfs
>
> #mkfs.btrfs -f $dev -b 1G > /dev/null
> #mount $dev $mnt -o nospace_cache
>
> umount $mnt &> /dev/null
> umount $dev &> /dev/null
>
> dmesg -C
> mkfs.btrfs -f $dev -b 512M > /dev/null
>
> mount $dev $mnt -o nospace_cache
>
> xfs_io -f -c "falloc 8k 64m" $mnt/file1
> xfs_io -f -c "pwrite 0 -b 4k 370M" $mnt/padding
>
> sync
> btrfs fi df $mnt --raw
>
> xfs_io -c "pwrite 1m 16m" $mnt/file1
> echo "nodatacow write finished" >> /dev/kmsg
> xfs_io -c "reflink $mnt/file1 8k 0 4k" $mnt/file1
> echo "reflink finished" >> /dev/kmsg
> sync
> echo "sync finished ret=$?" >> /dev/kmsg
> umount $dev
> ------
>
> As describe, the last write at 17821696 (17M - 4K) will fail due to ENOSPC.
> But the sync succeeded without reporting any problem.
>
> Thanks,
> Qu
>
> >
> > Thanks.
> >
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>>>
> >>>>> I don't recall starting transactions when running dealloc, and failed
> >>>>> to see where after a quick glance to cow_file_range()
> >>>>> and run_delalloc_nocow(). I'm assuming that 'at delalloc time' means
> >>>>> when starting writeback.
> >>>>>
> >>>>>>
> >>>>>> [CAUSE]
> >>>>>> This is due to the fact that btrfs can only do extent level share 
> >>>>>> check.
> >>>>>>
> >>>>>> Btrfs can only tell if an extent is shared, no matter if only part of 
> >>>>>> the
> >>>>>> extent is shared or not.
> >>>>>>
> >>>>>> So for above script we have:
> >>>>>> - fallocate
> >>>>>> - buffered write
> >>>>>>   If we don't have enough data space, we fall back to NOCOW check.
> >>>>>>   At this timming, the extent is not shared, we can skip data
> >>>>>>   reservation.
> >>>>>
> >>>>> But in the above example we don't fall to nocow mode when doing the
> >>>>> buffered write, as there's plenty of data space available (1Gb -
> >>>>> 24Kb).
> >>>>> You need to update the example.
> >>>> I have to admit that the core part is mostly based on the worst case
> >>>> *assumption*.
> >>>>
> >>>> I'll try to make the case convincing by making it fail directly.
> >>>
> >>> Great, thanks.
> >>>
> >>>>
> >>>>>
> >>>>>
> >>>>>> - reflink
> >>>>>>   Now part of the large preallocated extent is shared.
> >>>>>> - delalloc kicks in
> >>>>>
> >>>>> writeback kicks in
> >>>>>
> >>>>>>   For the NOCOW range, as the preallocated extent is shared, we need
> >>>>>>   to fall back to COW.
> >>>>>>
> >>>>>> [WORKAROUND]
> >>>>>> The workaround is to ensure any buffered write in the related extents
> >>>>>> (not the reflink source range) get flushed before reflink.
> >>>>>
> >>>>> not the reflink source range -> not just the reflink source range
> >>>>>
> >>>>>>
> >>>>>> However it's pretty expensive to do a comprehensive check.
> >>>>>> In the reproducer, the reflink source is just a part of a larger
> >>>>>
> >>>>> Again, the reproducer needs to be fixed (yes, I tested it even if it's
> >>>>> clear by looking at it that it doesn't trigger the nocow case).
> >>>>>
> >>>>>> preallocated extent, we need to flush all buffered write of that extent
> >>>>>> before reflink.
> >>>>>> Such backward search can be complex and we may not get much benefit 
> >>>>>> from
> >>>>>> it.
> >>>>>>
> >>>>>> So this patch will just try to flush the whole inode before reflink.
> >>>>>
> >>>>>
> >>>>>>
> >>>>>> Signed-off-by: Qu Wenruo <w...@suse.com>
> >>>>>> ---
> >>>>>> Reason for RFC:
> >>>>>> Flushing an inode just because it's a reflink source is definitely
> >>>>>> overkilling, but I don't have any better way to handle it.
> >>>>>>
> >>>>>> Any comment on this is welcomed.
> >>>>>> ---
> >>>>>>  fs/btrfs/ioctl.c | 22 ++++++++++++++++++++++
> >>>>>>  1 file changed, 22 insertions(+)
> >>>>>>
> >>>>>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> >>>>>> index 7755b503b348..8caa0edb6fbf 100644
> >>>>>> --- a/fs/btrfs/ioctl.c
> >>>>>> +++ b/fs/btrfs/ioctl.c
> >>>>>> @@ -3930,6 +3930,28 @@ static noinline int btrfs_clone_files(struct 
> >>>>>> file *file, struct file *file_src,
> >>>>>>                         return ret;
> >>>>>>         }
> >>>>>>
> >>>>>> +       /*
> >>>>>> +        * Workaround to make sure NOCOW buffered write reach disk as 
> >>>>>> NOCOW.
> >>>>>> +        *
> >>>>>> +        * Due to the limit of btrfs extent tree design, we can only 
> >>>>>> have
> >>>>>> +        * extent level share view. Any part of an extent is shared 
> >>>>>> then the
> >>>>>
> >>>>> Any -> If any
> >>>>>
> >>>>>> +        * whole extent is shared and any write into that extent needs 
> >>>>>> to fall
> >>>>>
> >>>>> is -> is considered
> >>>>>
> >>>>>> +        * back to COW.
> >>>>>
> >>>>> I would add, something like:
> >>>>>
> >>>>> That is, btrfs' back references do not have a block level granularity,
> >>>>> they work at the whole extent level.
> >>>>>
> >>>>>> +        *
> >>>>>> +        * NOCOW buffered write without data space reserved could to 
> >>>>>> lead to
> >>>>>> +        * either data space bytes_may_use underflow (kernel warning) 
> >>>>>> or ENOSPC
> >>>>>> +        * at delalloc time (transaction abort).
> >>>>>
> >>>>> I would omit the warning and transaction abort parts, that can change
> >>>>> any time. And we have that information in the changelog, so it's not
> >>>>> lost.
> >>>>>
> >>>>>> +        *
> >>>>>> +        * Here we take a shortcut by flush the whole inode. We could 
> >>>>>> do better
> >>>>>> +        * by finding all extents in that range and flush the space 
> >>>>>> referring
> >>>>>> +        * all those extents.
> >>>>>> +        * But that's too complex for such corner case.
> >>>>>> +        */
> >>>>>> +       filemap_flush(src->i_mapping);
> >>>>>> +       if (test_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
> >>>>>> +                    &BTRFS_I(src)->runtime_flags))
> >>>>>> +               filemap_flush(src->i_mapping);
> >>>>>
> >>>>> So a few comments here:
> >>>>>
> >>>>> - why just in the clone part? The dedupe side has the same problem, 
> >>>>> doesn't it?
> >>>>
> >>>> Right.
> >>>>
> >>>>>
> >>>>> - I would move such flushing to btrfs_remap_file_range_prep - this is
> >>>>> where we do the source and target range flush and wait.
> >>>>>
> >>>>> Can you turn the reproducer into an fstests case?
> >>>>
> >>>> Sure.
> >>>>
> >>>> Thanks for the info and all the comment,
> >>>> Qu
> >>>>
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>>> +
> >>>>>>         /*
> >>>>>>          * Lock destination range to serialize with concurrent 
> >>>>>> readpages() and
> >>>>>>          * source range to serialize with relocation.
> >>>>>> --
> >>>>>> 2.21.0
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >
> >
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

Reply via email to