Re: [PATCH] Btrfs: fix order by which delayed references are run

Filipe Manana Mon, 27 Jul 2015 01:26:42 -0700

On Mon, Jul 27, 2015 at 7:53 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
> Hi Filipe,


Hi Qu,

>
> Sorry for the late reply after it is already merged,
> but I'm a little concerned about the extra loop to find the first inc
> delayed ref.
>
> It may take some extra time when there are a lot of delayed refs.
>
> What about allowing deleting the extent item at dec delayed ref time and
> then add it back for later inc delayed refs?

So, the reason I did it this way is simplicity - it's pretty much what
the pre 4.2-rc1 [1] code did, but slightly more efficient because it
iterates a linked list rather than a red black tree using rb_next().
So it seems somewhat odd that you're worrying about this after a
functional fix when we pretty much always had this behavior in
place...

Given the complexity of what you propose, I would prefer if we have a
benchmark that indeed shows this is a performance critical area (I
don't think this list can get that huge, but I might be wrong).

I also like the current approach of triggering a warning/bug_on/assert
if the extent item isn't found in the extent tree - it helps detecting
more quickly if a logic bug elsewhere exists.

But anyway, I'm ok if you are willing to implement such approach if
there's indeed a test/benchmark to justify more complex/optimized
code.

thanks

[1] - 
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/extent-tree.c?id=refs/tags/v4.1.3#n2327

>
> Thanks,
> Qu
>
>  wrote on 2015/07/09 15:50 +0100:
>>
>> From: Filipe Manana <fdman...@suse.com>
>>
>>
>> When we have an extent that got N references removed and N new references
>> added in the same transaction, we must run the insertion of the references
>> first because otherwise the last removed reference will remove the extent
>> item from the extent tree, resulting in a failure for the insertions.
>>
>> This is a regression introduced in the 4.2-rc1 release and this fix just
>> brings back the behaviour of selecting reference additions before any
>> reference removals.
>>
>> The following test case for fstests reproduces the issue:
>>
>>    seq=`basename $0`
>>    seqres=$RESULT_DIR/$seq
>>    echo "QA output created by $seq"
>>    tmp=/tmp/$$
>>    status=1     # failure is the default!
>>    trap "_cleanup; exit \$status" 0 1 2 3 15
>>
>>    _cleanup()
>>    {
>>        _cleanup_flakey
>>        rm -f $tmp.*
>>    }
>>
>>    # get standard environment, filters and checks
>>    . ./common/rc
>>    . ./common/filter
>>    . ./common/dmflakey
>>
>>    # real QA test starts here
>>    _need_to_be_root
>>    _supported_fs btrfs
>>    _supported_os Linux
>>    _require_scratch
>>    _require_dm_flakey
>>    _require_cloner
>>    _require_metadata_journaling $SCRATCH_DEV
>>
>>    rm -f $seqres.full
>>
>>    _scratch_mkfs >>$seqres.full 2>&1
>>    _init_flakey
>>    _mount_flakey
>>
>>    # Create prealloc extent covering range [160K, 620K[
>>    $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo
>>
>>    # Now write to the last 80K of the prealloc extent plus 40K to the
>> unallocated
>>    # space that immediately follows it. This creates a new extent of 40K
>> that spans
>>    # the range [620K, 660K[.
>>    $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo |
>> _filter_xfs_io
>>
>>    # At this point, there are now 2 back references to the prealloc extent
>> in our
>>    # extent tree. Both are for our file offset 160K and one relates to a
>> file
>>    # extent item with a data offset of 0 and a length of 380K, while the
>> other
>>    # relates to a file extent item with a data offset of 380K and a length
>> of 80K.
>>
>>    # Make sure everything done so far is durably persisted (all back
>> references are
>>    # in the extent tree, etc).
>>    sync
>>
>>    # Now clone all extents of our file that cover the offset 160K up to
>> its eof
>>    # (660K at this point) into itself at offset 2M. This leaves a hole in
>> the file
>>    # covering the range [660K, 2M[. The prealloc extent will now be
>> referenced by
>>    # the file twice, once for offset 160K and once for offset 2M. The 40K
>> extent
>>    # that follows the prealloc extent will also be referenced twice by our
>> file,
>>    # once for offset 620K and once for offset 2M + 460K.
>>    $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0
>> $SCRATCH_MNT/foo \
>>         $SCRATCH_MNT/foo
>>
>>    # Now create one new extent in our file with a size of 100Kb. It will
>> span the
>>    # range [3M, 3M + 100K[. It also will cause creation of a hole spanning
>> the
>>    # range [2M + 460K, 3M[. Our new file size is 3M + 100K.
>>    $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo |
>> _filter_xfs_io
>>
>>    # At this point, there are now (in memory) 4 back references to the
>> prealloc
>>    # extent.
>>    #
>>    # Two of them are for file offset 160K, related to file extent items
>>    # matching the file offsets 160K and 540K respectively, with data
>> offsets of
>>    # 0 and 380K respectively, and with lengths of 380K and 80K
>> respectively.
>>    #
>>    # The other two references are for file offset 2M, related to file
>> extent items
>>    # matching the file offsets 2M and 2M + 380K respectively, with data
>> offsets of
>>    # 0 and 380K respectively, and with lengths of 389K and 80K
>> respectively.
>>    #
>>    # The 40K extent has 2 back references, one for file offset 620K and
>> the other
>>    # for file offset 2M + 460K.
>>    #
>>    # The 100K extent has a single back reference and it relates to file
>> offset 3M.
>>
>>    # Now clone our 100K extent into offset 600K. That offset covers the
>> last 20K
>>    # of the prealloc extent, the whole 40K extent and 40K of the hole
>> starting at
>>    # offset 660K.
>>    $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 *
>> 1024)) \
>>        $SCRATCH_MNT/foo $SCRATCH_MNT/foo
>>
>>    # At this point there's only one reference to the 40K extent, at file
>> offset
>>    # 2M + 460K, we have 4 references for the prealloc extent (2 for file
>> offset
>>    # 160K and 2 for file offset 2M) and 2 references for the 100K extent
>> (1 for
>>    # file offset 3M and a new one for file offset 600K).
>>
>>    # Now fsync our file to make all its new data and metadata updates are
>> durably
>>    # persisted and present if a power failure/crash happens after a
>> successful
>>    # fsync and before the next transaction commit.
>>    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo
>>
>>    echo "File digest before power failure:"
>>    md5sum $SCRATCH_MNT/foo | _filter_scratch
>>
>>    # Silently drop all writes and ummount to simulate a crash/power
>> failure.
>>    _load_flakey_table $FLAKEY_DROP_WRITES
>>    _unmount_flakey
>>
>>    # Allow writes again, mount to trigger log replay and validate file
>> contents.
>>    # During log replay, the btrfs delayed references implementation used
>> to run the
>>    # deletion of back references before the addition of new back
>> references, which
>>    # made the addition fail as it didn't find the key in the extent tree
>> that it
>>    # was looking for. The failure triggered by this test was related to
>> the 40K
>>    # extent, which got 1 reference dropped and 1 reference added during
>> the fsync
>>    # log replay - when running the delayed references at transaction
>> commit time,
>>    # btrfs was applying the deletion before the insertion, resulting in a
>> failure
>>    # of the insertion that ended up turning the fs into read-only mode.
>>    _load_flakey_table $FLAKEY_ALLOW_WRITES
>>    _mount_flakey
>>
>>    echo "File digest after log replay:"
>>    md5sum $SCRATCH_MNT/foo | _filter_scratch
>>
>>    _unmount_flakey
>>
>>    status=0
>>    exit
>>
>> This issue turned the filesystem into read-only mode (current transaction
>> aborted) and produced the following traces:
>>
>>    [ 8247.578385] ------------[ cut here ]------------
>>    [ 8247.579947] WARNING: CPU: 0 PID: 11341 at
>> fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d
>> [btrfs]()
>>    (...)
>>    [ 8247.601697] Call Trace:
>>    [ 8247.602222]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
>>    [ 8247.604320]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
>>    [ 8247.605488]  [<ffffffffa0506c8d>] ?
>> lookup_inline_extent_backref+0x17d/0x45d [btrfs]
>>    [ 8247.608226]  [<ffffffffa0506c8d>]
>> lookup_inline_extent_backref+0x17d/0x45d [btrfs]
>>    [ 8247.617061]  [<ffffffffa0507957>]
>> insert_inline_extent_backref+0x41/0xb2 [btrfs]
>>    [ 8247.621856]  [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a
>> [btrfs]
>>    [ 8247.624366]  [<ffffffffa050ee60>]
>> __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs]
>>    [ 8247.626176]  [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4
>> [btrfs]
>>    [ 8247.627435]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
>>    [ 8247.628531]  [<ffffffffa0520482>]
>> btrfs_commit_transaction+0x4c/0xa20 [btrfs]
>>    (...)
>>    [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]---
>>
>>    [ 8247.727263] WARNING: CPU: 3 PID: 11341 at
>> fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]()
>>    [ 8247.728954] BTRFS: Transaction aborted (error -5)
>>    (...)
>>    [ 8247.760866] Call Trace:
>>    [ 8247.761534]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
>>    [ 8247.764271]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
>>    [ 8247.767582]  [<ffffffffa0510e04>] ?
>> btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
>>    [ 8247.769373]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
>>    [ 8247.770836]  [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4
>> [btrfs]
>>    [ 8247.772532]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
>>    [ 8247.773664]  [<ffffffffa0520482>]
>> btrfs_commit_transaction+0x4c/0xa20 [btrfs]
>>    [ 8247.775047]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
>>    [ 8247.776176]  [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189
>>    [ 8247.777427]  [<ffffffffa055a920>]
>> btrfs_recover_log_trees+0x2da/0x33d [btrfs]
>>    [ 8247.778575]  [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc
>> [btrfs]
>>    [ 8247.779838]  [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs]
>>    [ 8247.781020]  [<ffffffff81120f48>] ? register_shrinker+0x56/0x81
>>    [ 8247.782285]  [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs]
>>    (...)
>>    [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]---
>>    [ 8247.794276] BTRFS: error (device dm-0) in
>> btrfs_run_delayed_refs:2771: errno=-5 IO failure
>>    [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375:
>> errno=-5 IO failure (Failed to recover log tree)
>>
>> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root
>> in ref_head.")
>> Signed-off-by: Filipe Manana <fdman...@suse.com>
>>
>> ---
>>   fs/btrfs/extent-tree.c | 13 +++++++++++++
>>   1 file changed, 13 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 1c2bd17..bccceea5 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -2296,9 +2296,22 @@ static int run_one_delayed_ref(struct
>> btrfs_trans_handle *trans,
>>   static inline struct btrfs_delayed_ref_node *
>>   select_delayed_ref(struct btrfs_delayed_ref_head *head)
>>   {
>> +       struct btrfs_delayed_ref_node *ref;
>> +
>>         if (list_empty(&head->ref_list))
>>                 return NULL;
>>
>> +       /*
>> +        * Select a delayed ref of type BTRFS_ADD_DELAYED_REF first.
>> +        * This is to prevent a ref count from going down to zero, which
>> deletes
>> +        * the extent item from the extent tree, when here still are
>> references
>> +        * to add, which would fail because they would not find the extent
>> item.
>> +        */
>> +       list_for_each_entry(ref, &head->ref_list, list) {
>> +               if (ref->action == BTRFS_ADD_DELAYED_REF)
>> +                       return ref;
>> +       }
>> +
>>         return list_entry(head->ref_list.next, struct
>> btrfs_delayed_ref_node,
>>                           list);
>>   }
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: fix order by which delayed references are run

Reply via email to