Filipe Manana wrote on 2015/07/27 09:26 +0100:
On Mon, Jul 27, 2015 at 7:53 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
Hi Filipe,

Hi Qu,


Sorry for the late reply after it is already merged,
but I'm a little concerned about the extra loop to find the first inc
delayed ref.

It may take some extra time when there are a lot of delayed refs.

What about allowing deleting the extent item at dec delayed ref time and
then add it back for later inc delayed refs?

So, the reason I did it this way is simplicity - it's pretty much what
the pre 4.2-rc1 [1] code did, but slightly more efficient because it
iterates a linked list rather than a red black tree using rb_next().
So it seems somewhat odd that you're worrying about this after a
functional fix when we pretty much always had this behavior in
place...

Given the complexity of what you propose, I would prefer if we have a
benchmark that indeed shows this is a performance critical area (I
don't think this list can get that huge, but I might be wrong).
I agree with you that this may not be a performance hotspot.

And in fact, before the qgroup patchset, we tested the list implement of
delayed ref for performance regression, and the difference is so little that can be easily ignored.
So I think even we skip the loop, the performance will only change little.

One of the main purpose to change the delayed ref to list, is to make it a little more straightforward, without the hard to understand/review rb tree compare.
So I hope there is another more simple and straightforward one.

But as you mentioned, your fix is already simple enough, and after a short code search, my idea to fix may not be as simple as I thought.

So sorry for the disturb, and thanks for your fix again.

Thanks,
Qu


I also like the current approach of triggering a warning/bug_on/assert
if the extent item isn't found in the extent tree - it helps detecting
more quickly if a logic bug elsewhere exists.

But anyway, I'm ok if you are willing to implement such approach if
there's indeed a test/benchmark to justify more complex/optimized
code.

thanks

[1] - 
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/extent-tree.c?id=refs/tags/v4.1.3#n2327


Thanks,
Qu

  wrote on 2015/07/09 15:50 +0100:

From: Filipe Manana <fdman...@suse.com>


When we have an extent that got N references removed and N new references
added in the same transaction, we must run the insertion of the references
first because otherwise the last removed reference will remove the extent
item from the extent tree, resulting in a failure for the insertions.

This is a regression introduced in the 4.2-rc1 release and this fix just
brings back the behaviour of selecting reference additions before any
reference removals.

The following test case for fstests reproduces the issue:

    seq=`basename $0`
    seqres=$RESULT_DIR/$seq
    echo "QA output created by $seq"
    tmp=/tmp/$$
    status=1     # failure is the default!
    trap "_cleanup; exit \$status" 0 1 2 3 15

    _cleanup()
    {
        _cleanup_flakey
        rm -f $tmp.*
    }

    # get standard environment, filters and checks
    . ./common/rc
    . ./common/filter
    . ./common/dmflakey

    # real QA test starts here
    _need_to_be_root
    _supported_fs btrfs
    _supported_os Linux
    _require_scratch
    _require_dm_flakey
    _require_cloner
    _require_metadata_journaling $SCRATCH_DEV

    rm -f $seqres.full

    _scratch_mkfs >>$seqres.full 2>&1
    _init_flakey
    _mount_flakey

    # Create prealloc extent covering range [160K, 620K[
    $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo

    # Now write to the last 80K of the prealloc extent plus 40K to the
unallocated
    # space that immediately follows it. This creates a new extent of 40K
that spans
    # the range [620K, 660K[.
    $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo |
_filter_xfs_io

    # At this point, there are now 2 back references to the prealloc extent
in our
    # extent tree. Both are for our file offset 160K and one relates to a
file
    # extent item with a data offset of 0 and a length of 380K, while the
other
    # relates to a file extent item with a data offset of 380K and a length
of 80K.

    # Make sure everything done so far is durably persisted (all back
references are
    # in the extent tree, etc).
    sync

    # Now clone all extents of our file that cover the offset 160K up to
its eof
    # (660K at this point) into itself at offset 2M. This leaves a hole in
the file
    # covering the range [660K, 2M[. The prealloc extent will now be
referenced by
    # the file twice, once for offset 160K and once for offset 2M. The 40K
extent
    # that follows the prealloc extent will also be referenced twice by our
file,
    # once for offset 620K and once for offset 2M + 460K.
    $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0
$SCRATCH_MNT/foo \
         $SCRATCH_MNT/foo

    # Now create one new extent in our file with a size of 100Kb. It will
span the
    # range [3M, 3M + 100K[. It also will cause creation of a hole spanning
the
    # range [2M + 460K, 3M[. Our new file size is 3M + 100K.
    $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo |
_filter_xfs_io

    # At this point, there are now (in memory) 4 back references to the
prealloc
    # extent.
    #
    # Two of them are for file offset 160K, related to file extent items
    # matching the file offsets 160K and 540K respectively, with data
offsets of
    # 0 and 380K respectively, and with lengths of 380K and 80K
respectively.
    #
    # The other two references are for file offset 2M, related to file
extent items
    # matching the file offsets 2M and 2M + 380K respectively, with data
offsets of
    # 0 and 380K respectively, and with lengths of 389K and 80K
respectively.
    #
    # The 40K extent has 2 back references, one for file offset 620K and
the other
    # for file offset 2M + 460K.
    #
    # The 100K extent has a single back reference and it relates to file
offset 3M.

    # Now clone our 100K extent into offset 600K. That offset covers the
last 20K
    # of the prealloc extent, the whole 40K extent and 40K of the hole
starting at
    # offset 660K.
    $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 *
1024)) \
        $SCRATCH_MNT/foo $SCRATCH_MNT/foo

    # At this point there's only one reference to the 40K extent, at file
offset
    # 2M + 460K, we have 4 references for the prealloc extent (2 for file
offset
    # 160K and 2 for file offset 2M) and 2 references for the 100K extent
(1 for
    # file offset 3M and a new one for file offset 600K).

    # Now fsync our file to make all its new data and metadata updates are
durably
    # persisted and present if a power failure/crash happens after a
successful
    # fsync and before the next transaction commit.
    $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo

    echo "File digest before power failure:"
    md5sum $SCRATCH_MNT/foo | _filter_scratch

    # Silently drop all writes and ummount to simulate a crash/power
failure.
    _load_flakey_table $FLAKEY_DROP_WRITES
    _unmount_flakey

    # Allow writes again, mount to trigger log replay and validate file
contents.
    # During log replay, the btrfs delayed references implementation used
to run the
    # deletion of back references before the addition of new back
references, which
    # made the addition fail as it didn't find the key in the extent tree
that it
    # was looking for. The failure triggered by this test was related to
the 40K
    # extent, which got 1 reference dropped and 1 reference added during
the fsync
    # log replay - when running the delayed references at transaction
commit time,
    # btrfs was applying the deletion before the insertion, resulting in a
failure
    # of the insertion that ended up turning the fs into read-only mode.
    _load_flakey_table $FLAKEY_ALLOW_WRITES
    _mount_flakey

    echo "File digest after log replay:"
    md5sum $SCRATCH_MNT/foo | _filter_scratch

    _unmount_flakey

    status=0
    exit

This issue turned the filesystem into read-only mode (current transaction
aborted) and produced the following traces:

    [ 8247.578385] ------------[ cut here ]------------
    [ 8247.579947] WARNING: CPU: 0 PID: 11341 at
fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d
[btrfs]()
    (...)
    [ 8247.601697] Call Trace:
    [ 8247.602222]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
    [ 8247.604320]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
    [ 8247.605488]  [<ffffffffa0506c8d>] ?
lookup_inline_extent_backref+0x17d/0x45d [btrfs]
    [ 8247.608226]  [<ffffffffa0506c8d>]
lookup_inline_extent_backref+0x17d/0x45d [btrfs]
    [ 8247.617061]  [<ffffffffa0507957>]
insert_inline_extent_backref+0x41/0xb2 [btrfs]
    [ 8247.621856]  [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a
[btrfs]
    [ 8247.624366]  [<ffffffffa050ee60>]
__btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs]
    [ 8247.626176]  [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4
[btrfs]
    [ 8247.627435]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
    [ 8247.628531]  [<ffffffffa0520482>]
btrfs_commit_transaction+0x4c/0xa20 [btrfs]
    (...)
    [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]---

    [ 8247.727263] WARNING: CPU: 3 PID: 11341 at
fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]()
    [ 8247.728954] BTRFS: Transaction aborted (error -5)
    (...)
    [ 8247.760866] Call Trace:
    [ 8247.761534]  [<ffffffff8145f077>] dump_stack+0x4f/0x7b
    [ 8247.764271]  [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb
    [ 8247.767582]  [<ffffffffa0510e04>] ?
btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]
    [ 8247.769373]  [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48
    [ 8247.770836]  [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4
[btrfs]
    [ 8247.772532]  [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6
    [ 8247.773664]  [<ffffffffa0520482>]
btrfs_commit_transaction+0x4c/0xa20 [btrfs]
    [ 8247.775047]  [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf
    [ 8247.776176]  [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189
    [ 8247.777427]  [<ffffffffa055a920>]
btrfs_recover_log_trees+0x2da/0x33d [btrfs]
    [ 8247.778575]  [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc
[btrfs]
    [ 8247.779838]  [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs]
    [ 8247.781020]  [<ffffffff81120f48>] ? register_shrinker+0x56/0x81
    [ 8247.782285]  [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs]
    (...)
    [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]---
    [ 8247.794276] BTRFS: error (device dm-0) in
btrfs_run_delayed_refs:2771: errno=-5 IO failure
    [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375:
errno=-5 IO failure (Failed to recover log tree)

Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root
in ref_head.")
Signed-off-by: Filipe Manana <fdman...@suse.com>

---
   fs/btrfs/extent-tree.c | 13 +++++++++++++
   1 file changed, 13 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1c2bd17..bccceea5 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2296,9 +2296,22 @@ static int run_one_delayed_ref(struct
btrfs_trans_handle *trans,
   static inline struct btrfs_delayed_ref_node *
   select_delayed_ref(struct btrfs_delayed_ref_head *head)
   {
+       struct btrfs_delayed_ref_node *ref;
+
         if (list_empty(&head->ref_list))
                 return NULL;

+       /*
+        * Select a delayed ref of type BTRFS_ADD_DELAYED_REF first.
+        * This is to prevent a ref count from going down to zero, which
deletes
+        * the extent item from the extent tree, when here still are
references
+        * to add, which would fail because they would not find the extent
item.
+        */
+       list_for_each_entry(ref, &head->ref_list, list) {
+               if (ref->action == BTRFS_ADD_DELAYED_REF)
+                       return ref;
+       }
+
         return list_entry(head->ref_list.next, struct
btrfs_delayed_ref_node,
                           list);
   }


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to