On Mon, Jul 27, 2015 at 7:53 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote: > Hi Filipe,
Hi Qu, > > Sorry for the late reply after it is already merged, > but I'm a little concerned about the extra loop to find the first inc > delayed ref. > > It may take some extra time when there are a lot of delayed refs. > > What about allowing deleting the extent item at dec delayed ref time and > then add it back for later inc delayed refs? So, the reason I did it this way is simplicity - it's pretty much what the pre 4.2-rc1 [1] code did, but slightly more efficient because it iterates a linked list rather than a red black tree using rb_next(). So it seems somewhat odd that you're worrying about this after a functional fix when we pretty much always had this behavior in place... Given the complexity of what you propose, I would prefer if we have a benchmark that indeed shows this is a performance critical area (I don't think this list can get that huge, but I might be wrong). I also like the current approach of triggering a warning/bug_on/assert if the extent item isn't found in the extent tree - it helps detecting more quickly if a logic bug elsewhere exists. But anyway, I'm ok if you are willing to implement such approach if there's indeed a test/benchmark to justify more complex/optimized code. thanks [1] - https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/extent-tree.c?id=refs/tags/v4.1.3#n2327 > > Thanks, > Qu > > wrote on 2015/07/09 15:50 +0100: >> >> From: Filipe Manana <fdman...@suse.com> >> >> >> When we have an extent that got N references removed and N new references >> added in the same transaction, we must run the insertion of the references >> first because otherwise the last removed reference will remove the extent >> item from the extent tree, resulting in a failure for the insertions. >> >> This is a regression introduced in the 4.2-rc1 release and this fix just >> brings back the behaviour of selecting reference additions before any >> reference removals. >> >> The following test case for fstests reproduces the issue: >> >> seq=`basename $0` >> seqres=$RESULT_DIR/$seq >> echo "QA output created by $seq" >> tmp=/tmp/$$ >> status=1 # failure is the default! >> trap "_cleanup; exit \$status" 0 1 2 3 15 >> >> _cleanup() >> { >> _cleanup_flakey >> rm -f $tmp.* >> } >> >> # get standard environment, filters and checks >> . ./common/rc >> . ./common/filter >> . ./common/dmflakey >> >> # real QA test starts here >> _need_to_be_root >> _supported_fs btrfs >> _supported_os Linux >> _require_scratch >> _require_dm_flakey >> _require_cloner >> _require_metadata_journaling $SCRATCH_DEV >> >> rm -f $seqres.full >> >> _scratch_mkfs >>$seqres.full 2>&1 >> _init_flakey >> _mount_flakey >> >> # Create prealloc extent covering range [160K, 620K[ >> $XFS_IO_PROG -f -c "falloc 160K 460K" $SCRATCH_MNT/foo >> >> # Now write to the last 80K of the prealloc extent plus 40K to the >> unallocated >> # space that immediately follows it. This creates a new extent of 40K >> that spans >> # the range [620K, 660K[. >> $XFS_IO_PROG -c "pwrite -S 0xaa 540K 120K" $SCRATCH_MNT/foo | >> _filter_xfs_io >> >> # At this point, there are now 2 back references to the prealloc extent >> in our >> # extent tree. Both are for our file offset 160K and one relates to a >> file >> # extent item with a data offset of 0 and a length of 380K, while the >> other >> # relates to a file extent item with a data offset of 380K and a length >> of 80K. >> >> # Make sure everything done so far is durably persisted (all back >> references are >> # in the extent tree, etc). >> sync >> >> # Now clone all extents of our file that cover the offset 160K up to >> its eof >> # (660K at this point) into itself at offset 2M. This leaves a hole in >> the file >> # covering the range [660K, 2M[. The prealloc extent will now be >> referenced by >> # the file twice, once for offset 160K and once for offset 2M. The 40K >> extent >> # that follows the prealloc extent will also be referenced twice by our >> file, >> # once for offset 620K and once for offset 2M + 460K. >> $CLONER_PROG -s $((160 * 1024)) -d $((2 * 1024 * 1024)) -l 0 >> $SCRATCH_MNT/foo \ >> $SCRATCH_MNT/foo >> >> # Now create one new extent in our file with a size of 100Kb. It will >> span the >> # range [3M, 3M + 100K[. It also will cause creation of a hole spanning >> the >> # range [2M + 460K, 3M[. Our new file size is 3M + 100K. >> $XFS_IO_PROG -c "pwrite -S 0xbb 3M 100K" $SCRATCH_MNT/foo | >> _filter_xfs_io >> >> # At this point, there are now (in memory) 4 back references to the >> prealloc >> # extent. >> # >> # Two of them are for file offset 160K, related to file extent items >> # matching the file offsets 160K and 540K respectively, with data >> offsets of >> # 0 and 380K respectively, and with lengths of 380K and 80K >> respectively. >> # >> # The other two references are for file offset 2M, related to file >> extent items >> # matching the file offsets 2M and 2M + 380K respectively, with data >> offsets of >> # 0 and 380K respectively, and with lengths of 389K and 80K >> respectively. >> # >> # The 40K extent has 2 back references, one for file offset 620K and >> the other >> # for file offset 2M + 460K. >> # >> # The 100K extent has a single back reference and it relates to file >> offset 3M. >> >> # Now clone our 100K extent into offset 600K. That offset covers the >> last 20K >> # of the prealloc extent, the whole 40K extent and 40K of the hole >> starting at >> # offset 660K. >> $CLONER_PROG -s $((3 * 1024 * 1024)) -d $((600 * 1024)) -l $((100 * >> 1024)) \ >> $SCRATCH_MNT/foo $SCRATCH_MNT/foo >> >> # At this point there's only one reference to the 40K extent, at file >> offset >> # 2M + 460K, we have 4 references for the prealloc extent (2 for file >> offset >> # 160K and 2 for file offset 2M) and 2 references for the 100K extent >> (1 for >> # file offset 3M and a new one for file offset 600K). >> >> # Now fsync our file to make all its new data and metadata updates are >> durably >> # persisted and present if a power failure/crash happens after a >> successful >> # fsync and before the next transaction commit. >> $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foo >> >> echo "File digest before power failure:" >> md5sum $SCRATCH_MNT/foo | _filter_scratch >> >> # Silently drop all writes and ummount to simulate a crash/power >> failure. >> _load_flakey_table $FLAKEY_DROP_WRITES >> _unmount_flakey >> >> # Allow writes again, mount to trigger log replay and validate file >> contents. >> # During log replay, the btrfs delayed references implementation used >> to run the >> # deletion of back references before the addition of new back >> references, which >> # made the addition fail as it didn't find the key in the extent tree >> that it >> # was looking for. The failure triggered by this test was related to >> the 40K >> # extent, which got 1 reference dropped and 1 reference added during >> the fsync >> # log replay - when running the delayed references at transaction >> commit time, >> # btrfs was applying the deletion before the insertion, resulting in a >> failure >> # of the insertion that ended up turning the fs into read-only mode. >> _load_flakey_table $FLAKEY_ALLOW_WRITES >> _mount_flakey >> >> echo "File digest after log replay:" >> md5sum $SCRATCH_MNT/foo | _filter_scratch >> >> _unmount_flakey >> >> status=0 >> exit >> >> This issue turned the filesystem into read-only mode (current transaction >> aborted) and produced the following traces: >> >> [ 8247.578385] ------------[ cut here ]------------ >> [ 8247.579947] WARNING: CPU: 0 PID: 11341 at >> fs/btrfs/extent-tree.c:1547 lookup_inline_extent_backref+0x17d/0x45d >> [btrfs]() >> (...) >> [ 8247.601697] Call Trace: >> [ 8247.602222] [<ffffffff8145f077>] dump_stack+0x4f/0x7b >> [ 8247.604320] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb >> [ 8247.605488] [<ffffffffa0506c8d>] ? >> lookup_inline_extent_backref+0x17d/0x45d [btrfs] >> [ 8247.608226] [<ffffffffa0506c8d>] >> lookup_inline_extent_backref+0x17d/0x45d [btrfs] >> [ 8247.617061] [<ffffffffa0507957>] >> insert_inline_extent_backref+0x41/0xb2 [btrfs] >> [ 8247.621856] [<ffffffffa0507c4f>] __btrfs_inc_extent_ref+0x8c/0x20a >> [btrfs] >> [ 8247.624366] [<ffffffffa050ee60>] >> __btrfs_run_delayed_refs+0xb0c/0xd49 [btrfs] >> [ 8247.626176] [<ffffffffa0510dcd>] btrfs_run_delayed_refs+0x6d/0x1d4 >> [btrfs] >> [ 8247.627435] [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6 >> [ 8247.628531] [<ffffffffa0520482>] >> btrfs_commit_transaction+0x4c/0xa20 [btrfs] >> (...) >> [ 8247.648430] ---[ end trace 2461e55f92c2ac2d ]--- >> >> [ 8247.727263] WARNING: CPU: 3 PID: 11341 at >> fs/btrfs/extent-tree.c:2771 btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs]() >> [ 8247.728954] BTRFS: Transaction aborted (error -5) >> (...) >> [ 8247.760866] Call Trace: >> [ 8247.761534] [<ffffffff8145f077>] dump_stack+0x4f/0x7b >> [ 8247.764271] [<ffffffff8104b3b0>] warn_slowpath_common+0xa1/0xbb >> [ 8247.767582] [<ffffffffa0510e04>] ? >> btrfs_run_delayed_refs+0xa4/0x1d4 [btrfs] >> [ 8247.769373] [<ffffffff8104b410>] warn_slowpath_fmt+0x46/0x48 >> [ 8247.770836] [<ffffffffa0510e04>] btrfs_run_delayed_refs+0xa4/0x1d4 >> [btrfs] >> [ 8247.772532] [<ffffffff81155c9b>] ? __cache_free+0x4a7/0x4b6 >> [ 8247.773664] [<ffffffffa0520482>] >> btrfs_commit_transaction+0x4c/0xa20 [btrfs] >> [ 8247.775047] [<ffffffff81087310>] ? trace_hardirqs_on+0xd/0xf >> [ 8247.776176] [<ffffffff81155dd5>] ? kmem_cache_free+0x12b/0x189 >> [ 8247.777427] [<ffffffffa055a920>] >> btrfs_recover_log_trees+0x2da/0x33d [btrfs] >> [ 8247.778575] [<ffffffffa055898e>] ? replay_one_extent+0x4fc/0x4fc >> [btrfs] >> [ 8247.779838] [<ffffffffa051e265>] open_ctree+0x1cc0/0x201a [btrfs] >> [ 8247.781020] [<ffffffff81120f48>] ? register_shrinker+0x56/0x81 >> [ 8247.782285] [<ffffffffa04fb12c>] btrfs_mount+0x5f0/0x734 [btrfs] >> (...) >> [ 8247.793394] ---[ end trace 2461e55f92c2ac2e ]--- >> [ 8247.794276] BTRFS: error (device dm-0) in >> btrfs_run_delayed_refs:2771: errno=-5 IO failure >> [ 8247.797335] BTRFS: error (device dm-0) in btrfs_replay_log:2375: >> errno=-5 IO failure (Failed to recover log tree) >> >> Fixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root >> in ref_head.") >> Signed-off-by: Filipe Manana <fdman...@suse.com> >> >> --- >> fs/btrfs/extent-tree.c | 13 +++++++++++++ >> 1 file changed, 13 insertions(+) >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index 1c2bd17..bccceea5 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -2296,9 +2296,22 @@ static int run_one_delayed_ref(struct >> btrfs_trans_handle *trans, >> static inline struct btrfs_delayed_ref_node * >> select_delayed_ref(struct btrfs_delayed_ref_head *head) >> { >> + struct btrfs_delayed_ref_node *ref; >> + >> if (list_empty(&head->ref_list)) >> return NULL; >> >> + /* >> + * Select a delayed ref of type BTRFS_ADD_DELAYED_REF first. >> + * This is to prevent a ref count from going down to zero, which >> deletes >> + * the extent item from the extent tree, when here still are >> references >> + * to add, which would fail because they would not find the extent >> item. >> + */ >> + list_for_each_entry(ref, &head->ref_list, list) { >> + if (ref->action == BTRFS_ADD_DELAYED_REF) >> + return ref; >> + } >> + >> return list_entry(head->ref_list.next, struct >> btrfs_delayed_ref_node, >> list); >> } >> > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html