Le 2015-10-22 19:03, Filipe Manana a écrit :
On Thu, Oct 22, 2015 at 3:58 PM, Stéphane Lesimple <[email protected]> wrote:Le 2015-10-22 11:47, Filipe Manana a écrit :On Thu, Oct 22, 2015 at 10:43 AM, Filipe Manana <[email protected]> wrote:On Thu, Oct 22, 2015 at 10:32 AM, Qu Wenruo <[email protected]>wrote:wrote on 2015/10/22 09:47 +0100:From: Filipe Manana <[email protected]> In the kernel 4.2 merge window we had a refactoring/rework of the delayed references implementation in order to fix certain problems with qgroups.However that rework introduced one more regression that leads to thefollowing trace when running delayed references for metadata: [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832![35908.065201] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_genericxor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc psmouse i2 [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: G W 4.3.0-rc5-btrfs-next-17+ #1[35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOSrel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper[btrfs][35908.065201] task: ffff880114b7d780 ti: ffff88010c4c8000 task.ti:ffff88010c4c8000[35908.065201] RIP: 0010:[<ffffffffa04928b5>] [<ffffffffa04928b5>]insert_inline_extent_backref+0x52/0xb1 [btrfs] [35908.065201] RSP: 0018:ffff88010c4cbb08 EFLAGS: 00010293 [35908.065201] RAX: 0000000000000000 RBX: ffff88008a661000 RCX: 0000000000000000 [35908.065201] RDX: ffffffffa04dd58f RSI: 0000000000000001 RDI: 0000000000000000 [35908.065201] RBP: ffff88010c4cbb40 R08: 0000000000001000 R09: ffff88010c4cb9f8 [35908.065201] R10: 0000000000000000 R11: 000000000000002c R12: 0000000000000000 [35908.065201] R13: ffff88020a74c578 R14: 0000000000000000 R15: 0000000000000000[35908.065201] FS: 0000000000000000(0000) GS:ffff88023edc0000(0000)knlGS:0000000000000000 [35908.065201] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [35908.065201] CR2: 00000000015e8708 CR3: 0000000102185000 CR4: 00000000000006e0 [35908.065201] Stack: [35908.065201] ffff88010c4cbb18 0000000000000f37 ffff88020a74c578 ffff88015a408000 [35908.065201] ffff880154a44000 0000000000000000 0000000000000005 ffff88010c4cbbd8 [35908.065201] ffffffffa0492b9a 0000000000000005 0000000000000000 0000000000000000 [35908.065201] Call Trace:[35908.065201] [<ffffffffa0492b9a>] __btrfs_inc_extent_ref+0x8b/0x208[btrfs] [35908.065201] [<ffffffffa0497117>] ? __btrfs_run_delayed_refs+0x4d4/0xd33 [btrfs] [35908.065201] [<ffffffffa049773d>] __btrfs_run_delayed_refs+0xafa/0xd33 [btrfs] [35908.065201] [<ffffffffa04a976a>] ? join_transaction.isra.10+0x25/0x41f [btrfs] [35908.065201] [<ffffffffa04a97ed>] ? join_transaction.isra.10+0xa8/0x41f [btrfs][35908.065201] [<ffffffffa049914d>] btrfs_run_delayed_refs+0x75/0x1dd[btrfs][35908.065201] [<ffffffffa04992f1>] delayed_ref_async_start+0x3c/0x7b[btrfs][35908.065201] [<ffffffffa04d4b4f>] normal_work_helper+0x14c/0x32a[btrfs][35908.065201] [<ffffffffa04d4e93>] btrfs_extent_refs_helper+0x12/0x14[btrfs] [35908.065201] [<ffffffff81063b23>] process_one_work+0x24a/0x4ac [35908.065201] [<ffffffff81064285>] worker_thread+0x206/0x2c2 [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [35908.065201] [<ffffffff8106407f>] ? rescuer_thread+0x2cb/0x2cb [35908.065201] [<ffffffff8106904d>] kthread+0xef/0xf7 [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24 [35908.065201] [<ffffffff8147d10f>] ret_from_fork+0x3f/0x70 [35908.065201] [<ffffffff81068f5e>] ? kthread_parkme+0x24/0x24[35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8488d 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 7702 <0f> 0b 4c 8b 45 30 8b 4d 28 45 31 [35908.065201] RIP [<ffffffffa04928b5>] insert_inline_extent_backref+0x52/0xb1 [btrfs] [35908.065201] RSP <ffff88010c4cbb08> [35908.310885] ---[ end trace fe4299baf0666457 ]---This happens because the new delayed references code no longer merges delayed references that have different sequence values. The followingsteps are an example sequence leading to this issue: 1) Transaction N starts, fs_info->tree_mod_seq has value 0;2) Extent buffer (btree node) A is allocated, delayed reference Ref1for bytenr A is created, with a value of 1 and a seq value of 0; 3) fs_info->tree_mod_seq is incremented to 1;4) Extent buffer A is deleted through btrfs_del_items(), which calls btrfs_del_leaf(), which in turn calls btrfs_free_tree_block(). The later returns the metadata extent associated to extent buffer A to the free space cache (the range is not pinned), because the extent buffer was created in the current transaction (N) and writebackneverhappened for the extent buffer (flag BTRFS_HEADER_FLAG_WRITTEN notset in the extent buffer).This creates the delayed reference Ref2 for bytenr A, with a valueof -1 and a seq value of 1;5) Delayed reference Ref2 is not merged with Ref1 when we create it,because they have different sequence numbers (decided at add_delayed_ref_tail_merge()); 6) fs_info->tree_mod_seq is incremented to 2; 7) Some task attempts to allocate a new extent buffer (done atextent-tree.c:find_free_extent()), but due to heavy fragmentation and running low on metadata space the clustered allocation failsand we fall back to unclustered allocation, which finds the extent at offset A, so a new extent buffer at offset A is allocated.This creates delayed reference Ref3 for bytenr A, with a value of-1 and a seq value of 2;8) Ref3 is not merged neither with Ref2 nor Ref1, again because theyall have different seq values; 9) We start running the delayed references (__btrfs_run_delayed_refs()); 10) The delayed Ref1 is the first one being applied, which ends up creating an inline extent backref in the extent tree;10) Next the delayed reference Ref3 is selected for execution, and not Ref2, because select_delayed_ref() always gives a preference forpositive references (that have an action of BTRFS_ADD_DELAYED_REF);11) When running Ref3 we encounter alreay the inline extent backref in the extent tree at insert_inline_extent_backref(), which makesus hit the following BUG_ON: BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID);This is always true because owner corresponds to the level of theextent buffer/btree node in the btree.For the scenario described above we hit the BUG_ON because we nevermerge references that have different seq values.We used to do the merging before the 4.2 kernel, more specifically,before the commmits:c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_rootin ref_head.")c43d160fcd5e ("btrfs: delayed-ref: Cleanup the unneeded functions.")This issue became more exposed after the following change that was added to 4.2 as well: cffc3374e567 ("Btrfs: fix order by which delayed references are run")Which in turn fixed another regression by the two commits previouslymentioned.So fix this by bringing back the delayed reference merge code, with the proper adaptations so that it operates against the new data structure(linked list vs old red black tree implementation).This issue was hit running fstest btrfs/063 in a loop. Several peoplehavereported this issue in the mailing list when running on kernels 4.2+.Thanks Filipe, My fault again. :(But I'm not completely sure about if tree_mod_seq is still needed now.IIRC, with the new qgroup accounting happen at commit_transaction time,btrfs_find_all_roots() should either searching commit tree for old roots, or search current tree for new roots. No need to search using tree_mod_seq.If so, I'd like just allow merging refs without checking tree_mod_seq.And to make it clear, that wouldn't work. If a backref walker starts iterating the btrees and then new delayed refs get merged independently of the current tree mod seq, the walker will see an inconsistent state in the extent tree if the delayed references are run (which can happen often before a transaction commit). So either make delayed references continue using tree mod seq as before 4.2, or come with a whole new mechanism that replaces the tree mod seq while still giving the same consistency guarantees.I used your patch at https://patchwork.kernel.org/patch/7463161/ to build a4.3.0-rc6 kernel.I mounted my FS with skip_balance, cancelled the paused balance (just to besure), then started a new one with : # btrfs balance start -dconvert=raid5,soft /tank(half of my data blocks are RAID1, the other half is RAID5, the goal is tobe fully RAID5).I got a different stacktrace than the usual one, via netconsole, after a fewminutes. It's still referencing btrfs_run_delayed_refs though: [ 822.461809] ------------[ cut here ]------------ [ 822.461833] kernel BUG at fs/btrfs/extent-tree.c:2287! [ 822.461849] invalid opcode: 0000 [#1] SMP[ 822.461866] Modules linked in: nfnetlink_queue nfnetlink_log nfnetlink xt_multiport xt_comment xt_conntrack xt_nat xt_tcpudp xts gf128mul drbgansi_cprng btrfs nf_conntrack_ftp nf_conntrack_sane iptable_securityiptable_filter iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 [ 822.462365] CPU: 1 PID: 3810 Comm: btrfs-transacti Tainted: G W4.3.0-rc6p7463161+ #3 [ 822.462391] Hardware name: ASUS All Series/H87I-PLUS, BIOS 1005 01/06/2014 [ 822.462412] task: ffff88011a4a1a00 ti: ffff8800b6638000 task.ti: ffff8800b6638000 [ 822.462434] RIP: 0010:[<ffffffffc032310b>] [<ffffffffc032310b>] __btrfs_run_delayed_refs.constprop.73+0x108b/0x10d0 [btrfs] [ 822.462476] RSP: 0018:ffff8800b663bcb0 EFLAGS: 00010202 [ 822.462495] RAX: 0000000000000001 RBX: ffff8800b3821888 RCX: ffff8800b3a46cb8 [ 822.462517] RDX: 0000000000000001 RSI: 00000000000001e1 RDI: ffff8800b3a46cb0 [ 822.462543] RBP: ffff8800b663bdb8 R08: 0000000000000000 R09: ffff8800b3a46cb8 [ 822.462565] R10: ffff8800b3a46cb8 R11: ffff8800b3a46cb8 R12: 0000000000000000 [ 822.462587] R13: 0000000000000000 R14: 00000afefd330000 R15: ffff8800b3a46c38 [ 822.462687] FS: 0000000000000000(0000) GS:ffff88011fb00000(0000) knlGS:0000000000000000 [ 822.462716] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 822.462736] CR2: 00007ff7a21ca000 CR3: 0000000001c10000 CR4: 00000000000406e0 [ 822.462762] Stack: [ 822.462770] ffffffffc037c8be ffff88011fb169f0 0000000000000001 0000000000000001 [ 822.462802] 0000000000000001 ffff88011a4a1a60 ffff8800b663bd38 ffff8800b663bd40 [ 822.462831] ffffffff810aded5 ffffffff8101fcc9 ffff8800b663bd20 00000000000014d9 [ 822.462859] Call Trace: [ 822.462877] [<ffffffffc037c8be>] ? try_merge_free_space.isra.26+0x12e/0x180 [btrfs] [ 822.462902] [<ffffffff810aded5>] ? put_prev_entity+0x35/0x660 [ 822.462924] [<ffffffff8101fcc9>] ? sched_clock+0x9/0x10 [ 822.462949] [<ffffffffc0325dcd>] btrfs_run_delayed_refs+0x7d/0x2b0 [btrfs] [ 822.462972] [<ffffffff817aa7ab>] ? schedule_timeout+0x16b/0x2a0[ 822.462999] [<ffffffffc033a6b3>] btrfs_commit_transaction+0x43/0xb10[btrfs][ 822.463028] [<ffffffffc0335c19>] transaction_kthread+0x1a9/0x230 [btrfs] [ 822.463056] [<ffffffffc0335a70>] ? btrfs_cleanup_transaction+0x550/0x550[btrfs] [ 822.463080] [<ffffffff81097099>] kthread+0xc9/0xe0 [ 822.463096] [<ffffffff81096fd0>] ? kthread_park+0x60/0x60 [ 822.463116] [<ffffffff817aba8f>] ret_from_fork+0x3f/0x70 [ 822.463134] [<ffffffff81096fd0>] ? kthread_park+0x60/0x60[ 822.463153] Code: c0 48 8b bd 40 ff ff ff 31 c0 e8 31 94 fe ff 0f 0b be d3 00 00 00 48 c7 c7 3b c6 3b c0 e8 6e 63 d5 c0 e9 64 f9 ff ff 0f 0b 0f 0b<0f> 0b be d3 00 00 00 48 c7 c7 3b c6 3b c0 e8 52 63 d5 c0 e9 60 [ 822.463300] RIP [<ffffffffc032310b>] __btrfs_run_delayed_refs.constprop.73+0x108b/0x10d0 [btrfs] [ 822.463335] RSP <ffff8800b663bcb0> [ 822.472131] ---[ end trace f1e21f38cb0ea144 ]---A couple other stacktraces follow after some seconds, then the system diescompletely. sysrqd doesn't even work to reboot it remotely using sysrq logic.fs/btrfs/extent-tree.c:2287 is the line you get from a vanilla 4.3-rc6 + your patch. I'll post it as soon as I can get somebody to manually rebootthis remote machine (my kernel build machine is the same than the one hosting the btrfs FS).Don't hesitate to ask if you need me to debug or even ftrace something.Thanks Stéphane. I haven't seen that crash yet (still running tests for 2 consecutive days now). Can you please try the following patch, which works on top of mine, and enable ftrace before running balance: Debug patch: https://friendpaste.com/5s3dItRpcpq3dH1E4KUJor Enable ftrace: $ echo > /sys/kernel/debug/tracing/trace $ echo "nop" > /sys/kernel/debug/tracing/current_tracer $ echo 100000 > /sys/kernel/debug/tracing/buffer_size_kb # if you can use larger buffer size, even better $ echo > /sys/kernel/debug/tracing/set_ftrace_filter $ echo 1 > /sys/kernel/debug/tracing/tracing_on $ run balance... wait until it finishes with IO error or the patch's printk message shows up in dmesg/syslog $ echo 0 > /sys/kernel/debug/tracing/tracing_on $ cat /sys/kernel/debug/tracing/trace > some_file.txt Then send is some_file.txt for debugging, hopefully it will give some useful information. Note that it might produce tons of messages, depending on how long it takes for you to hit the BUG_ON. Thanks a lot for this.
I'm compiling it now (using your v2 of the friendpaste diff).I took the liberty to add a tracing_off() right before the return -EIO so that the trace tail ends exactly at the right place.
Last time I tried to use ftrace to diagnose the bug we're trying to fix, the system crashes so hard that usually it's complicated to get the trace contents written somewhere before the system is unusable. But I'll eventually work around it by using /sys/kernel/debug/tracing/trace_pipe to send the trace live to another machine over the LAN.
This series of bugs are so easy to trigger on my system that we'll hopefully get something useful out of the trace. I guess that's a good thing !
Thanks, -- Stéphane.I was going to do it but not completely sure is there any other user oftree_mod_seq. And if it's possible to get rid of tree_mod_seq and merge with last delayed_ref, things should get cleaner without new codes.Well, the tree mod seq is what allows backref walkers (and possibly other paths) to get a consistent view of all btrees and delayed refs state while doing some processing - that's why we have calls to btrfs_check_delayed_seq() when running delayed references - so thatany backref walker will not see changes that happen after it started,i.e. it will see a consistent view of all the btrees (like an in-memory snapshot of all btrees while the transaction is running).I don't think you can get this level of consistency through any otherexisting means.So just adding back yet more code that was removed despite still beingneeded. This is affecting way too many people now, I would like to get this fixed and later, if there's a better (new) solution for this, we can get it in. thanksThanks, QuFixes: c6fc24549960 ("btrfs: delayed-ref: Use list to replace the ref_root in ref_head.") Reported-by: Peter Becker <[email protected]> Reported-by: Stéphane Lesimple <[email protected]> Reported-by: Malte Schröder <[email protected]> Reported-by: Derek Dongray <[email protected]> Reported-by: Erkki Seppala <[email protected]> Cc: [email protected] # 4.2+ Signed-off-by: Filipe Manana <[email protected]> --- fs/btrfs/delayed-ref.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/extent-tree.c | 14 ++++++ 2 files changed, 127 insertions(+) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index ac3e81d..4832943 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -197,6 +197,119 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans, trans->delayed_ref_updates--; } +static bool merge_ref(struct btrfs_trans_handle *trans, + struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head, + struct btrfs_delayed_ref_node *ref, + u64 seq) +{ + struct btrfs_delayed_ref_node *next; + bool done = false; + + next = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node, + list); + while (!done && &next->list != &head->ref_list) { + int mod; + struct btrfs_delayed_ref_node *next2; + + next2 = list_next_entry(next, list); + + if (next == ref) + goto next; + + if (seq && next->seq >= seq) + goto next; + + if (next->type != ref->type || next->no_quota != ref->no_quota) + goto next; + + if ((ref->type == BTRFS_TREE_BLOCK_REF_KEY || + ref->type == BTRFS_SHARED_BLOCK_REF_KEY) &&+ comp_tree_refs(btrfs_delayed_node_to_tree_ref(ref),+ btrfs_delayed_node_to_tree_ref(next), + ref->type)) + goto next; + if ((ref->type == BTRFS_EXTENT_DATA_REF_KEY || + ref->type == BTRFS_SHARED_DATA_REF_KEY) &&+ comp_data_refs(btrfs_delayed_node_to_data_ref(ref),+ btrfs_delayed_node_to_data_ref(next))) + goto next; + + if (ref->action == next->action) { + mod = next->ref_mod; + } else { + if (ref->ref_mod < next->ref_mod) { + swap(ref, next); + done = true; + } + mod = -next->ref_mod; + } + + drop_delayed_ref(trans, delayed_refs, head, next); + ref->ref_mod += mod; + if (ref->ref_mod == 0) {+ drop_delayed_ref(trans, delayed_refs, head,ref); + done = true; + } else { + /*+ * Can't have multiples of the same ref on atree block. + */+ WARN_ON(ref->type == BTRFS_TREE_BLOCK_REF_KEY|| + ref->type == BTRFS_SHARED_BLOCK_REF_KEY); + } +next: + next = next2; + } + + return done; +} + +void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans, + struct btrfs_fs_info *fs_info, + struct btrfs_delayed_ref_root *delayed_refs, + struct btrfs_delayed_ref_head *head) +{ + struct btrfs_delayed_ref_node *ref; + u64 seq = 0; + + assert_spin_locked(&head->lock); + + if (list_empty(&head->ref_list)) + return; + + /* We don't have too many refs to merge for data. */ + if (head->is_data) + return; + + spin_lock(&fs_info->tree_mod_seq_lock); + if (!list_empty(&fs_info->tree_mod_seq_list)) { + struct seq_list *elem; ++ elem = list_first_entry(&fs_info->tree_mod_seq_list,+ struct seq_list, list); + seq = elem->seq; + } + spin_unlock(&fs_info->tree_mod_seq_lock); + + ref = list_first_entry(&head->ref_list, struct btrfs_delayed_ref_node, + list); + while (&ref->list != &head->ref_list) { + if (seq && ref->seq >= seq) + goto next; ++ if (merge_ref(trans, delayed_refs, head, ref, seq)) {+ if (list_empty(&head->ref_list)) + break; + ref = list_first_entry(&head->ref_list, + struct btrfs_delayed_ref_node, + list); + continue; + } +next: + ref = list_next_entry(ref, list); + } +} + int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info, struct btrfs_delayed_ref_root *delayed_refs, u64 seq) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 522fb45..42d9310 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2433,7 +2433,21 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans, } } + /*+ * We need to try and merge add/drops of the same refsince we + * can run into issues with relocate dropping the implicit ref+ * and then it being added back again before the dropcan+ * finish. If we merged anything we need to re-loop sowe can + * get a good ref.+ * Or we can get node references of the same type thatweren't+ * merged when created due to bumps in the tree modseq, and+ * we need to merge them to prevent adding an inlineextent+ * backref before dropping it (triggering a BUG_ON at+ * insert_inline_extent_backref()). + */ spin_lock(&locked_ref->lock);+ btrfs_merge_delayed_refs(trans, fs_info, delayed_refs,+ locked_ref); /** locked_ref is the head node, so we have to go one--To unsubscribe from this list: send the line "unsubscribe linux-btrfs" inthe body of a message to [email protected] More majordomo info at http://vger.kernel.org/majordomo-info.html--To unsubscribe from this list: send the line "unsubscribe linux-btrfs" inthe body of a message to [email protected] More majordomo info at http://vger.kernel.org/majordomo-info.html
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to [email protected] More majordomo info at http://vger.kernel.org/majordomo-info.html
