On 11/17/2016 07:04 AM, John Thompson wrote: > Hi Partha / Ying, > > I will try out the patch and let you know how it goes. > I also note about providing the other CPU core dumps - in one of my cases I > didn't have them but in others I did but > they were interleaved and so were difficult to interpret.
Thanks, it's unnecessary for us to collect more logs as its soft lockup scenario should be just what Partha described. Regards, Ying > > Thanks for getting a patch together so quickly. > > JT > > On Wed, Nov 16, 2016 at 10:23 PM, Parthasarathy Bhuvaragan < > parthasarathy.bhuvara...@ericsson.com> wrote: > >> Hi Ying / John, >> >> The soft lock is the call chain of tipc_nametbl_withdraw(), when it >> performs the tipc_conn_kref_release() as it tries to grab nametbl_lock >> again while holding it already. >>> tipc_nametbl_withdraw >>> spin_lock_bh(&tn->nametbl_lock); >>> tipc_nametbl_remove_publ >>> spin_lock_bh(&seq->lock); >>> tipc_nameseq_remove_publ >>> tipc_subscrp_report_overlap >>> tipc_subscrp_send_event >>> tipc_conn_sendmsg >> << Here, the (test_bit(CF_CONNECTED, &con->flags)) Fails, leading to the >> else case where do do a conn_put() and that triggers the cleanup as >> refcount reached 0. Leading the call chain below : >> >> tipc_conn_kref_release >> tipc_sock_release >> tipc_conn_release >> tipc_subscrb_delete >> tipc_subscrp_delete >> tipc_nametbl_unsubscribe >> spin_lock_bh(&tn->nametbl_lock); << !! Soft Lockup >> >> >> One cause is that tipc_exit_net() calls first calls tipc_topsrv_stop() and >> then tipc_nametbl_withdraw() in scope of tipc_net_stop(). >> >> The above chain will only occur in a narrow window for a given connection: >> CPU#1: >> tipc_nametbl_withdraw() manages to perform tipc_conn_lookup() and steps >> the refcount to 2, while in CPU#2 the following occurs: >> CPU#2: >> tipc_server_stop() calls tipc_close_conn(con). This performs a conn_put() >> decrementing refcount to 1. >> Now, CPU#1 continues and detects that the connection is not CF_CONNECTED >> and does a conn_put(), triggering the release callback. >> >> Before commit 333f796235a527, the above wont happen. >> >> /Partha >> >> >> On 11/15/2016 04:11 PM, Xue, Ying wrote: >> >>> Hi John, >>> >>> Regarding the stack trace you provided below, I get the two potential >>> call chains: >>> >>> tipc_nametbl_withdraw >>> spin_lock_bh(&tn->nametbl_lock); >>> tipc_nametbl_remove_publ >>> spin_lock_bh(&seq->lock); >>> tipc_nameseq_remove_publ >>> tipc_subscrp_report_overlap >>> tipc_subscrp_send_event >>> tipc_conn_sendmsg >>> spin_lock_bh(&con->outqueue_lock); >>> list_add_tail(&e->list, &con->outqueue); >>> >>> >>> tipc_topsrv_stop >>> tipc_server_stop >>> tipc_close_conn >>> kernel_sock_shutdown >>> tipc_subscrb_delete >>> spin_lock_bh(&subscriber->lock); >>> tipc_nametbl_unsubscribe(sub); >>> spin_lock_bh(&tn->nametbl_lock); >>> >>> Although I suspect this is a revert lock issue leading to the soft >>> lockup, I am still unable to understand which lock together with >>> nametbl_lock is taken reversely on the two different paths above. >>> However, you just gave us the log printed on CPU#2, but the logs >>> outputted by other cores are also important. So if possible, please share >>> them with us. >>> >>> By the way, I agree with you, and it seems that commit 333f796235a527 is >>> related to the soft lockup. >>> >>> Regards, >>> Ying >>> >>> -----Original Message----- >>> From: John Thompson [mailto:thompa....@gmail.com] >>> Sent: Tuesday, November 15, 2016 8:01 AM >>> To: tipc-discussion@lists.sourceforge.net >>> Subject: [tipc-discussion] v4.7: soft lockup when releasing a socket >>> >>> Hi, >>> >>> I am seeing an occasional kernel soft lockup. I have TIPC v4.7 and the >>> kernel dump occurs when the system is going down for a reboot. >>> >>> The kernel dump is: >>> >>> <0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [exfx:1474] >>> <6>Modules linked in: tipc jitterentropy_rng echainiv drbg >>> platform_driver(O) ipifwd(PO) >>> ... >>> <6> >>> <6>GPR00: c15333e8 a4e0fb80 a4ee3600 a51748ac 00000000 ae475024 a537feec >>> fffffffd >>> <6>GPR08: a2197408 00000001 00000001 00000004 80691c00 <6>NIP [80691c40] >>> _raw_spin_lock_bh+0x40/0x70 <6>LR [c1534f30] >>> tipc_nametbl_unsubscribe+0x50/0x120 >>> [tipc] <6>Call Trace: >>> <6>[a4e0fba0] [c15333e8] tipc_named_reinit+0xf8/0x820 [tipc] >>> <6>[a4e0fbb0] [c15336a0] tipc_named_reinit+0x3b0/0x820 [tipc] <6>[a4e0fbd0] >>> [c1540bac] tipc_nl_publ_dump+0x50c/0xed0 [tipc] <6>[a4e0fc00] [c154164c] >>> tipc_conn_sendmsg+0xdc/0x170 [tipc] <6>[a4e0fc30] [c1533c9c] >>> tipc_subscrp_report_overlap+0xbc/0xd0 [tipc] <6>[a4e0fc70] [c153425c] >>> tipc_topsrv_stop+0x45c/0x4f0 [tipc] <6>[a4e0fca0] [c1534788] >>> tipc_nametbl_remove_publ+0x58/0x110 [tipc] <6>[a4e0fcd0] [c1534c48] >>> tipc_nametbl_withdraw+0x68/0x140 [tipc] <6>[a4e0fd00] [c153cc24] >>> tipc_nl_node_dump_link+0x1904/0x45d0 [tipc] <6>[a4e0fd30] [c153d838] >>> tipc_nl_node_dump_link+0x2518/0x45d0 [tipc] <6>[a4e0fd70] [804f2870] >>> sock_release+0x30/0xf0 <6>[a4e0fd80] [804f2944] sock_close+0x14/0x30 >>> <6>[a4e0fd90] [80105844] __fput+0x94/0x200 <6>[a4e0fdb0] [8003dca4] >>> task_work_run+0xd4/0x100 <6>[a4e0fdd0] [80023620] do_exit+0x280/0x980 >>> <6>[a4e0fe10] [80024c48] do_group_exit+0x48/0xb0 <6>[a4e0fe30] [80030344] >>> get_signal+0x244/0x4f0 <6>[a4e0fe80] [80007734] do_signal+0x34/0x1c0 >>> <6>[a4e0ff30] [800079a8] do_notify_resume+0x68/0x80 <6>[a4e0ff40] >>> [8000fa1c] do_user_signal+0x74/0xc4 >>> >>> >>> From the stack dump it looks like tipc_named_reinit is trying to >>>> >>> acquire nametbl_lock. >>> >>> From looking at the call chain I can see that tipc_conn_sendmsg can >>>> >>> send up calling conn_put >>> >>> which will go on and call the tipc_named_reinit via tipc_sock_release. >>> >>> As tipc_nametbl_withdraw (from the stack dump) has already acquired the >>> nametbl_lock, tipc_named_reinit >>> >>> cannot get it and so the process hangs. >>> >>> >>> The call to tipc_sock_release (added in Commit 333f796235a527 >>> <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux- >>> stable.git/commit/?id=333f796235a52727db7e0a13888045f3aa3d5335>) >>> seems to have changed the behaviour >>> >>> such that it tries to do a lot more when shutting the connection down. >>> >>> >>> If there is other information I can provide please let me know. >>> >>> Regards, >>> >>> John >>> ------------------------------------------------------------ >>> ------------------ >>> _______________________________________________ >>> tipc-discussion mailing list >>> tipc-discussion@lists.sourceforge.net >>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> ------------------------------------------------------------ >>> ------------------ >>> _______________________________________________ >>> tipc-discussion mailing list >>> tipc-discussion@lists.sourceforge.net >>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> >>> >> > ------------------------------------------------------------------------------ > _______________________________________________ > tipc-discussion mailing list > tipc-discussion@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/tipc-discussion > ------------------------------------------------------------------------------ _______________________________________________ tipc-discussion mailing list tipc-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/tipc-discussion