Hi Partha, In my testing over the weekend the patch performed well - I didn't see any kernel dumps due to this issue.
Thanks for the quick response. JT On Fri, Nov 18, 2016 at 10:34 AM, John Thompson <thompa....@gmail.com> wrote: > Hi, > > I will be able to have some test results by the start of next week on the > first patch. > > Regards, > JT > > > On Thu, Nov 17, 2016 at 11:27 PM, Ying Xue <ying....@windriver.com> wrote: > >> On 11/17/2016 07:04 AM, John Thompson wrote: >> >>> Hi Partha / Ying, >>> >>> I will try out the patch and let you know how it goes. >>> I also note about providing the other CPU core dumps - in one of my >>> cases I >>> didn't have them but in others I did but >>> they were interleaved and so were difficult to interpret. >>> >> >> Thanks, it's unnecessary for us to collect more logs as its soft lockup >> scenario should be just what Partha described. >> >> Regards, >> Ying >> >> >> >>> Thanks for getting a patch together so quickly. >>> >>> JT >>> >>> On Wed, Nov 16, 2016 at 10:23 PM, Parthasarathy Bhuvaragan < >>> parthasarathy.bhuvara...@ericsson.com> wrote: >>> >>> Hi Ying / John, >>>> >>>> The soft lock is the call chain of tipc_nametbl_withdraw(), when it >>>> performs the tipc_conn_kref_release() as it tries to grab nametbl_lock >>>> again while holding it already. >>>> >>>>> tipc_nametbl_withdraw >>>>> spin_lock_bh(&tn->nametbl_lock); >>>>> tipc_nametbl_remove_publ >>>>> spin_lock_bh(&seq->lock); >>>>> tipc_nameseq_remove_publ >>>>> tipc_subscrp_report_overlap >>>>> tipc_subscrp_send_event >>>>> tipc_conn_sendmsg >>>>> >>>> << Here, the (test_bit(CF_CONNECTED, &con->flags)) Fails, leading to the >>>> else case where do do a conn_put() and that triggers the cleanup as >>>> refcount reached 0. Leading the call chain below : >> >>>> tipc_conn_kref_release >>>> tipc_sock_release >>>> tipc_conn_release >>>> tipc_subscrb_delete >>>> tipc_subscrp_delete >>>> tipc_nametbl_unsubscribe >>>> spin_lock_bh(&tn->nametbl_lock); << !! Soft Lockup >> >>>> >>>> One cause is that tipc_exit_net() calls first calls tipc_topsrv_stop() >>>> and >>>> then tipc_nametbl_withdraw() in scope of tipc_net_stop(). >>>> >>>> The above chain will only occur in a narrow window for a given >>>> connection: >>>> CPU#1: >>>> tipc_nametbl_withdraw() manages to perform tipc_conn_lookup() and steps >>>> the refcount to 2, while in CPU#2 the following occurs: >>>> CPU#2: >>>> tipc_server_stop() calls tipc_close_conn(con). This performs a >>>> conn_put() >>>> decrementing refcount to 1. >>>> Now, CPU#1 continues and detects that the connection is not CF_CONNECTED >>>> and does a conn_put(), triggering the release callback. >>>> >>>> Before commit 333f796235a527, the above wont happen. >>>> >>>> /Partha >>>> >>>> >>>> On 11/15/2016 04:11 PM, Xue, Ying wrote: >>>> >>>> Hi John, >>>>> >>>>> Regarding the stack trace you provided below, I get the two potential >>>>> call chains: >>>>> >>>>> tipc_nametbl_withdraw >>>>> spin_lock_bh(&tn->nametbl_lock); >>>>> tipc_nametbl_remove_publ >>>>> spin_lock_bh(&seq->lock); >>>>> tipc_nameseq_remove_publ >>>>> tipc_subscrp_report_overlap >>>>> tipc_subscrp_send_event >>>>> tipc_conn_sendmsg >>>>> spin_lock_bh(&con->outqueue_lock); >>>>> list_add_tail(&e->list, &con->outqueue); >>>>> >>>>> >>>>> tipc_topsrv_stop >>>>> tipc_server_stop >>>>> tipc_close_conn >>>>> kernel_sock_shutdown >>>>> tipc_subscrb_delete >>>>> spin_lock_bh(&subscriber->lock); >>>>> tipc_nametbl_unsubscribe(sub); >>>>> spin_lock_bh(&tn->nametbl_lock); >>>>> >>>>> Although I suspect this is a revert lock issue leading to the soft >>>>> lockup, I am still unable to understand which lock together with >>>>> nametbl_lock is taken reversely on the two different paths above. >>>>> However, you just gave us the log printed on CPU#2, but the logs >>>>> outputted by other cores are also important. So if possible, please >>>>> share >>>>> them with us. >>>>> >>>>> By the way, I agree with you, and it seems that commit 333f796235a527 >>>>> is >>>>> related to the soft lockup. >>>>> >>>>> Regards, >>>>> Ying >>>>> >>>>> -----Original Message----- >>>>> From: John Thompson [mailto:thompa....@gmail.com] >>>>> Sent: Tuesday, November 15, 2016 8:01 AM >>>>> To: tipc-discussion@lists.sourceforge.net >>>>> Subject: [tipc-discussion] v4.7: soft lockup when releasing a socket >>>>> >>>>> Hi, >>>>> >>>>> I am seeing an occasional kernel soft lockup. I have TIPC v4.7 and the >>>>> kernel dump occurs when the system is going down for a reboot. >>>>> >>>>> The kernel dump is: >>>>> >>>>> <0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [exfx:1474] >>>>> <6>Modules linked in: tipc jitterentropy_rng echainiv drbg >>>>> platform_driver(O) ipifwd(PO) >>>>> ... >>>>> <6> >>>>> <6>GPR00: c15333e8 a4e0fb80 a4ee3600 a51748ac 00000000 ae475024 >>>>> a537feec >>>>> fffffffd >>>>> <6>GPR08: a2197408 00000001 00000001 00000004 80691c00 <6>NIP >>>>> [80691c40] >>>>> _raw_spin_lock_bh+0x40/0x70 <6>LR [c1534f30] >>>>> tipc_nametbl_unsubscribe+0x50/0x120 >>>>> [tipc] <6>Call Trace: >>>>> <6>[a4e0fba0] [c15333e8] tipc_named_reinit+0xf8/0x820 [tipc] >>>>> <6>[a4e0fbb0] [c15336a0] tipc_named_reinit+0x3b0/0x820 [tipc] >>>>> <6>[a4e0fbd0] >>>>> [c1540bac] tipc_nl_publ_dump+0x50c/0xed0 [tipc] <6>[a4e0fc00] >>>>> [c154164c] >>>>> tipc_conn_sendmsg+0xdc/0x170 [tipc] <6>[a4e0fc30] [c1533c9c] >>>>> tipc_subscrp_report_overlap+0xbc/0xd0 [tipc] <6>[a4e0fc70] [c153425c] >>>>> tipc_topsrv_stop+0x45c/0x4f0 [tipc] <6>[a4e0fca0] [c1534788] >>>>> tipc_nametbl_remove_publ+0x58/0x110 [tipc] <6>[a4e0fcd0] [c1534c48] >>>>> tipc_nametbl_withdraw+0x68/0x140 [tipc] <6>[a4e0fd00] [c153cc24] >>>>> tipc_nl_node_dump_link+0x1904/0x45d0 [tipc] <6>[a4e0fd30] [c153d838] >>>>> tipc_nl_node_dump_link+0x2518/0x45d0 [tipc] <6>[a4e0fd70] [804f2870] >>>>> sock_release+0x30/0xf0 <6>[a4e0fd80] [804f2944] sock_close+0x14/0x30 >>>>> <6>[a4e0fd90] [80105844] __fput+0x94/0x200 <6>[a4e0fdb0] [8003dca4] >>>>> task_work_run+0xd4/0x100 <6>[a4e0fdd0] [80023620] do_exit+0x280/0x980 >>>>> <6>[a4e0fe10] [80024c48] do_group_exit+0x48/0xb0 <6>[a4e0fe30] >>>>> [80030344] >>>>> get_signal+0x244/0x4f0 <6>[a4e0fe80] [80007734] do_signal+0x34/0x1c0 >>>>> <6>[a4e0ff30] [800079a8] do_notify_resume+0x68/0x80 <6>[a4e0ff40] >>>>> [8000fa1c] do_user_signal+0x74/0xc4 >>>>> >>>>> >>>>> From the stack dump it looks like tipc_named_reinit is trying to >>>>> >>>>>> >>>>>> acquire nametbl_lock. >>>>> >>>>> From looking at the call chain I can see that tipc_conn_sendmsg can >>>>> >>>>>> >>>>>> send up calling conn_put >>>>> >>>>> which will go on and call the tipc_named_reinit via tipc_sock_release. >>>>> >>>>> As tipc_nametbl_withdraw (from the stack dump) has already acquired the >>>>> nametbl_lock, tipc_named_reinit >>>>> >>>>> cannot get it and so the process hangs. >>>>> >>>>> >>>>> The call to tipc_sock_release (added in Commit 333f796235a527 >>>>> <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux- >>>>> stable.git/commit/?id=333f796235a52727db7e0a13888045f3aa3d5335>) >>>>> seems to have changed the behaviour >>>>> >>>>> such that it tries to do a lot more when shutting the connection down. >>>>> >>>>> >>>>> If there is other information I can provide please let me know. >>>>> >>>>> Regards, >>>>> >>>>> John >>>>> ------------------------------------------------------------ >>>>> ------------------ >>>>> _______________________________________________ >>>>> tipc-discussion mailing list >>>>> tipc-discussion@lists.sourceforge.net >>>>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>>>> ------------------------------------------------------------ >>>>> ------------------ >>>>> _______________________________________________ >>>>> tipc-discussion mailing list >>>>> tipc-discussion@lists.sourceforge.net >>>>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>>>> >>>>> >>>>> >>>> ------------------------------------------------------------ >>> ------------------ >>> _______________________________________________ >>> tipc-discussion mailing list >>> tipc-discussion@lists.sourceforge.net >>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> >>> >> > ------------------------------------------------------------------------------ _______________________________________________ tipc-discussion mailing list tipc-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/tipc-discussion