Hi John, Ok. Can you test with the 2 patches I posted regarding the nametbl_lock?
/Partha

On 11/21/2016 11:32 PM, John Thompson wrote:

Hi Partha,

My test has 4 nodes, 2 of which are alternately rebooting. When the rebooted node rejoins, a few minutes pass and then the other node is rebooted.
I am not printing out link stats, and believe that the other code is not doing so either when nodes leave or rejoin.

JT

On Tue, Nov 22, 2016 at 2:22 AM, Parthasarathy Bhuvaragan <parthasarathy.bhuvara...@ericsson.com> wrote:

Hi,

There is another branch where the softlockup on nametbl_lock occurs:

tipc_named_rcv()                                  << grabs nametbl_lock >>
  tipc_update_nametbl() (publish/withdraw)
    tipc_node_subscribe()/unsubscribe()
      tipc_node_write_unlock()
      << lockup occurs if it needs to process NODE UP/DOWN or LINK UP/DOWN, as it grabs nametbl_lock again >>

/Partha

On 11/21/2016 01:04 PM, Parthasarathy Bhuvaragan wrote:

Hi,

tipc_nametbl_withdraw() triggers the softlockup, as it tries to grab nametbl_lock twice if the node triggered a TIPC_NOTIFY_LINK_DOWN event while it is running. The erroneous call chain is:

tipc_nametbl_withdraw()                           << grabs nametbl_lock >>
  tipc_named_process_backlog()
    tipc_update_nametbl()
      if (dtype == WITHDRAWAL) tipc_node_unsubscribe()
        tipc_node_write_unlock()
          if (flags & TIPC_NOTIFY_LINK_DOWN)
            tipc_nametbl_withdraw()
              spin_lock_bh(&tn->nametbl_lock);    << Soft Lockup >>

Three callers can cause this under module exit:

Case 1:
tipc_exit_net()
  tipc_nametbl_withdraw()                         << grabs nametbl_lock >>

Case 2:
tipc_server_stop()
  tipc_conn_kref_release()
    tipc_sock_release()
      sock_release()
        tipc_release()
          tipc_sk_withdraw()
            tipc_nametbl_withdraw()

Case 3:
tipc_server_stop()
  tipc_conn_kref_release()
    kernel_bind()
      tipc_bind()
        tipc_sk_withdraw()
          tipc_nametbl_withdraw()

I will work on a solution for this.

What kind of test were you performing when this occurred (link up/down)? Do you read link statistics periodically in your tests?

/Partha
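Both chains above reduce to the same pattern: tn->nametbl_lock is an ordinary, non-recursive spinlock, and a deeper layer tries to take it again on the same CPU while it is still held, so _raw_spin_lock_bh() spins until the watchdog fires. A minimal sketch of that pattern, with simplified names rather than the actual TIPC code:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(nametbl_lock);   /* stands in for tn->nametbl_lock */

/* deeper layer reacting to a node/link event */
static void notify_link_down(void)
{
	spin_lock_bh(&nametbl_lock);    /* second acquisition on the same CPU: spins forever */
	/* ... withdraw/unsubscribe work ... */
	spin_unlock_bh(&nametbl_lock);
}

static void name_table_withdraw(void)
{
	spin_lock_bh(&nametbl_lock);    /* first acquisition */
	/* ... name table update raises a link-down event ... */
	notify_link_down();             /* never returns: soft lockup */
	spin_unlock_bh(&nametbl_lock);
}

This is consistent with the dumps below, which show the task stuck in _raw_spin_lock_bh() with tipc_nametbl_unsubscribe() in the LR.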
On 11/21/2016 05:30 AM, John Thompson wrote:

Hi Partha,

I was doing some more testing today and have still observed the problem (contrary to what I had emailed earlier).

Here is the kernel dump:

<0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [Pluggable Serve:2221]
<6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
<6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P O
<6>task: ae54ced0 ti: aec42000 task.ti: aec42000
<6>NIP: 8069257c LR: c13ebf50 CTR: 80692540
<6>REGS: aec43ad0 TRAP: 0901 Tainted: P O
<6>MSR: 00029002 <CE,EE,ME> CR: 48002444 XER: 00000000
<6>
<6>GPR00: c13ea408 aec43b80 ae54ced0 a624690c 00000000 a6271d84 a39a60cc fffffffd
<6>GPR08: aeefbbc8 00000001 00000001 00000004 80692540
<6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
<6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
<6>Call Trace:
<6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
<6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
<6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
<6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
<6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
<6>[aec43c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
<6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
<6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
<6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
<6>[aec43d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[aec43d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]

TIPC_CMD_SHOW_LINK_STATS or TIPC_NL_LINK_GET

<6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
<6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
<6>[aec43d90] [80105844] __fput+0x94/0x200
<6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
<6>[aec43dd0] [80023620] do_exit+0x280/0x980
<6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
<6>[aec43e30] [80030344] get_signal+0x244/0x4f0
<6>[aec43e80] [80007734] do_signal+0x34/0x1c0
<6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
<6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
<6>--- interrupt: c00 at 0xf4f3d08
<6>    LR = 0xf4f3ce8
<6>Instruction dump:
<6>912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0 7c2004ac
<6>2f890000 4dbe0020 7c210b78 81230000 <2f890000> 40befff4 7c421378 7d201828
<0>Kernel panic - not syncing: softlockup: hung tasks
<6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P O L
<6>Call Trace:
<6>[aec43930] [80694e20] dump_stack+0x84/0xb0 (unreliable)
<6>[aec43940] [80692ca8] panic+0xd8/0x214
<6>[aec439a0] [800a0258] watchdog_timer_fn+0x2d8/0x2e0
<6>[aec439f0] [8007ae58] __hrtimer_run_queues+0x118/0x1d0
<6>[aec43a30] [8007b608] hrtimer_interrupt+0xd8/0x270
<6>[aec43a80] [8000983c] __timer_interrupt+0xac/0x1b0
<6>[aec43aa0] [80009b70] timer_interrupt+0xb0/0xe0
<6>[aec43ac0] [8000f450] ret_from_except+0x0/0x18
<6>--- interrupt: 901 at _raw_spin_lock_bh+0x3c/0x70
<6>    LR = tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
<6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
<6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
<6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
<6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
<6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
<6>[aec43c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
<6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
<6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
<6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
<6>[aec43d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[aec43d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
<6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
<6>[aec43d90] [80105844] __fput+0x94/0x200
<6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
<6>[aec43dd0] [80023620] do_exit+0x280/0x980
<6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
<6>[aec43e30] [80030344] get_signal+0x244/0x4f0
<6>[aec43e80] [80007734] do_signal+0x34/0x1c0
<6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
<6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
<6>--- interrupt: c00 at 0xf4f3d08
<6>    LR = 0xf4f3ce8

On Mon, Nov 21, 2016 at 9:59 AM, John Thompson <thompa....@gmail.com> wrote:

Hi Partha,

In my testing over the weekend the patch performed well - I didn't see any kernel dumps due to this issue.

Thanks for the quick response.
JT

On Fri, Nov 18, 2016 at 10:34 AM, John Thompson <thompa....@gmail.com> wrote:

Hi,

I will be able to have some test results by the start of next week on the first patch.

Regards,
JT

On Thu, Nov 17, 2016 at 11:27 PM, Ying Xue <ying....@windriver.com> wrote:

On 11/17/2016 07:04 AM, John Thompson wrote:

Hi Partha / Ying,

I will try out the patch and let you know how it goes.
On providing the other CPU core dumps: in one of my cases I didn't have them; in others I did, but they were interleaved and so were difficult to interpret.

Thanks, it's unnecessary for us to collect more logs, as the soft lockup scenario should be just what Partha described.

Regards,
Ying

Thanks for getting a patch together so quickly.

JT
On Wed, Nov 16, 2016 at 10:23 PM, Parthasarathy Bhuvaragan <parthasarathy.bhuvara...@ericsson.com> wrote:

Hi Ying / John,

The soft lockup is in the call chain of tipc_nametbl_withdraw() when it ends up performing tipc_conn_kref_release(), as it tries to grab nametbl_lock again while already holding it:

tipc_nametbl_withdraw
  spin_lock_bh(&tn->nametbl_lock);
  tipc_nametbl_remove_publ
    spin_lock_bh(&seq->lock);
    tipc_nameseq_remove_publ
      tipc_subscrp_report_overlap
        tipc_subscrp_send_event
          tipc_conn_sendmsg
          << Here the test_bit(CF_CONNECTED, &con->flags) fails, leading to the else case where we do a conn_put(), and that triggers the cleanup as the refcount reached 0, leading to the call chain below: >>
          tipc_conn_kref_release
            tipc_sock_release
            tipc_conn_release
              tipc_subscrb_delete
                tipc_subscrp_delete
                  tipc_nametbl_unsubscribe
                    spin_lock_bh(&tn->nametbl_lock);    << !! Soft Lockup >>

One cause is that tipc_exit_net() first calls tipc_topsrv_stop() and then tipc_nametbl_withdraw() in the scope of tipc_net_stop().

The above chain will only occur in a narrow window for a given connection:

CPU#1: tipc_nametbl_withdraw() manages to perform tipc_conn_lookup() and steps the refcount to 2, while on CPU#2 the following occurs:
CPU#2: tipc_server_stop() calls tipc_close_conn(con). This performs a conn_put(), decrementing the refcount to 1.
Now, CPU#1 continues, detects that the connection is not CF_CONNECTED and does a conn_put(), triggering the release callback.

Before commit 333f796235a527, the above won't happen.

/Partha
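As a rough illustration of that window (simplified names, not the actual TIPC code), the key point is that the final reference can be dropped from the withdraw path, so the connection's release callback runs in a context that still holds nametbl_lock:

#include <linux/kref.h>

#define CF_CONNECTED	1

struct conn {
	struct kref kref;
	unsigned long flags;
};

static void conn_release(struct kref *kref)
{
	/* cleanup ends in tipc_nametbl_unsubscribe(), which takes
	 * nametbl_lock; deadlock if the caller already holds it */
}

static inline void conn_put(struct conn *con)
{
	kref_put(&con->kref, conn_release);
}

/* CPU#1 (withdraw path, nametbl_lock held):
 *   conn_lookup() takes a reference                  refcount 1 -> 2
 * CPU#2 (tipc_server_stop):
 *   tipc_close_conn() clears CF_CONNECTED
 *   and drops its reference                          refcount 2 -> 1
 * CPU#1 (tipc_conn_sendmsg):
 *   sees !test_bit(CF_CONNECTED, &con->flags)
 *   and calls conn_put()                             refcount 1 -> 0
 *   so conn_release() runs under nametbl_lock -> soft lockup
 */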
On 11/15/2016 04:11 PM, Xue, Ying wrote:

Hi John,

Regarding the stack trace you provided below, I get the two potential call chains:

tipc_nametbl_withdraw
  spin_lock_bh(&tn->nametbl_lock);
  tipc_nametbl_remove_publ
    spin_lock_bh(&seq->lock);
    tipc_nameseq_remove_publ
      tipc_subscrp_report_overlap
        tipc_subscrp_send_event
          tipc_conn_sendmsg
            spin_lock_bh(&con->outqueue_lock);
            list_add_tail(&e->list, &con->outqueue);

tipc_topsrv_stop
  tipc_server_stop
    tipc_close_conn
      kernel_sock_shutdown
      tipc_subscrb_delete
        spin_lock_bh(&subscriber->lock);
        tipc_nametbl_unsubscribe(sub);
          spin_lock_bh(&tn->nametbl_lock);

Although I suspect this is a lock inversion issue leading to the soft lockup, I am still unable to understand which lock together with nametbl_lock is taken in reverse order on the two different paths above. However, you only gave us the log printed on CPU#2; the logs output by the other cores are also important, so if possible please share them with us.

By the way, I agree with you, and it seems that commit 333f796235a527 is related to the soft lockup.

Regards,
Ying

-----Original Message-----
From: John Thompson [mailto:thompa....@gmail.com]
Sent: Tuesday, November 15, 2016 8:01 AM
To: tipc-discussion@lists.sourceforge.net
Subject: [tipc-discussion] v4.7: soft lockup when releasing a socket

Hi,

I am seeing an occasional kernel soft lockup. I have TIPC v4.7 and the kernel dump occurs when the system is going down for a reboot.

The kernel dump is:

<0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [exfx:1474]
<6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O) ipifwd(PO)
...
<6>
<6>GPR00: c15333e8 a4e0fb80 a4ee3600 a51748ac 00000000 ae475024 a537feec fffffffd
<6>GPR08: a2197408 00000001 00000001 00000004 80691c00
<6>NIP [80691c40] _raw_spin_lock_bh+0x40/0x70
<6>LR [c1534f30] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
<6>Call Trace:
<6>[a4e0fba0] [c15333e8] tipc_named_reinit+0xf8/0x820 [tipc]
<6>[a4e0fbb0] [c15336a0] tipc_named_reinit+0x3b0/0x820 [tipc]
<6>[a4e0fbd0] [c1540bac] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
<6>[a4e0fc00] [c154164c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
<6>[a4e0fc30] [c1533c9c] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
<6>[a4e0fc70] [c153425c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
<6>[a4e0fca0] [c1534788] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
<6>[a4e0fcd0] [c1534c48] tipc_nametbl_withdraw+0x68/0x140 [tipc]
<6>[a4e0fd00] [c153cc24] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[a4e0fd30] [c153d838] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[a4e0fd70] [804f2870] sock_release+0x30/0xf0
<6>[a4e0fd80] [804f2944] sock_close+0x14/0x30
<6>[a4e0fd90] [80105844] __fput+0x94/0x200
<6>[a4e0fdb0] [8003dca4] task_work_run+0xd4/0x100
<6>[a4e0fdd0] [80023620] do_exit+0x280/0x980
<6>[a4e0fe10] [80024c48] do_group_exit+0x48/0xb0
<6>[a4e0fe30] [80030344] get_signal+0x244/0x4f0
<6>[a4e0fe80] [80007734] do_signal+0x34/0x1c0
<6>[a4e0ff30] [800079a8] do_notify_resume+0x68/0x80
<6>[a4e0ff40] [8000fa1c] do_user_signal+0x74/0xc4

From the stack dump it looks like tipc_named_reinit is trying to acquire nametbl_lock.

From looking at the call chain I can see that tipc_conn_sendmsg can end up calling conn_put, which will go on and call tipc_named_reinit via tipc_sock_release. As tipc_nametbl_withdraw (from the stack dump) has already acquired the nametbl_lock, tipc_named_reinit cannot get it and so the process hangs.

The call to tipc_sock_release (added in commit 333f796235a527 <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux-stable.git/commit/?id=333f796235a52727db7e0a13888045f3aa3d5335>) seems to have changed the behaviour such that it tries to do a lot more when shutting the connection down.

If there is other information I can provide please let me know.
Regards,
John

------------------------------------------------------------------------------
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion