Hi Ying, I ran the standard test suite, and found it didn't work even once. So the topology subscription is totally broken, and the fact that this was delivered to 'net' makes it even more urgent to fix it. Since Partha will be out of office for the next three weeks, it must be you or me who fix this.
My findings so far: - Replacing the premature 'return' statements with gotos made the test go a little further, but we still get a deadlock. - I identified the following scenario, but there are certainly others, since John is seeing a different, probably related one: Thread 1: Thread 2: -------------- ------------ nametbl_publish() nametbl_lock() nametbl_insert() subscrp_report_overlap() // kref == 2 tipc_close_conn() // rcv error, - a normal event conn_put() tipc_conn_kref_release() tipc_sock_release() tipc_subscrb_delete() tipc_subscrb_subscrp_delete() spin_lock_bh(sub_lock) subscrp_get() // kref 2->3 spin_unlock(sub_lock) tipc_subscrp_delete() subscrp_put() //kref 3->2 spin_lock(sub_lock) subscrp_put() //kref 2->1 subscrp_put() // 1->0 tipc_subscrp_kref_release() spink_lock(sub_lock) nametbl_unsubscribe() nametbl_lock() //DEADLOCK! Moving the tipc_nametbl_unsubscribe() call from tipc_subscrp_release() to tipc_subscrp_delete() as follows: static void tipc_subscrp_delete(struct tipc_subscription *sub) { u32 timeout = htohl(sub->evt.s.timeout, sub->swap); if (timeout == TIPC_WAIT_FOREVER || del_timer(&sub->timer)) { tipc_nametbl_unsubscribe(sub); tipc_subscrp_put(sub); } } made the test suite go through, but I need a second opinion on this, and I think you know this part of the code better than me now. Maybe it could be something for John to test too? BR ///jon > -----Original Message----- > From: Xue, Ying [mailto:ying....@windriver.com] > Sent: Tuesday, February 14, 2017 05:16 AM > To: John Thompson <thompa....@gmail.com>; tipc- > discuss...@lists.sourceforge.net > Subject: Re: [tipc-discussion] Nametable soft lockup > > Hi John, > > Thanks for the report and good analysis! > > You are absolutely right, and I will figure out how to fix the potential > deadlock > issue. The second issue you pointed out below also really exists. Good catch! > > Regards, > Ying > > -----Original Message----- > From: John Thompson [mailto:thompa....@gmail.com] > Sent: Thursday, February 09, 2017 6:37 AM > To: tipc-discussion@lists.sourceforge.net > Subject: [tipc-discussion] Nametable soft lockup > > Hi, > > I have been using the patches Partha had provided for the nametable soft > lockup, and that I had tested. This was seen when testing on a SMP system. > > Unfortunately I have come across another nametable soft lockup: > > <0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [AIS > listener:1591] <6>Modules linked in: tipc jitterentropy_rng echainiv drbg > platform_driver(O) ipifwd(PO) > <6>CPU: 0 PID: 1591 Comm: AIS listener Tainted: P O > <6>task: ae393600 ti: ae286000 task.ti: ae286000 > <6>NIP: 806952bc LR: c160bfe0 CTR: 80695280 > <6>REGS: ae287b40 TRAP: 0901 Tainted: P O > <6>MSR: 00029002 <CE,EE,ME> CR: 48002484 XER: 00000000 <6> > <6>GPR00: c160a64c ae287bf0 ae393600 a20f18ac 00000000 00000000 > ae064fbc > 00000030 > <6>GPR08: 01001006 00000001 00000001 00000006 80695280 <6>NIP > [806952bc] _raw_spin_lock_bh+0x3c/0x70 <6>LR [c160bfe0] > tipc_nametbl_unsubscribe+0x50/0x120 [tipc] <6>Call Trace: > <6>[ae287c10] [c160a64c] tipc_named_reinit+0x33c/0x8a0 [tipc] > <6>[ae287c30] [c160ad44] tipc_subscrp_report_overlap+0xc4/0xe0 [tipc] > <6>[ae287c70] [c160b30c] tipc_topsrv_stop+0x45c/0x4f0 [tipc] > <6>[ae287ca0] [c160b838] tipc_nametbl_remove_publ+0x58/0x110 [tipc] > <6>[ae287cd0] [c160bcf8] tipc_nametbl_withdraw+0x68/0x140 [tipc] > <6>[ae287d00] [c1613cd4] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc] > <6>[ae287d30] [c16148e8] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc] > <6>[ae287d70] [804f5a40] sock_release+0x30/0xf0 <6>[ae287d80] > [804f5b14] sock_close+0x14/0x30 <6>[ae287d90] [80105844] > __fput+0x94/0x200 <6>[ae287db0] [8003dca4] task_work_run+0xd4/0x100 > <6>[ae287dd0] [80023620] do_exit+0x280/0x980 <6>[ae287e10] [80024c48] > do_group_exit+0x48/0xb0 <6>[ae287e30] [80030344] > get_signal+0x244/0x4f0 <6>[ae287e80] [80007734] do_signal+0x34/0x1c0 > <6>[ae287f30] [800079a8] do_notify_resume+0x68/0x80 <6>[ae287f40] > [8000fa1c] do_user_signal+0x74/0xc4 > > > I have gone through the code and I think I have found a place where there is > a potential soft lockup. > The call chain is: > tipc_nametbl_stop() Grabs nametbl_lock > tipc_purge_publications() > tipc_nameseq_remove_publ() > tipc_subscrp_report_overlap() > tipc_subscrp_put() Calls kref_put when kref == 0 -- could have > been > put by a different CPU > tipc_subscrp_kref_release() > tipc_nametbl_unsubscribe() > << lockup occurs as it grabs the > nametbl_lock again >> > > > Another possible issue is in tipc_subscrp_report_overlap(), there are 2 early > returns after a tipc_subscrp_get() before the tipc_subscrp_put(). > Could this end up with an incorrect kref? > > JT > ------------------------------------------------------------------------------ > Check out the vibrant tech community on one of the world's most engaging > tech sites, SlashDot.org! http://sdm.link/slashdot > _______________________________________________ > tipc-discussion mailing list > tipc-discussion@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/tipc-discussion > > ------------------------------------------------------------------------------ > Check out the vibrant tech community on one of the world's most engaging > tech sites, SlashDot.org! http://sdm.link/slashdot > _______________________________________________ > tipc-discussion mailing list > tipc-discussion@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/tipc-discussion ------------------------------------------------------------------------------ Check out the vibrant tech community on one of the world's most engaging tech sites, SlashDot.org! http://sdm.link/slashdot _______________________________________________ tipc-discussion mailing list tipc-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/tipc-discussion