Hi John, Thanks for the report and good analysis!

You are absolutely right, and I will figure out how to fix the potential deadlock issue. The second issue you pointed out below also really exists. Good catch!

Regards,
Ying

-----Original Message-----
From: John Thompson [mailto:thompa....@gmail.com]
Sent: Thursday, February 09, 2017 6:37 AM
To: tipc-discussion@lists.sourceforge.net
Subject: [tipc-discussion] Nametable soft lockup

Hi,

I have been using the patches Partha had provided for the nametable soft lockup, and that I had tested. This was seen when testing on a SMP system.

Unfortunately I have come across another nametable soft lockup:

<0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [AIS listener:1591]
<6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O) ipifwd(PO)
<6>CPU: 0 PID: 1591 Comm: AIS listener Tainted: P O
<6>task: ae393600 ti: ae286000 task.ti: ae286000
<6>NIP: 806952bc LR: c160bfe0 CTR: 80695280
<6>REGS: ae287b40 TRAP: 0901 Tainted: P O
<6>MSR: 00029002 <CE,EE,ME> CR: 48002484 XER: 00000000
<6>
<6>GPR00: c160a64c ae287bf0 ae393600 a20f18ac 00000000 00000000 ae064fbc 00000030
<6>GPR08: 01001006 00000001 00000001 00000006 80695280
<6>NIP [806952bc] _raw_spin_lock_bh+0x3c/0x70
<6>LR [c160bfe0] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
<6>Call Trace:
<6>[ae287c10] [c160a64c] tipc_named_reinit+0x33c/0x8a0 [tipc]
<6>[ae287c30] [c160ad44] tipc_subscrp_report_overlap+0xc4/0xe0 [tipc]
<6>[ae287c70] [c160b30c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
<6>[ae287ca0] [c160b838] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
<6>[ae287cd0] [c160bcf8] tipc_nametbl_withdraw+0x68/0x140 [tipc]
<6>[ae287d00] [c1613cd4] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[ae287d30] [c16148e8] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[ae287d70] [804f5a40] sock_release+0x30/0xf0
<6>[ae287d80] [804f5b14] sock_close+0x14/0x30
<6>[ae287d90] [80105844] __fput+0x94/0x200
<6>[ae287db0] [8003dca4] task_work_run+0xd4/0x100
<6>[ae287dd0] [80023620] do_exit+0x280/0x980
<6>[ae287e10] [80024c48] do_group_exit+0x48/0xb0
<6>[ae287e30] [80030344] get_signal+0x244/0x4f0
<6>[ae287e80] [80007734] do_signal+0x34/0x1c0
<6>[ae287f30] [800079a8] do_notify_resume+0x68/0x80
<6>[ae287f40] [8000fa1c] do_user_signal+0x74/0xc4

I have gone through the code and I think I have found a place where there is a potential soft lockup.

The call chain is:

tipc_nametbl_stop()    Grabs nametbl_lock
  tipc_purge_publications()
    tipc_nameseq_remove_publ()
      tipc_subscrp_report_overlap()
        tipc_subscrp_put()    Calls kref_put when kref == 0 -- could have been put by a different CPU
          tipc_subscrp_kref_release()
            tipc_nametbl_unsubscribe()    << lockup occurs as it grabs the nametbl_lock again >>

Another possible issue is in tipc_subscrp_report_overlap(), there are 2 early returns after a tipc_subscrp_get() before the tipc_subscrp_put(). Could this end up with an incorrect kref?

JT

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
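
For illustration only, the locking pattern in the call chain above (a final reference drop whose release path takes a lock the caller already holds) can be sketched in user space. This is not TIPC code: `struct table`, `struct sub`, `sub_put()`, `sub_release()` and both `purge_*` functions are hypothetical stand-ins, a plain int replaces the kref, and an error-checking pthread mutex replaces the spinlock so that the second lock attempt returns EDEADLK rather than spinning. The second function shows one possible restructuring (drop the lock before the final put); whether the actual fix takes that form is not stated in this thread.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-ins; none of these names come from the TIPC sources. */
struct table {
    pthread_mutex_t lock;  /* plays the role of nametbl_lock */
    int nsubs;             /* number of registered subscriptions */
};

struct sub {
    struct table *tbl;
    int refs;              /* plays the role of the kref (not atomic: single-threaded demo) */
    int relock_err;        /* result of the lock attempt made in the release path */
};

static void table_init(struct table *t)
{
    pthread_mutexattr_t attr;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    /* ERRORCHECK: a second lock by the owning thread fails with EDEADLK
     * instead of blocking forever, which makes the problem observable. */
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&t->lock, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);
    t->nsubs = 0;
}

/* Release path: like the unsubscribe step in the chain, it takes the table lock itself. */
static void sub_release(struct sub *s)
{
    s->relock_err = pthread_mutex_lock(&s->tbl->lock);
    if (s->relock_err == 0) {
        s->tbl->nsubs--;
        pthread_mutex_unlock(&s->tbl->lock);
    }
}

/* Drop one reference; the release path runs when the count reaches zero. */
static void sub_put(struct sub *s)
{
    if (--s->refs == 0)
        sub_release(s);
}

/* Problem pattern: the final put happens while the caller holds the table lock. */
static int purge_put_under_lock(struct table *t, struct sub *s)
{
    pthread_mutex_lock(&t->lock);
    sub_put(s);                      /* may be the last reference */
    pthread_mutex_unlock(&t->lock);
    return s->relock_err;            /* EDEADLK if the release path ran */
}

/* One possible restructuring: finish the locked work, unlock, then put. */
static int purge_put_after_unlock(struct table *t, struct sub *s)
{
    pthread_mutex_lock(&t->lock);
    /* ... work that really needs the lock would go here ... */
    pthread_mutex_unlock(&t->lock);
    sub_put(s);
    return s->relock_err;            /* 0: the release path locked successfully */
}
```

With a table holding two subscriptions that each have a single remaining reference, `purge_put_under_lock()` reports EDEADLK and leaves `nsubs` unchanged, while `purge_put_after_unlock()` reports 0 and decrements `nsubs`. A kernel spinlock has no such error check, so the first pattern would spin, which is consistent with the soft lockup reported in `_raw_spin_lock_bh`.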