Hi John,

Thanks for the report and good analysis!

You are absolutely right, and I will figure out how to fix the potential 
deadlock issue. The second issue you pointed out below also really exists. Good 
catch!

Regards,
Ying

-----Original Message-----
From: John Thompson [mailto:thompa....@gmail.com] 
Sent: Thursday, February 09, 2017 6:37 AM
To: tipc-discussion@lists.sourceforge.net
Subject: [tipc-discussion] Nametable soft lockup

Hi,

I have been using the patches Partha had provided for the nametable soft 
lockup, and that I had tested.  This was seen when testing on a SMP system.

Unfortunately I have come across another nametable soft lockup:

<0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [AIS listener:1591] 
<6>Modules linked in: tipc jitterentropy_rng echainiv drbg
platform_driver(O) ipifwd(PO)
<6>CPU: 0 PID: 1591 Comm: AIS listener Tainted: P           O
<6>task: ae393600 ti: ae286000 task.ti: ae286000
<6>NIP: 806952bc LR: c160bfe0 CTR: 80695280
<6>REGS: ae287b40 TRAP: 0901   Tainted: P           O
<6>MSR: 00029002 <CE,EE,ME>  CR: 48002484  XER: 00000000 <6>
<6>GPR00: c160a64c ae287bf0 ae393600 a20f18ac 00000000 00000000 ae064fbc
00000030
<6>GPR08: 01001006 00000001 00000001 00000006 80695280 <6>NIP [806952bc] 
_raw_spin_lock_bh+0x3c/0x70 <6>LR [c160bfe0] 
tipc_nametbl_unsubscribe+0x50/0x120 [tipc] <6>Call Trace:
<6>[ae287c10] [c160a64c] tipc_named_reinit+0x33c/0x8a0 [tipc] <6>[ae287c30] 
[c160ad44] tipc_subscrp_report_overlap+0xc4/0xe0 [tipc] <6>[ae287c70] 
[c160b30c] tipc_topsrv_stop+0x45c/0x4f0 [tipc] <6>[ae287ca0] [c160b838] 
tipc_nametbl_remove_publ+0x58/0x110 [tipc] <6>[ae287cd0] [c160bcf8] 
tipc_nametbl_withdraw+0x68/0x140 [tipc] <6>[ae287d00] [c1613cd4] 
tipc_nl_node_dump_link+0x1904/0x45d0 [tipc] <6>[ae287d30] [c16148e8] 
tipc_nl_node_dump_link+0x2518/0x45d0 [tipc] <6>[ae287d70] [804f5a40] 
sock_release+0x30/0xf0 <6>[ae287d80] [804f5b14] sock_close+0x14/0x30 
<6>[ae287d90] [80105844] __fput+0x94/0x200 <6>[ae287db0] [8003dca4] 
task_work_run+0xd4/0x100 <6>[ae287dd0] [80023620] do_exit+0x280/0x980 
<6>[ae287e10] [80024c48] do_group_exit+0x48/0xb0 <6>[ae287e30] [80030344] 
get_signal+0x244/0x4f0 <6>[ae287e80] [80007734] do_signal+0x34/0x1c0 
<6>[ae287f30] [800079a8] do_notify_resume+0x68/0x80 <6>[ae287f40] [8000fa1c] 
do_user_signal+0x74/0xc4


I have gone through the code and I think I have found a place where there is a 
potential soft lockup.
The call chain is:
tipc_nametbl_stop() Grabs nametbl_lock
   tipc_purge_publications()
      tipc_nameseq_remove_publ()
         tipc_subscrp_report_overlap()
            tipc_subscrp_put() Calls kref_put when kref == 0 -- could have been 
put by a different CPU
               tipc_subscrp_kref_release()
                  tipc_nametbl_unsubscribe()
                     << lockup occurs as it grabs the
                         nametbl_lock again >>


Another possible issue is in tipc_subscrp_report_overlap(), there are 2 early 
returns after a tipc_subscrp_get() before the tipc_subscrp_put().
Could this end up with an incorrect kref?

JT
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most engaging tech 
sites, SlashDot.org! http://sdm.link/slashdot 
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion

Reply via email to