Hi Partha,

I rebuilt afresh and retried the test, with the same lockup kernel dumps.
Yes, I have multiple TIPC clients subscribed to the topology server, at
least 10 clients. They all use a subscription timeout of TIPC_WAIT_FOREVER.
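For reference, the clients subscribe roughly like this (a minimal sketch of
a topology-server subscription; the service type 4711 and the minimal error
handling are illustrative, not our real code):

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <linux/tipc.h>

    int main(void)
    {
            struct sockaddr_tipc topsrv;
            struct tipc_subscr sub;
            int sd = socket(AF_TIPC, SOCK_SEQPACKET, 0);

            /* Connect to the built-in topology server. */
            memset(&topsrv, 0, sizeof(topsrv));
            topsrv.family = AF_TIPC;
            topsrv.addrtype = TIPC_ADDR_NAME;
            topsrv.addr.name.name.type = TIPC_TOP_SRV;
            topsrv.addr.name.name.instance = TIPC_TOP_SRV;

            /* Watch one service type over its whole instance range,
             * with a subscription that never expires. */
            memset(&sub, 0, sizeof(sub));
            sub.seq.type = 4711;                /* illustrative type */
            sub.seq.lower = 0;
            sub.seq.upper = ~0;
            sub.timeout = TIPC_WAIT_FOREVER;
            sub.filter = TIPC_SUB_SERVICE;

            if (connect(sd, (struct sockaddr *)&topsrv, sizeof(topsrv)) < 0 ||
                send(sd, &sub, sizeof(sub), 0) != sizeof(sub)) {
                    perror("topology subscription");
                    return 1;
            }
            /* recv() would now block for struct tipc_event notifications. */
            return 0;
    }

Since the subscriptions never time out, their state lives in the topology
server until the connection goes down, i.e. exactly the cleanup paths that
show up in the lockup traces.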
I will try the kernel command line parameter next week.

JT

On Fri, Nov 25, 2016 at 3:07 AM, Parthasarathy Bhuvaragan
<parthasarathy.bhuvara...@ericsson.com> wrote:
> Hi John,
>
> Do you have several tipc clients subscribed to the topology server?
> What subscription timeout do they use?
>
> Please enable the kernel command line parameter:
> softlockup_all_cpu_backtrace=1
>
> /Partha
>
> On 11/23/2016 11:04 PM, John Thompson wrote:
>> Hi Partha,
>>
>> I tested overnight with the 2 patches you provided yesterday.
>> Testing is still showing problems; here is one of the soft lockups, the
>> other is the same as I sent the other day.
>> I am going to redo my build, as I expected some change in behaviour
>> with your patches.
>>
>> It is possible that I am doing some dumping of nodes or links, as I am
>> not certain of all the code or paths. I have found that we do a
>> tipc-config -nt and a tipc-config -ls in some situations, but neither
>> should be initiated in this reboot case.
>>
>> The CPU#0 and CPU#2 backtraces were interleaved on the console; they
>> are separated below for readability.
>>
>> <0>NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [pimd:1220]
>> <0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [rpc.13:2419]
>> <0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [AIS listener:1600]
>>
>> CPU#0 (rpc.13):
>> <6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
>> <6>CPU: 0 PID: 2419 Comm: rpc.13 Tainted: P O
>> <6>task: aed76d20 ti: ae70c000 task.ti: ae70c000
>> <6>NIP: 8069257c LR: c13ebc4c CTR: 80692540
>> <6>REGS: ae70dc20 TRAP: 0901 Tainted: P O
>> <6>MSR: 00029002 <CE,EE,ME> CR: 42002484 XER: 20000000
>> <6>GPR00: c13f3c34 ae70dcd0 aed76d20 ae55c8ec 00002711 00000005 8666592a 8666592b
>> <6>GPR08: ae9dad20 00000001 00000001 00000000 80692540
>> <6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
>> <6>LR [c13ebc4c] tipc_nametbl_withdraw+0x4c/0x140 [tipc]
>> <6>Call Trace:
>> <6>[ae70dcd0] [a85d99a0] 0xa85d99a0 (unreliable)
>> <6>[ae70dd00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>> <6>[ae70dd30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>> <6>[ae70dd70] [804f29e0] sock_release+0x30/0xf0
>> <6>[ae70dd80] [804f2ab4] sock_close+0x14/0x30
>> <6>[ae70dd90] [80105844] __fput+0x94/0x200
>> <6>[ae70ddb0] [8003dca4] task_work_run+0xd4/0x100
>> <6>[ae70ddd0] [80023620] do_exit+0x280/0x980
>> <6>[ae70de10] [80024c48] do_group_exit+0x48/0xb0
>> <6>[ae70de30] [80030344] get_signal+0x244/0x4f0
>> <6>[ae70de80] [80007734] do_signal+0x34/0x1c0
>> <6>[ae70df30] [800079a8] do_notify_resume+0x68/0x80
>> <6>[ae70df40] [8000fa1c] do_user_signal+0x74/0xc4
>> <6>--- interrupt: c00 at 0xfe490c4
>> <6>    LR = 0xfe490a0
>> <6>Instruction dump:
>> <6>912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0 7c2004ac
>> <6>2f890000 4dbe0020 7c210b78 81230000 <2f890000> 40befff4 7c421378 7d201828
>>
>> CPU#2 (AIS listener):
>> <6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
>> <6>CPU: 2 PID: 1600 Comm: AIS listener Tainted: P O
>> <6>task: aee3ced0 ti: ae686000 task.ti: ae686000
>> <6>NIP: 80692578 LR: c13ebf50 CTR: 80692540
>> <6>REGS: ae687ad0 TRAP: 0901 Tainted: P O
>> <6>MSR: 00029002 <CE,EE,ME> CR: 48002444 XER: 00000000
>> <6>GPR00: c13ea408 ae687b80 aee3ced0 ae55c8ec 00000000 a30e7264 ae5e070c fffffffd
>> <6>GPR08: ae72fbc8 00000001 00000001 00000004 80692540
>> <6>NIP [80692578] _raw_spin_lock_bh+0x38/0x70
>> <6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>> <6>Call Trace:
>> <6>[ae687b80] [800fa258] check_object+0xc8/0x270 (unreliable)
>> <6>[ae687ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
>> <6>[ae687bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
>> <6>[ae687bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>> <6>[ae687c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>> <6>[ae687c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>> <6>[ae687c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>> <6>[ae687ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
>> <6>[ae687cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
>> <6>[ae687d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>> <6>[ae687d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>> <6>[ae687d70] [804f29e0] sock_release+0x30/0xf0
>> <6>[ae687d80] [804f2ab4] sock_close+0x14/0x30
>> <6>[ae687d90] [80105844] __fput+0x94/0x200
>> <6>[ae687db0] [8003dca4] task_work_run+0xd4/0x100
>> <6>[ae687dd0] [80023620] do_exit+0x280/0x980
>> <6>[ae687e10] [80024c48] do_group_exit+0x48/0xb0
>> <6>[ae687e30] [80030344] get_signal+0x244/0x4f0
>> <6>[ae687e80] [80007734] do_signal+0x34/0x1c0
>> <6>[ae687f30] [800079a8] do_notify_resume+0x68/0x80
>> <6>[ae687f40] [8000fa1c] do_user_signal+0x74/0xc4
>> <6>--- interrupt: c00 at 0xf4f3d08
>> <6>    LR = 0xf4f3ce8
>> <6>Instruction dump:
>> <6>39290200 912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0
>> <6>7c2004ac 2f890000 4dbe0020 7c210b78 <81230000> 2f890000 40befff4 7c421378
>>
>> <0>Kernel panic - not syncing: softlockup: hung tasks
>> <6>CPU: 0 PID: 2419 Comm: rpc.13 Tainted: P O L
>> <6>Call Trace:
>> <6>[ae70da80] [80694e20] dump_stack+0x84/0xb0 (unreliable)
>> <6>[ae70da90] [80692ca8] panic+0xd8/0x214
>> <6>[ae70daf0] [800a0258] watchdog_timer_fn+0x2d8/0x2e0
>> <6>[ae70db40] [8007ae58] __hrtimer_run_queues+0x118/0x1d0
>> <6>[ae70db80] [8007b608] hrtimer_interrupt+0xd8/0x270
>> <6>[ae70dbd0] [8000983c] __timer_interrupt+0xac/0x1b0
>> <6>[ae70dbf0] [80009b70] timer_interrupt+0xb0/0xe0
>> <6>[ae70dc10] [8000f450] ret_from_except+0x0/0x18
>> <6>--- interrupt: 901 at _raw_spin_lock_bh+0x3c/0x70
>> <6>    LR = tipc_nametbl_withdraw+0x4c/0x140 [tipc]
>> <6>[ae70dcd0] [a85d99a0] 0xa85d99a0 (unreliable)
>> <6>[ae70dd00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>> <6>[ae70dd30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>> <6>[ae70dd70] [804f29e0] sock_release+0x30/0xf0
>> <6>[ae70dd80] [804f2ab4] sock_close+0x14/0x30
>> <6>[ae70dd90] [80105844] __fput+0x94/0x200
>> <6>[ae70ddb0] [8003dca4] task_work_run+0xd4/0x100
>> <6>[ae70ddd0] [80023620] do_exit+0x280/0x980
>> <6>[ae70de10] [80024c48] do_group_exit+0x48/0xb0
>> <6>[ae70de30] [80030344] get_signal+0x244/0x4f0
>> <6>[ae70de80] [80007734] do_signal+0x34/0x1c0
>> <6>[ae70df30] [800079a8] do_notify_resume+0x68/0x80
>> <6>[ae70df40] [8000fa1c] do_user_signal+0x74/0xc4
>> <6>--- interrupt: c00 at 0xfe490c4
>> <6>    LR = 0xfe490a0
>>
>> On Wed, Nov 23, 2016 at 9:16 PM, Parthasarathy Bhuvaragan
>> <parthasarathy.bhuvara...@ericsson.com> wrote:
>>
>> Hi John,
>>
>> Ok. Can you test with the 2 patches I posted regarding the nametbl_lock?
>>
>> /Partha
>>
>> On 11/21/2016 11:32 PM, John Thompson wrote:
>>
>> Hi Partha,
>>
>> My test has 4 nodes, 2 of which are alternately rebooting. When the
>> rebooted node rejoins, a few minutes pass and then the other node is
>> rebooted.
>> I am not printing out link stats, and I believe that the other code is
>> not doing so either when nodes leave or rejoin.
>>
>> JT
>>
>> On Tue, Nov 22, 2016 at 2:22 AM, Parthasarathy Bhuvaragan
>> <parthasarathy.bhuvara...@ericsson.com> wrote:
>>
>> Hi,
>>
>> There is another branch where the softlockup on nametbl_lock occurs:
>>
>> tipc_named_rcv() Grabs nametbl_lock
>>   tipc_update_nametbl() (publish/withdraw)
>>     tipc_node_subscribe()/unsubscribe()
>>       tipc_node_write_unlock()
>>       << lockup occurs if it needs to process a NODE UP/DOWN or
>>          LINK UP/DOWN event, as it grabs nametbl_lock again
>>
>> /Partha
>>
>> On 11/21/2016 01:04 PM, Parthasarathy Bhuvaragan wrote:
>>
>> Hi,
>>
>> tipc_nametbl_withdraw() triggers the softlockup as it tries to grab
>> nametbl_lock twice if the node triggered a TIPC_NOTIFY_LINK_DOWN
>> event while it is running. The erroneous call chain is:
>>
>> tipc_nametbl_withdraw() Grab nametbl_lock
>>   tipc_named_process_backlog()
>>     tipc_update_nametbl()
>>       if (dtype == WITHDRAWAL) tipc_node_unsubscribe()
>>         tipc_node_write_unlock()
>>           if (flags & TIPC_NOTIFY_LINK_DOWN)
>>             tipc_nametbl_withdraw()
>>               spin_lock_bh(&tn->nametbl_lock); << Soft Lockup
>>
>> Three callers can cause this under module exit:
>>
>> Case 1:
>> tipc_exit_net()
>>   tipc_nametbl_withdraw() Grab nametbl_lock
>>
>> Case 2:
>> tipc_server_stop()
>>   tipc_conn_kref_release()
>>     tipc_sock_release()
>>       sock_release()
>>         tipc_release()
>>           tipc_sk_withdraw()
>>             tipc_nametbl_withdraw()
>>
>> Case 3:
>> tipc_server_stop()
>>   tipc_conn_kref_release()
>>     kernel_bind()
>>       tipc_bind()
>>         tipc_sk_withdraw()
>>           tipc_nametbl_withdraw()
>>
>> I will work on a solution for this.
>>
>> What kind of test were you performing when this occurred (link
>> up/down)? Do you read link statistics periodically in your tests?
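>> Reduced to its essence, this is a non-recursive spinlock taken twice
>> on the same CPU. A minimal sketch (not the actual TIPC code; both
>> functions are stand-ins for the chains above):
>>
>> #include <linux/spinlock.h>
>>
>> static DEFINE_SPINLOCK(nametbl_lock);  /* stand-in for tn->nametbl_lock */
>>
>> /* Stand-in for the TIPC_NOTIFY_LINK_DOWN handling reached from
>>  * tipc_node_write_unlock(). */
>> static void notify_link_down(void)
>> {
>>         spin_lock_bh(&nametbl_lock);   /* << spins forever, the lock is
>>                                           already held by this CPU */
>>         /* ... withdraw the link publications ... */
>>         spin_unlock_bh(&nametbl_lock);
>> }
>>
>> /* Stand-in for tipc_nametbl_withdraw() and its callees. */
>> static void withdraw(void)
>> {
>>         spin_lock_bh(&nametbl_lock);
>>         /* ... name table update raises TIPC_NOTIFY_LINK_DOWN ... */
>>         notify_link_down();            /* re-enters before we unlock */
>>         spin_unlock_bh(&nametbl_lock); /* never reached */
>> }
>>
>> The watchdog then reports the spinning CPU after ~20s, which matches
>> the _raw_spin_lock_bh() NIP in all of your traces.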
>>
>> /Partha
>>
>> On 11/21/2016 05:30 AM, John Thompson wrote:
>>
>> Hi Partha,
>>
>> I was doing some more testing today and have still observed the
>> problem (contrary to what I had emailed earlier).
>>
>> Here is the kernel dump.
>>
>> <0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [Pluggable Serve:2221]
>> <6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
>> <6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P O
>> <6>task: ae54ced0 ti: aec42000 task.ti: aec42000
>> <6>NIP: 8069257c LR: c13ebf50 CTR: 80692540
>> <6>REGS: aec43ad0 TRAP: 0901 Tainted: P O
>> <6>MSR: 00029002 <CE,EE,ME> CR: 48002444 XER: 00000000
>> <6>GPR00: c13ea408 aec43b80 ae54ced0 a624690c 00000000 a6271d84 a39a60cc fffffffd
>> <6>GPR08: aeefbbc8 00000001 00000001 00000004 80692540
>> <6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
>> <6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>> <6>Call Trace:
>> <6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
>> <6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
>> <6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
>> <6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>> <6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>> <6>[aec43c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>> <6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>> <6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
>> <6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
>> <6>[aec43d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>> <6>[aec43d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>>
>> (The tipc_nl_node_dump_link frames above correspond to
>> TIPC_CMD_SHOW_LINK_STATS or TIPC_NL_LINK_GET.)
>>
>> <6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
>> <6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
>> <6>[aec43d90] [80105844] __fput+0x94/0x200
>> <6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
>> <6>[aec43dd0] [80023620] do_exit+0x280/0x980
>> <6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
>> <6>[aec43e30] [80030344] get_signal+0x244/0x4f0
>> <6>[aec43e80] [80007734] do_signal+0x34/0x1c0
>> <6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
>> <6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
>> <6>--- interrupt: c00 at 0xf4f3d08
>> <6>    LR = 0xf4f3ce8
>> <6>Instruction dump:
>> <6>912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0 7c2004ac
>> <6>2f890000 4dbe0020 7c210b78 81230000 <2f890000> 40befff4 7c421378 7d201828
>> <0>Kernel panic - not syncing: softlockup: hung tasks
>> <6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P O L
>> <6>Call Trace:
>> <6>[aec43930] [80694e20] dump_stack+0x84/0xb0 (unreliable)
>> <6>[aec43940] [80692ca8] panic+0xd8/0x214
>> <6>[aec439a0] [800a0258] watchdog_timer_fn+0x2d8/0x2e0
>> <6>[aec439f0] [8007ae58] __hrtimer_run_queues+0x118/0x1d0
>> <6>[aec43a30] [8007b608] hrtimer_interrupt+0xd8/0x270
>> <6>[aec43a80] [8000983c] __timer_interrupt+0xac/0x1b0
>> <6>[aec43aa0] [80009b70] timer_interrupt+0xb0/0xe0
>> <6>[aec43ac0] [8000f450] ret_from_except+0x0/0x18
>> <6>--- interrupt: 901 at _raw_spin_lock_bh+0x3c/0x70
>> <6>    LR = tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>> <6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
>> <6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
>> <6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
>> <6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>> <6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>> <6>[aec43c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>> <6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>> <6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
>> <6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
>> <6>[aec43d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>> <6>[aec43d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>> <6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
>> <6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
>> <6>[aec43d90] [80105844] __fput+0x94/0x200
>> <6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
>> <6>[aec43dd0] [80023620] do_exit+0x280/0x980
>> <6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
>> <6>[aec43e30] [80030344] get_signal+0x244/0x4f0
>> <6>[aec43e80] [80007734] do_signal+0x34/0x1c0
>> <6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
>> <6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
>> <6>--- interrupt: c00 at 0xf4f3d08
>> <6>    LR = 0xf4f3ce8
>>
>> On Mon, Nov 21, 2016 at 9:59 AM, John Thompson <thompa....@gmail.com> wrote:
>>
>> Hi Partha,
>>
>> In my testing over the weekend the patch performed well - I didn't
>> see any kernel dumps due to this issue.
>>
>> Thanks for the quick response.
>> JT
>>
>> On Fri, Nov 18, 2016 at 10:34 AM, John Thompson <thompa....@gmail.com> wrote:
>>
>> Hi,
>>
>> I will be able to have some test results by the start of next week
>> on the first patch.
>>
>> Regards,
>> JT
>>
>> On Thu, Nov 17, 2016 at 11:27 PM, Ying Xue <ying....@windriver.com> wrote:
>>
>> On 11/17/2016 07:04 AM, John Thompson wrote:
>>
>> Hi Partha / Ying,
>>
>> I will try out the patch and let you know how it goes.
>> I also note your point about providing the other CPU core dumps: in
>> one of my cases I didn't have them; in others I did, but they were
>> interleaved and so were difficult to interpret.
>>
>> Thanks, it's unnecessary for us to collect more logs, as this soft
>> lockup scenario should be just what Partha described.
>>
>> Regards,
>> Ying
>>
>> Thanks for getting a patch together so quickly.
>>
>> JT
>>
>> On Wed, Nov 16, 2016 at 10:23 PM, Parthasarathy Bhuvaragan
>> <parthasarathy.bhuvara...@ericsson.com> wrote:
>>
>> Hi Ying / John,
>>
>> The soft lockup is in the call chain of tipc_nametbl_withdraw(): when
>> it performs tipc_conn_kref_release(), it tries to grab nametbl_lock
>> again while already holding it.
>>
>> tipc_nametbl_withdraw()
>>   spin_lock_bh(&tn->nametbl_lock);
>>   tipc_nametbl_remove_publ()
>>     spin_lock_bh(&seq->lock);
>>     tipc_nameseq_remove_publ()
>>       tipc_subscrp_report_overlap()
>>         tipc_subscrp_send_event()
>>           tipc_conn_sendmsg()
>>           << Here the test_bit(CF_CONNECTED, &con->flags) fails,
>>              leading to the else case where we do a conn_put(), and
>>              that triggers the cleanup as the refcount reaches 0,
>>              leading to the call chain below:
>>             tipc_conn_kref_release()
>>               tipc_sock_release()
>>                 tipc_conn_release()
>>                   tipc_subscrb_delete()
>>                     tipc_subscrp_delete()
>>                       tipc_nametbl_unsubscribe()
>>                         spin_lock_bh(&tn->nametbl_lock); << !! Soft lockup
>>
>> One cause is that tipc_exit_net() first calls tipc_topsrv_stop() and
>> then tipc_nametbl_withdraw() in the scope of tipc_net_stop().
>>
>> The above chain will only occur in a narrow window for a given
>> connection:
>> CPU#1: tipc_nametbl_withdraw() manages to perform tipc_conn_lookup()
>> and steps the refcount to 2, while on CPU#2 the following occurs:
>> CPU#2: tipc_server_stop() calls tipc_close_conn(con). This performs
>> a conn_put(), decrementing the refcount to 1.
>> Now CPU#1 continues, detects that the connection is not CF_CONNECTED,
>> and does a conn_put(), triggering the release callback.
>>
>> Before commit 333f796235a527, the above won't happen.
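>> In miniature, the hazard looks like this (a sketch, not the actual
>> code; struct tipc_conn is reduced to its refcount):
>>
>> #include <linux/kref.h>
>> #include <linux/spinlock.h>
>>
>> struct tipc_conn { struct kref kref; };
>> static DEFINE_SPINLOCK(nametbl_lock);
>>
>> /* kref_put() invokes the release callback synchronously, in whatever
>>  * context dropped the last reference - here, under nametbl_lock. */
>> static void conn_kref_release(struct kref *kref)
>> {
>>         /* stand-in for tipc_sock_release() -> ... ->
>>          * tipc_nametbl_unsubscribe() */
>>         spin_lock_bh(&nametbl_lock);   /* << !! soft lockup */
>>         spin_unlock_bh(&nametbl_lock);
>> }
>>
>> static void send_event(struct tipc_conn *con)
>> {
>>         spin_lock_bh(&nametbl_lock);
>>         /* CF_CONNECTED was cleared by tipc_close_conn(), so this
>>          * drops what turns out to be the last reference: */
>>         kref_put(&con->kref, conn_kref_release);
>>         spin_unlock_bh(&nametbl_lock);
>> }
>>
>> So the fix presumably has to either avoid dropping the last reference
>> while nametbl_lock is held, or defer the release work out of that
>> context.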
>>
>> /Partha
>>
>> On 11/15/2016 04:11 PM, Xue, Ying wrote:
>>
>> Hi John,
>>
>> Regarding the stack trace you provided below, I get the two potential
>> call chains:
>>
>> tipc_nametbl_withdraw()
>>   spin_lock_bh(&tn->nametbl_lock);
>>   tipc_nametbl_remove_publ()
>>     spin_lock_bh(&seq->lock);
>>     tipc_nameseq_remove_publ()
>>       tipc_subscrp_report_overlap()
>>         tipc_subscrp_send_event()
>>           tipc_conn_sendmsg()
>>             spin_lock_bh(&con->outqueue_lock);
>>             list_add_tail(&e->list, &con->outqueue);
>>
>> tipc_topsrv_stop()
>>   tipc_server_stop()
>>     tipc_close_conn()
>>       kernel_sock_shutdown()
>>       tipc_subscrb_delete()
>>         spin_lock_bh(&subscriber->lock);
>>         tipc_nametbl_unsubscribe(sub);
>>           spin_lock_bh(&tn->nametbl_lock);
>>
>> Although I suspect this is a lock inversion issue leading to the soft
>> lockup, I am still unable to identify which lock, together with
>> nametbl_lock, is taken in reverse order on the two different paths
>> above. However, you just gave us the log printed on CPU#2; the logs
>> output by the other cores are also important, so if possible please
>> share them with us.
>>
>> By the way, I agree with you: it seems that commit 333f796235a527 is
>> related to the soft lockup.
>>
>> Regards,
>> Ying
>>
>> -----Original Message-----
>> From: John Thompson [mailto:thompa....@gmail.com]
>> Sent: Tuesday, November 15, 2016 8:01 AM
>> To: tipc-discussion@lists.sourceforge.net
>> Subject: [tipc-discussion] v4.7: soft lockup when releasing a socket
>>
>> Hi,
>>
>> I am seeing an occasional kernel soft lockup. I have TIPC v4.7 and
>> the kernel dump occurs when the system is going down for a reboot.
>>
>> The kernel dump is:
>>
>> <0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [exfx:1474]
>> <6>Modules linked in: tipc jitterentropy_rng echainiv drbg
>> platform_driver(O) ipifwd(PO)
>> ...
>> <6>GPR00: c15333e8 a4e0fb80 a4ee3600 a51748ac 00000000 ae475024 a537feec fffffffd
>> <6>GPR08: a2197408 00000001 00000001 00000004 80691c00
>> <6>NIP [80691c40] _raw_spin_lock_bh+0x40/0x70
>> <6>LR [c1534f30] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>> <6>Call Trace:
>> <6>[a4e0fba0] [c15333e8] tipc_named_reinit+0xf8/0x820 [tipc]
>> <6>[a4e0fbb0] [c15336a0] tipc_named_reinit+0x3b0/0x820 [tipc]
>> <6>[a4e0fbd0] [c1540bac] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>> <6>[a4e0fc00] [c154164c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>> <6>[a4e0fc30] [c1533c9c] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>> <6>[a4e0fc70] [c153425c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>> <6>[a4e0fca0] [c1534788] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
>> <6>[a4e0fcd0] [c1534c48] tipc_nametbl_withdraw+0x68/0x140 [tipc]
>> <6>[a4e0fd00] [c153cc24] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>> <6>[a4e0fd30] [c153d838] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>> <6>[a4e0fd70] [804f2870] sock_release+0x30/0xf0
>> <6>[a4e0fd80] [804f2944] sock_close+0x14/0x30
>> <6>[a4e0fd90] [80105844] __fput+0x94/0x200
>> <6>[a4e0fdb0] [8003dca4] task_work_run+0xd4/0x100
>> <6>[a4e0fdd0] [80023620] do_exit+0x280/0x980
>> <6>[a4e0fe10] [80024c48] do_group_exit+0x48/0xb0
>> <6>[a4e0fe30] [80030344] get_signal+0x244/0x4f0
>> <6>[a4e0fe80] [80007734] do_signal+0x34/0x1c0
>> <6>[a4e0ff30] [800079a8] do_notify_resume+0x68/0x80
>> <6>[a4e0ff40] [8000fa1c] do_user_signal+0x74/0xc4
>>
>> From the stack dump it looks like tipc_named_reinit is trying to
>> acquire nametbl_lock.
>>
>> From looking at the call chain I can see that tipc_conn_sendmsg can
>> end up calling conn_put, which will go on and call tipc_named_reinit
>> via tipc_sock_release.
>>
>> As tipc_nametbl_withdraw (from the stack dump) has already acquired
>> the nametbl_lock, tipc_named_reinit cannot get it and so the process
>> hangs.
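>> As far as I can tell, the put itself is just a kref_put with a
>> synchronous release callback, roughly (sketched from memory, not an
>> exact copy of net/tipc/server.c):
>>
>> static void conn_put(struct tipc_conn *con)
>> {
>>         kref_put(&con->kref, tipc_conn_kref_release);
>> }
>>
>> So when tipc_conn_sendmsg finds the connection no longer CF_CONNECTED
>> and does the put, the whole tipc_sock_release /
>> tipc_nametbl_unsubscribe cleanup runs immediately in this context,
>> under the nametbl_lock that tipc_nametbl_withdraw is still holding.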
>>
>> The call to tipc_sock_release (added in commit 333f796235a527
>> <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux-stable.git/commit/?id=333f796235a52727db7e0a13888045f3aa3d5335>)
>> seems to have changed the behaviour such that it tries to do a lot
>> more when shutting the connection down.
>>
>> If there is other information I can provide please let me know.
>>
>> Regards,
>>
>> John
------------------------------------------------------------------------------
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion