Hi Ying,
Do you have any comments on v2 of the 3 patches that fix this issue?

/Partha

________________________________
From: John Thompson <thompa....@gmail.com>
Sent: Monday, November 28, 2016 8:21:57 PM
To: Parthasarathy Bhuvaragan
Cc: Ying Xue; tipc-discussion@lists.sourceforge.net
Subject: Re: [tipc-discussion] v4.7: soft lockup when releasing a socket

Hi Partha,

I tested with the latest 3 patches last night and observed no soft lockups.

Thanks,
John

On Fri, Nov 25, 2016 at 11:50 AM, John Thompson <thompa....@gmail.com> wrote:

Hi Partha,

I rebuilt afresh and retried the test, and got the same lockup kernel dumps.

Yes, I have multiple TIPC clients subscribed to the topology server, at least 10 clients. They all use a subscription timeout of TIPC_WAIT_FOREVER.

I will try the kernel command line parameter next week.

JT

On Fri, Nov 25, 2016 at 3:07 AM, Parthasarathy Bhuvaragan <parthasarathy.bhuvara...@ericsson.com> wrote:

Hi John,

Do you have several TIPC clients subscribed to the topology server? What subscription timeout do they use?

Please enable the kernel command line parameter: softlockup_all_cpu_backtrace=1

/Partha

On 11/23/2016 11:04 PM, John Thompson wrote:

Hi Partha,

I tested overnight with the 2 patches you provided yesterday. Testing is still showing problems; here is one of the soft lockups, the other is the same as I sent the other day.

I am going to redo my build, as I expected some change in behaviour with your patches. It is possible that I am doing some dumping of nodes or links, as I am not certain of all the code paths. I have found that we do a tipc-config -nt and tipc-config -ls in some situations, but that shouldn't be initiated in this reboot case.

The CPU#0 and CPU#2 traces below were interleaved on the console; they are shown separated by CPU here.
<0>NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [pimd:1220]
<0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [rpc.13:2419]
<0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [AIS listener:1600]

CPU#0 trace:
<6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
<6>CPU: 0 PID: 2419 Comm: rpc.13 Tainted: P O
<6>task: aed76d20 ti: ae70c000 task.ti: ae70c000
<6>NIP: 8069257c LR: c13ebc4c CTR: 80692540
<6>REGS: ae70dc20 TRAP: 0901 Tainted: P O
<6>MSR: 00029002 <CE,EE,ME>  CR: 42002484  XER: 20000000
<6>GPR00: c13f3c34 ae70dcd0 aed76d20 ae55c8ec 00002711 00000005 8666592a 8666592b
<6>GPR08: ae9dad20 00000001 00000001 00000000 80692540
<6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
<6>LR [c13ebc4c] tipc_nametbl_withdraw+0x4c/0x140 [tipc]
<6>Call Trace:
<6>[ae70dcd0] [a85d99a0] 0xa85d99a0 (unreliable)
<6>[ae70dd00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[ae70dd30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[ae70dd70] [804f29e0] sock_release+0x30/0xf0
<6>[ae70dd80] [804f2ab4] sock_close+0x14/0x30
<6>[ae70dd90] [80105844] __fput+0x94/0x200
<6>[ae70ddb0] [8003dca4] task_work_run+0xd4/0x100
<6>[ae70ddd0] [80023620] do_exit+0x280/0x980
<6>[ae70de10] [80024c48] do_group_exit+0x48/0xb0
<6>[ae70de30] [80030344] get_signal+0x244/0x4f0
<6>[ae70de80] [80007734] do_signal+0x34/0x1c0
<6>[ae70df30] [800079a8] do_notify_resume+0x68/0x80
<6>[ae70df40] [8000fa1c] do_user_signal+0x74/0xc4
<6>--- interrupt: c00 at 0xfe490c4
<6>    LR = 0xfe490a0
<6>Instruction dump:
<6>912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0 7c2004ac
<6>2f890000 4dbe0020 7c210b78 81230000 <2f890000> 40befff4 7c421378 7d201828

CPU#2 trace:
<6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
<6>CPU: 2 PID: 1600 Comm: AIS listener Tainted: P O
<6>task: aee3ced0 ti: ae686000 task.ti: ae686000
<6>NIP: 80692578 LR: c13ebf50 CTR: 80692540
<6>REGS: ae687ad0 TRAP: 0901 Tainted: P O
<6>MSR: 00029002 <CE,EE,ME>  CR: 48002444  XER: 00000000
<6>GPR00: c13ea408 ae687b80 aee3ced0 ae55c8ec 00000000 a30e7264 ae5e070c fffffffd
<6>GPR08: ae72fbc8 00000001 00000001 00000004 80692540
<6>NIP [80692578] _raw_spin_lock_bh+0x38/0x70
<6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
<6>Call Trace:
<6>[ae687b80] [800fa258] check_object+0xc8/0x270 (unreliable)
<6>[ae687ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
<6>[ae687bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
<6>[ae687bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
<6>[ae687c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
<6>[ae687c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
<6>[ae687c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
<6>[ae687ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
<6>[ae687cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
<6>[ae687d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[ae687d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[ae687d70] [804f29e0] sock_release+0x30/0xf0
<6>[ae687d80] [804f2ab4] sock_close+0x14/0x30
<6>[ae687d90] [80105844] __fput+0x94/0x200
<6>[ae687db0] [8003dca4] task_work_run+0xd4/0x100
<6>[ae687dd0] [80023620] do_exit+0x280/0x980
<6>[ae687e10] [80024c48] do_group_exit+0x48/0xb0
<6>[ae687e30] [80030344] get_signal+0x244/0x4f0
<6>[ae687e80] [80007734] do_signal+0x34/0x1c0
<6>[ae687f30] [800079a8] do_notify_resume+0x68/0x80
<6>[ae687f40] [8000fa1c] do_user_signal+0x74/0xc4
<6>--- interrupt: c00 at 0xf4f3d08
<6>    LR = 0xf4f3ce8
<6>Instruction dump:
<6>39290200 912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0
<6>7c2004ac 2f890000 4dbe0020 7c210b78 <81230000> 2f890000 40befff4 7c421378

Panic on CPU#0:
<0>Kernel panic - not syncing: softlockup: hung tasks
<6>CPU: 0 PID: 2419 Comm: rpc.13 Tainted: P O L
<6>Call Trace:
<6>[ae70da80] [80694e20] dump_stack+0x84/0xb0 (unreliable)
<6>[ae70da90] [80692ca8] panic+0xd8/0x214
<6>[ae70daf0] [800a0258] watchdog_timer_fn+0x2d8/0x2e0
<6>[ae70db40] [8007ae58] __hrtimer_run_queues+0x118/0x1d0
<6>[ae70db80] [8007b608] hrtimer_interrupt+0xd8/0x270
<6>[ae70dbd0] [8000983c] __timer_interrupt+0xac/0x1b0
<6>[ae70dbf0] [80009b70] timer_interrupt+0xb0/0xe0
<6>[ae70dc10] [8000f450] ret_from_except+0x0/0x18
<6>--- interrupt: 901 at _raw_spin_lock_bh+0x3c/0x70
<6>    LR = tipc_nametbl_withdraw+0x4c/0x140 [tipc]
<6>[ae70dcd0] [a85d99a0] 0xa85d99a0 (unreliable)
<6>[ae70dd00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[ae70dd30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[ae70dd70] [804f29e0] sock_release+0x30/0xf0
<6>[ae70dd80] [804f2ab4] sock_close+0x14/0x30
<6>[ae70dd90] [80105844] __fput+0x94/0x200
<6>[ae70ddb0] [8003dca4] task_work_run+0xd4/0x100
<6>[ae70ddd0] [80023620] do_exit+0x280/0x980
<6>[ae70de10] [80024c48] do_group_exit+0x48/0xb0
<6>[ae70de30] [80030344] get_signal+0x244/0x4f0
<6>[ae70de80] [80007734] do_signal+0x34/0x1c0
<6>[ae70df30] [800079a8] do_notify_resume+0x68/0x80
<6>[ae70df40] [8000fa1c] do_user_signal+0x74/0xc4
<6>--- interrupt: c00 at 0xfe490c4
<6>    LR = 0xfe490a0

On Wed, Nov 23, 2016 at 9:16 PM, Parthasarathy Bhuvaragan <parthasarathy.bhuvara...@ericsson.com> wrote:

Hi John,

Ok. Can you test with the 2 patches I posted regarding the nametbl_lock?

/Partha

On 11/21/2016 11:32 PM, John Thompson wrote:

Hi Partha,

My test has 4 nodes, 2 of which are alternately rebooting. When the rebooted node rejoins, a few minutes pass and then the other node is rebooted. I am not printing out link stats, and I believe that the other code is not doing so either when nodes leave or rejoin.

JT

On Tue, Nov 22, 2016 at 2:22 AM, Parthasarathy Bhuvaragan <parthasarathy.bhuvara...@ericsson.com> wrote:

Hi,

There is another branch where the softlockup on nametbl_lock occurs:

tipc_named_rcv()
  Grabs nametbl_lock
  tipc_update_nametbl() (publish/withdraw)
    tipc_node_subscribe()/unsubscribe()
      tipc_node_write_unlock()
      << lockup occurs if it needs to process NODE UP/DOWN or
         LINK UP/DOWN, as it grabs nametbl_lock again >>

/Partha
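A minimal userspace sketch of the re-entrant locking pattern described above: the function names only mirror the TIPC ones, a pthread mutex with trylock stands in for spin_lock_bh() so the program reports the self-deadlock instead of spinning, and none of this is the actual kernel code.

/* Build with: cc -pthread reentrant_lock.c */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t nametbl_lock = PTHREAD_MUTEX_INITIALIZER;

static void node_write_unlock_with_link_down(void)
{
    /* The real code would call back into a path that takes nametbl_lock
     * again. With a non-recursive lock that second acquisition can never
     * succeed, because this same context already holds the lock; on a
     * spinlock that is the CPU spinning forever which the watchdog
     * reports as a soft lockup. */
    if (pthread_mutex_trylock(&nametbl_lock) != 0)
        printf("nested path needs nametbl_lock while we already hold it\n");
    else
        pthread_mutex_unlock(&nametbl_lock);
}

static void named_rcv(void)
{
    pthread_mutex_lock(&nametbl_lock);   /* outer acquisition */
    node_write_unlock_with_link_down();  /* nested path wants it again */
    pthread_mutex_unlock(&nametbl_lock);
}

int main(void)
{
    named_rcv();
    return 0;
}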
On 11/21/2016 01:04 PM, Parthasarathy Bhuvaragan wrote:

Hi,

tipc_nametbl_withdraw() triggers the softlockup as it tries to grab nametbl_lock twice if the node triggered a TIPC_NOTIFY_LINK_DOWN event while it is running. The erroneous call chain is:

tipc_nametbl_withdraw()
  Grab nametbl_lock
  tipc_named_process_backlog()
    tipc_update_nametbl()
      if (dtype == WITHDRAWAL)
        tipc_node_unsubscribe()
          tipc_node_write_unlock()
            if (flags & TIPC_NOTIFY_LINK_DOWN)
              tipc_nametbl_withdraw()
                spin_lock_bh(&tn->nametbl_lock); << Soft Lockup >>

Three callers can trigger this under module exit:

Case 1:
tipc_exit_net()
  tipc_nametbl_withdraw()
    Grab nametbl_lock

Case 2:
tipc_server_stop()
  tipc_conn_kref_release()
    tipc_sock_release()
      sock_release()
        tipc_release()
          tipc_sk_withdraw()
            tipc_nametbl_withdraw()

Case 3:
tipc_server_stop()
  tipc_conn_kref_release()
    kernel_bind()
      tipc_bind()
        tipc_sk_withdraw()
          tipc_nametbl_withdraw()

I will work on a solution for this.

What kind of test were you performing when this occurred (link up/down)? Do you read link statistics periodically in your tests?

/Partha
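For illustration only, one generic way such a cycle is often broken in kernel code (not necessarily what the actual TIPC patches do) is to record the pending event while the lock is held and act on it only after the lock has been dropped. A rough userspace sketch with hypothetical names:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t nametbl_lock = PTHREAD_MUTEX_INITIALIZER;

/* Work that genuinely needs nametbl_lock (e.g. a withdraw). */
static void withdraw(void)
{
    pthread_mutex_lock(&nametbl_lock);
    printf("withdraw performed under nametbl_lock\n");
    pthread_mutex_unlock(&nametbl_lock);
}

/* Stands in for the publish/withdraw processing: instead of calling back
 * into a path that re-takes nametbl_lock, it only records that more work
 * is pending. */
static void update_nametbl(bool *link_down_pending)
{
    *link_down_pending = true;
}

static void named_rcv(void)
{
    bool link_down_pending = false;

    pthread_mutex_lock(&nametbl_lock);
    update_nametbl(&link_down_pending);
    pthread_mutex_unlock(&nametbl_lock);

    /* The deferred follow-up runs with no locks held, so it is free to
     * take nametbl_lock again without self-deadlocking. */
    if (link_down_pending)
        withdraw();
}

int main(void)
{
    named_rcv();
    return 0;
}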
On 11/21/2016 05:30 AM, John Thompson wrote:

Hi Partha,

I was doing some more testing today and have still observed the problem (contrary to what I had emailed earlier). Here is the kernel dump.

<0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [Pluggable Serve:2221]
<6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
<6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P O
<6>task: ae54ced0 ti: aec42000 task.ti: aec42000
<6>NIP: 8069257c LR: c13ebf50 CTR: 80692540
<6>REGS: aec43ad0 TRAP: 0901 Tainted: P O
<6>MSR: 00029002 <CE,EE,ME>  CR: 48002444  XER: 00000000
<6>GPR00: c13ea408 aec43b80 ae54ced0 a624690c 00000000 a6271d84 a39a60cc fffffffd
<6>GPR08: aeefbbc8 00000001 00000001 00000004 80692540
<6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
<6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
<6>Call Trace:
<6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
<6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
<6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
<6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
<6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
<6>[aec43c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
<6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
<6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
<6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
<6>[aec43d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[aec43d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]   TIPC_CMD_SHOW_LINK_STATS or TIPC_NL_LINK_GET
<6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
<6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
<6>[aec43d90] [80105844] __fput+0x94/0x200
<6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
<6>[aec43dd0] [80023620] do_exit+0x280/0x980
<6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
<6>[aec43e30] [80030344] get_signal+0x244/0x4f0
<6>[aec43e80] [80007734] do_signal+0x34/0x1c0
<6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
<6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
<6>--- interrupt: c00 at 0xf4f3d08
<6>    LR = 0xf4f3ce8
<6>Instruction dump:
<6>912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0 7c2004ac
<6>2f890000 4dbe0020 7c210b78 81230000 <2f890000> 40befff4 7c421378 7d201828
<0>Kernel panic - not syncing: softlockup: hung tasks
<6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P O L
<6>Call Trace:
<6>[aec43930] [80694e20] dump_stack+0x84/0xb0 (unreliable)
<6>[aec43940] [80692ca8] panic+0xd8/0x214
<6>[aec439a0] [800a0258] watchdog_timer_fn+0x2d8/0x2e0
<6>[aec439f0] [8007ae58] __hrtimer_run_queues+0x118/0x1d0
<6>[aec43a30] [8007b608] hrtimer_interrupt+0xd8/0x270
<6>[aec43a80] [8000983c] __timer_interrupt+0xac/0x1b0
<6>[aec43aa0] [80009b70] timer_interrupt+0xb0/0xe0
<6>[aec43ac0] [8000f450] ret_from_except+0x0/0x18
<6>--- interrupt: 901 at _raw_spin_lock_bh+0x3c/0x70
<6>    LR = tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
<6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
<6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
<6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
<6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
<6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
<6>[aec43c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
<6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
<6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
<6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
<6>[aec43d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[aec43d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
<6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
<6>[aec43d90] [80105844] __fput+0x94/0x200
<6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
<6>[aec43dd0] [80023620] do_exit+0x280/0x980
<6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
<6>[aec43e30] [80030344] get_signal+0x244/0x4f0
<6>[aec43e80] [80007734] do_signal+0x34/0x1c0
<6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
<6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
<6>--- interrupt: c00 at 0xf4f3d08
<6>    LR = 0xf4f3ce8

On Mon, Nov 21, 2016 at 9:59 AM, John Thompson <thompa....@gmail.com> wrote:

Hi Partha,

In my testing over the weekend the patch performed well - I didn't see any kernel dumps due to this issue. Thanks for the quick response.

JT

On Fri, Nov 18, 2016 at 10:34 AM, John Thompson <thompa....@gmail.com> wrote:

Hi,

I will be able to have some test results on the first patch by the start of next week.

Regards,
JT

On Thu, Nov 17, 2016 at 11:27 PM, Ying Xue <ying....@windriver.com> wrote:

On 11/17/2016 07:04 AM, John Thompson wrote:

Hi Partha / Ying,

I will try out the patch and let you know how it goes.
I also note your comment about providing the other CPU core dumps - in one of my cases I didn't have them, but in others I did and they were interleaved and so were difficult to interpret.

Thanks, it's unnecessary for us to collect more logs, as the soft lockup scenario should be just what Partha described.

Regards,
Ying

Thanks for getting a patch together so quickly.

JT

On Wed, Nov 16, 2016 at 10:23 PM, Parthasarathy Bhuvaragan <parthasarathy.bhuvara...@ericsson.com> wrote:

Hi Ying / John,

The soft lockup is in the call chain of tipc_nametbl_withdraw() when it performs tipc_conn_kref_release(): it tries to grab nametbl_lock again while already holding it.

tipc_nametbl_withdraw()
  spin_lock_bh(&tn->nametbl_lock);
  tipc_nametbl_remove_publ()
    spin_lock_bh(&seq->lock);
    tipc_nameseq_remove_publ()
      tipc_subscrp_report_overlap()
        tipc_subscrp_send_event()
          tipc_conn_sendmsg()
          << Here the test_bit(CF_CONNECTED, &con->flags) fails, leading to
             the else case where we do a conn_put(), and that triggers the
             cleanup as the refcount reaches 0, leading to the call chain
             below: >>
            tipc_conn_kref_release()
              tipc_sock_release()
                tipc_conn_release()
                  tipc_subscrb_delete()
                    tipc_subscrp_delete()
                      tipc_nametbl_unsubscribe()
                        spin_lock_bh(&tn->nametbl_lock); << !! Soft Lockup >>

One cause is that tipc_exit_net() first calls tipc_topsrv_stop() and then tipc_nametbl_withdraw() within the scope of tipc_net_stop().

The above chain will only occur in a narrow window for a given connection:

CPU#1: tipc_nametbl_withdraw() manages to perform tipc_conn_lookup() and steps the refcount to 2, while on CPU#2 the following occurs:
CPU#2: tipc_server_stop() calls tipc_close_conn(con). This performs a conn_put(), decrementing the refcount to 1.
Now CPU#1 continues, detects that the connection is not CF_CONNECTED and does a conn_put(), triggering the release callback.

Before commit 333f796235a527, the above won't happen.

/Partha

On 11/15/2016 04:11 PM, Xue, Ying wrote:

Hi John,

Regarding the stack trace you provided below, I get two potential call chains:

tipc_nametbl_withdraw()
  spin_lock_bh(&tn->nametbl_lock);
  tipc_nametbl_remove_publ()
    spin_lock_bh(&seq->lock);
    tipc_nameseq_remove_publ()
      tipc_subscrp_report_overlap()
        tipc_subscrp_send_event()
          tipc_conn_sendmsg()
            spin_lock_bh(&con->outqueue_lock);
            list_add_tail(&e->list, &con->outqueue);

tipc_topsrv_stop()
  tipc_server_stop()
    tipc_close_conn()
      kernel_sock_shutdown()
      tipc_subscrb_delete()
        spin_lock_bh(&subscriber->lock);
        tipc_nametbl_unsubscribe(sub);
          spin_lock_bh(&tn->nametbl_lock);

Although I suspect a reverse lock-ordering issue is leading to the soft lockup, I am still unable to understand which lock, together with nametbl_lock, is taken in reverse order on the two different paths above.

However, you only gave us the log printed on CPU#2, and the logs output by the other cores are also important. So if possible, please share them with us.

By the way, I agree with you, and it seems that commit 333f796235a527 is related to the soft lockup.

Regards,
Ying
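The narrow window Partha describes boils down to dropping what turns out to be the last reference while nametbl_lock is held, so the release callback runs in that context and wants the same lock. A small userspace sketch of just that hazard, with hypothetical names and pthread/C11 stand-ins for the kernel primitives (trylock is used so the program reports the problem rather than spinning):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t nametbl_lock = PTHREAD_MUTEX_INITIALIZER;

struct conn {
    atomic_int ref;
    bool connected;              /* stands in for CF_CONNECTED */
};

/* Release callback: in the reported traces this path ends up in
 * tipc_nametbl_unsubscribe(), which takes nametbl_lock. */
static void conn_release(struct conn *con)
{
    (void)con;
    if (pthread_mutex_trylock(&nametbl_lock) != 0)
        printf("release callback wants nametbl_lock while the caller holds it\n");
    else
        pthread_mutex_unlock(&nametbl_lock);
}

static void conn_put(struct conn *con)
{
    /* Whoever drops the last reference runs the destructor in its own context. */
    if (atomic_fetch_sub(&con->ref, 1) == 1)
        conn_release(con);
}

/* Send path used by the subscription event on the withdraw side. */
static void conn_sendmsg(struct conn *con)
{
    if (!con->connected)
        conn_put(con);           /* connection already closed elsewhere */
}

int main(void)
{
    /* Pretend the server-stop path on another CPU has already closed the
     * connection and dropped its reference, so ours is the last one. */
    struct conn con = { .ref = 1, .connected = false };

    pthread_mutex_lock(&nametbl_lock);   /* e.g. on the withdraw path */
    conn_sendmsg(&con);                  /* last conn_put() fires the destructor under the lock */
    pthread_mutex_unlock(&nametbl_lock);
    return 0;
}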
-----Original Message-----
From: John Thompson [mailto:thompa....@gmail.com]
Sent: Tuesday, November 15, 2016 8:01 AM
To: tipc-discussion@lists.sourceforge.net
Subject: [tipc-discussion] v4.7: soft lockup when releasing a socket

Hi,

I am seeing an occasional kernel soft lockup. I have TIPC v4.7, and the kernel dump occurs when the system is going down for a reboot. The kernel dump is:

<0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [exfx:1474]
<6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O) ipifwd(PO) ...
<6>GPR00: c15333e8 a4e0fb80 a4ee3600 a51748ac 00000000 ae475024 a537feec fffffffd
<6>GPR08: a2197408 00000001 00000001 00000004 80691c00
<6>NIP [80691c40] _raw_spin_lock_bh+0x40/0x70
<6>LR [c1534f30] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
<6>Call Trace:
<6>[a4e0fba0] [c15333e8] tipc_named_reinit+0xf8/0x820 [tipc]
<6>[a4e0fbb0] [c15336a0] tipc_named_reinit+0x3b0/0x820 [tipc]
<6>[a4e0fbd0] [c1540bac] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
<6>[a4e0fc00] [c154164c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
<6>[a4e0fc30] [c1533c9c] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
<6>[a4e0fc70] [c153425c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
<6>[a4e0fca0] [c1534788] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
<6>[a4e0fcd0] [c1534c48] tipc_nametbl_withdraw+0x68/0x140 [tipc]
<6>[a4e0fd00] [c153cc24] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[a4e0fd30] [c153d838] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[a4e0fd70] [804f2870] sock_release+0x30/0xf0
<6>[a4e0fd80] [804f2944] sock_close+0x14/0x30
<6>[a4e0fd90] [80105844] __fput+0x94/0x200
<6>[a4e0fdb0] [8003dca4] task_work_run+0xd4/0x100
<6>[a4e0fdd0] [80023620] do_exit+0x280/0x980
<6>[a4e0fe10] [80024c48] do_group_exit+0x48/0xb0
<6>[a4e0fe30] [80030344] get_signal+0x244/0x4f0
<6>[a4e0fe80] [80007734] do_signal+0x34/0x1c0
<6>[a4e0ff30] [800079a8] do_notify_resume+0x68/0x80
<6>[a4e0ff40] [8000fa1c] do_user_signal+0x74/0xc4

From the stack dump it looks like tipc_named_reinit is trying to acquire nametbl_lock. From looking at the call chain I can see that tipc_conn_sendmsg can end up calling conn_put, which will go on and call tipc_named_reinit via tipc_sock_release.
As tipc_nametbl_withdraw (from the stack dump) has already acquired the nametbl_lock, tipc_named_reinit cannot get it and so the process hangs.

The call to tipc_sock_release (added in commit 333f796235a527, http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux-stable.git/commit/?id=333f796235a52727db7e0a13888045f3aa3d5335) seems to have changed the behaviour such that it tries to do a lot more when shutting the connection down.

If there is other information I can provide please let me know.

Regards,
John
------------------------------------------------------------------------------
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion