Hi Partha, I tested with the latest 3 patches last night and observed no soft lockups.
Thanks, John On Fri, Nov 25, 2016 at 11:50 AM, John Thompson <thompa....@gmail.com> wrote: > Hi Partha, > > I rebuilt afresh and retried the test with the same lockup kernel dumps. > Yes I have multiple tipc clients subscribed to the topology server, at > least 10 clients. > They all use a subscription timeout of TIPC_WAIT_FOREVER > > I will try the kernel command line parameter next week. > JT > > > On Fri, Nov 25, 2016 at 3:07 AM, Parthasarathy Bhuvaragan < > parthasarathy.bhuvara...@ericsson.com> wrote: > >> Hi John, >> >> Do you have several tipc clients subscribed to topology server? >> What subscription timeout do they use? >> >> Please enable kernel command line parameter: >> softlockup_all_cpu_backtrace=1 >> >> /Partha >> >> On 11/23/2016 11:04 PM, John Thompson wrote: >> >>> Hi Partha, >>> >>> I tested overnight with the 2 patches you provided yesterday. >>> Testing is still showing problems, here is one of the soft lockups, the >>> other is the same as I sent the other day. >>> I am going to redo my build as I expected some change in behaviour with >>> your patches. >>> >>> It is possible that I am doing some dumping of nodes or links as I am >>> not certain of all the code or paths. >>> I have found that we do a tipc-config -nt and tipc-config -ls in some >>> situations but it shouldn 't be initiated in this >>> reboot case. >>> >>> <0>NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [pimd:1220] >>> <0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [rpc.13:2419] >>> <6>Modules linked in: >>> <0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [AIS >>> listener:1600] >>> <6> tipc >>> <6>Modules linked in: >>> <6> jitterentropy_rng >>> <6> tipc >>> <6> echainiv >>> <6> jitterentropy_rng >>> <6> drbg >>> <6> echainiv >>> <6> platform_driver(O) >>> <6> drbg >>> <6> platform_driver(O) >>> <6> >>> <6>CPU: 0 PID: 2419 Comm: rpc.13 Tainted: P O >>> <6>CPU: 2 PID: 1600 Comm: AIS listener Tainted: P O >>> <6>task: aed76d20 ti: ae70c000 task.ti: ae70c000 >>> <6>task: aee3ced0 ti: ae686000 task.ti: ae686000 >>> <6>NIP: 8069257c LR: c13ebc4c CTR: 80692540 >>> <6>NIP: 80692578 LR: c13ebf50 CTR: 80692540 >>> <6>REGS: ae70dc20 TRAP: 0901 Tainted: P O >>> <6>REGS: ae687ad0 TRAP: 0901 Tainted: P O >>> <6>MSR: 00029002 >>> <6>MSR: 00029002 >>> <6>< >>> <6>< >>> <6>CE >>> <6>CE >>> <6>,EE >>> <6>,EE >>> <6>,ME >>> <6>,ME >>> <6>> >>> <6>> >>> <6> CR: 42002484 XER: 20000000 >>> <6> CR: 48002444 XER: 00000000 >>> <6> >>> <6>GPR00: >>> <6> >>> <6>GPR00: >>> <6>c13f3c34 >>> <6>c13ea408 >>> <6>ae70dcd0 >>> <6>ae687b80 >>> <6>aed76d20 >>> <6>aee3ced0 >>> <6>ae55c8ec >>> <6>ae55c8ec >>> <6>00002711 >>> <6>00000000 >>> <6>00000005 >>> <6>a30e7264 >>> <6>8666592a >>> <6>ae5e070c >>> <6>8666592b >>> <6>fffffffd >>> <6> >>> <6>GPR08: >>> <6> >>> <6>GPR08: >>> <6>ae9dad20 >>> <6>ae72fbc8 >>> <6>00000001 >>> <6>00000001 >>> <6>00000001 >>> <6>00000001 >>> <6>00000000 >>> <6>00000004 >>> <6>80692540 >>> <6>80692540 >>> <6> >>> <6> >>> <6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70 >>> <6>NIP [80692578] _raw_spin_lock_bh+0x38/0x70 >>> <6>LR [c13ebc4c] tipc_nametbl_withdraw+0x4c/0x140 [tipc] >>> <6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc] >>> <6>Call Trace: >>> <6>Call Trace: >>> <6>[ae70dcd0] [a85d99a0] 0xa85d99a0 >>> <6>[ae687b80] [800fa258] check_object+0xc8/0x270 >>> <6> (unreliable) >>> <6> (unreliable) >>> <6> >>> <6> >>> <6>[ae70dd00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc] >>> <6>[ae687ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc] >>> <6> >>> <6> >>> <6>[ae70dd30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc] >>> <6>[ae687bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc] >>> <6> >>> <6> >>> <6>[ae70dd70] [804f29e0] sock_release+0x30/0xf0 >>> <6>[ae687bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc] >>> <6> >>> <6> >>> <6>[ae70dd80] [804f2ab4] sock_close+0x14/0x30 >>> <6>[ae687c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc] >>> <6> >>> <6> >>> <6>[ae70dd90] [80105844] __fput+0x94/0x200 >>> <6>[ae687c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc] >>> <6> >>> <6> >>> <6>[ae70ddb0] [8003dca4] task_work_run+0xd4/0x100 >>> <6>[ae687c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc] >>> <6> >>> <6> >>> <6>[ae70ddd0] [80023620] do_exit+0x280/0x980 >>> <6>[ae687ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc] >>> <6> >>> <6> >>> <6>[ae70de10] [80024c48] do_group_exit+0x48/0xb0 >>> <6>[ae687cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc] >>> <6> >>> <6> >>> <6>[ae70de30] [80030344] get_signal+0x244/0x4f0 >>> <6>[ae687d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc] >>> <6> >>> <6> >>> <6>[ae70de80] [80007734] do_signal+0x34/0x1c0 >>> <6>[ae687d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc] >>> <6> >>> <6> >>> <6>[ae70df30] [800079a8] do_notify_resume+0x68/0x80 >>> <6>[ae687d70] [804f29e0] sock_release+0x30/0xf0 >>> <6> >>> <6> >>> <6>[ae70df40] [8000fa1c] do_user_signal+0x74/0xc4 >>> <6>[ae687d80] [804f2ab4] sock_close+0x14/0x30 >>> <6> >>> <6> >>> <6>--- interrupt: c00 at 0xfe490c4 >>> <6> LR = 0xfe490a0 >>> <6>[ae687d90] [80105844] __fput+0x94/0x200 >>> <6>Instruction dump: >>> <6> >>> <6> >>> <6>[ae687db0] [8003dca4] task_work_run+0xd4/0x100 >>> <6>912a0008 >>> <6> >>> <6>39400001 >>> <6>[ae687dd0] [80023620] do_exit+0x280/0x980 >>> <6>7d201828 >>> <6> >>> <6>2c090000 >>> <6>[ae687e10] [80024c48] do_group_exit+0x48/0xb0 >>> <6>40820010 >>> <6> >>> <6>7d40192d >>> <6>[ae687e30] [80030344] get_signal+0x244/0x4f0 >>> <6>40a2fff0 >>> <6> >>> <6>7c2004ac >>> <6>[ae687e80] [80007734] do_signal+0x34/0x1c0 >>> <6> >>> <6> >>> <6>2f890000 >>> <6>[ae687f30] [800079a8] do_notify_resume+0x68/0x80 >>> <6>4dbe0020 >>> <6> >>> <6>7c210b78 >>> <6>[ae687f40] [8000fa1c] do_user_signal+0x74/0xc4 >>> <6>81230000 >>> <6> >>> <6><2f890000> >>> <6>--- interrupt: c00 at 0xf4f3d08 >>> <6> LR = 0xf4f3ce8 >>> <6>40befff4 >>> <6>Instruction dump: >>> <6>7c421378 >>> <6> >>> <6>7d201828 >>> <6>39290200 >>> <6> >>> <6>912a0008 39400001 >>> <0>Kernel panic - not syncing: softlockup: hung tasks >>> <6>7d201828 >>> <6>CPU: 0 PID: 2419 Comm: rpc.13 Tainted: P O L >>> <6>2c090000 >>> <6>Call Trace: >>> <6>40820010 >>> <6>[ae70da80] [80694e20] dump_stack+0x84/0xb0 >>> <6>7d40192d >>> <6> (unreliable) >>> <6>40a2fff0 >>> <6> >>> <6> >>> <6>[ae70da90] [80692ca8] panic+0xd8/0x214 >>> <6>7c2004ac >>> <6> >>> <6>2f890000 >>> <6>[ae70daf0] [800a0258] watchdog_timer_fn+0x2d8/0x2e0 >>> <6>4dbe0020 >>> <6> >>> <6>7c210b78 >>> <6>[ae70db40] [8007ae58] __hrtimer_run_queues+0x118/0x1d0 >>> <6><81230000> >>> <6> >>> <6>2f890000 >>> <6>[ae70db80] [8007b608] hrtimer_interrupt+0xd8/0x270 >>> <6>40befff4 >>> <6> >>> <6>7c421378 >>> <6>[ae70dbd0] [8000983c] __timer_interrupt+0xac/0x1b0 >>> <6> >>> <6> >>> <6>[ae70dbf0] [80009b70] timer_interrupt+0xb0/0xe0 >>> <6>[ae70dc10] [8000f450] ret_from_except+0x0/0x18 >>> <6>--- interrupt: 901 at _raw_spin_lock_bh+0x3c/0x70 >>> <6> LR = tipc_nametbl_withdraw+0x4c/0x140 [tipc] >>> <6>[ae70dcd0] [a85d99a0] 0xa85d99a0 (unreliable) >>> <6>[ae70dd00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc] >>> <6>[ae70dd30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc] >>> <6>[ae70dd70] [804f29e0] sock_release+0x30/0xf0 >>> <6>[ae70dd80] [804f2ab4] sock_close+0x14/0x30 >>> <6>[ae70dd90] [80105844] __fput+0x94/0x200 >>> <6>[ae70ddb0] [8003dca4] task_work_run+0xd4/0x100 >>> <6>[ae70ddd0] [80023620] do_exit+0x280/0x980 >>> <6>[ae70de10] [80024c48] do_group_exit+0x48/0xb0 >>> <6>[ae70de30] [80030344] get_signal+0x244/0x4f0 >>> <6>[ae70de80] [80007734] do_signal+0x34/0x1c0 >>> <6>[ae70df30] [800079a8] do_notify_resume+0x68/0x80 >>> <6>[ae70df40] [8000fa1c] do_user_signal+0x74/0xc4 >>> <6>--- interrupt: c00 at 0xfe490c4 >>> <6> LR = 0xfe490a0 >>> >>> >>> >>> On Wed, Nov 23, 2016 at 9:16 PM, Parthasarathy Bhuvaragan >>> <parthasarathy.bhuvara...@ericsson.com >>> <mailto:parthasarathy.bhuvara...@ericsson.com>> wrote: >>> >>> Hi John, >>> >>> Ok. Can you test with the 2 patches I posted regarding the >>> nametbl_lock? >>> >>> /Partha >>> >>> On 11/21/2016 11:32 PM, John Thompson wrote: >>> >>> Hi Partha, >>> >>> My test has 4 nodes, 2 of which are alternately rebooting. When >>> the >>> rebooted node rejoins a few minutes pass and then the other node >>> is >>> rebooted. >>> I am not printing out link stats and believe that the the other >>> code is >>> not doing so either, when nodes leave or rejoin. >>> >>> JT >>> >>> >>> On Tue, Nov 22, 2016 at 2:22 AM, Parthasarathy Bhuvaragan >>> <parthasarathy.bhuvara...@ericsson.com >>> <mailto:parthasarathy.bhuvara...@ericsson.com> >>> <mailto:parthasarathy.bhuvara...@ericsson.com >>> <mailto:parthasarathy.bhuvara...@ericsson.com>>> wrote: >>> >>> Hi, >>> >>> There is an other branch where softlockup for nametbl_lock >>> occurs. >>> >>> tipc_named_rcv() Grabs nametbl_lock >>> tipc_update_nametbl() (publish/withdraw) >>> tipc_node_subscribe()/unsubscribe() >>> tipc_node_write_unlock() >>> << lockup occurs if it needs to process NODE >>> UP/DOWN LINK >>> UP/DOWN, as it grabs nametbl_lock again >> >>> >>> /Partha >>> >>> >>> On 11/21/2016 01:04 PM, Parthasarathy Bhuvaragan wrote: >>> >>> Hi, >>> >>> tipc_nametbl_withdraw() triggers the softlockup as it >>> tries to grab >>> nametbl_lock twice if the node triggered a >>> TIPC_NOTIFY_LINK_DOWN >>> event >>> while its is running. The erroneous call chain is: >>> >>> tipc_nametbl_withdraw() Grab nametbl_lock >>> tipc_named_process_backlog() >>> tipc_update_nametbl() >>> if (dtype == WITHDRAWAL) tipc_node_unsubscribe() >>> tipc_node_write_unlock() >>> if (flags & TIPC_NOTIFY_LINK_DOWN) >>> tipc_nametbl_withdraw() >>> spin_lock_bh(&tn->nametbl_lock); << Soft >>> Lockup >> >>> >>> Three callers which can cause this under module exit: >>> >>> Case1: >>> tipc_exit_net() >>> tipc_nametbl_withdraw() Grab nametbl_lock >>> >>> Case2: >>> tipc_server_stop() >>> tipc_conn_kref_release >>> tipc_sock_release >>> sock_release() >>> tipc_release() >>> tipc_sk_withdraw() >>> tipc_nametbl_withdraw() >>> >>> Case3: >>> tipc_server_stop() >>> tipc_conn_kref_release() >>> kernel_bind() >>> tipc_bind() >>> tipc_sk_withdraw() >>> tipc_nametbl_withdraw() >>> >>> I will work on a solution for this. >>> >>> What kind of test were you performing when this occurred >>> (linkup/down)? >>> Do you read link statistics periodically in your tests? >>> >>> /Partha >>> >>> On 11/21/2016 05:30 AM, John Thompson wrote: >>> >>> Hi Partha, >>> >>> I was doing some some more testing today and have >>> still >>> observed the >>> problem (contrary to what I had emailed earlier). >>> >>> Here is the kernel dump. >>> >>> <0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for >>> 22s! >>> [Pluggable >>> Serve:2221] >>> <6>Modules linked in: tipc jitterentropy_rng >>> echainiv drbg >>> platform_driver(O) >>> <6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: >>> P O >>> <6>task: ae54ced0 ti: aec42000 task.ti: aec42000 >>> <6>NIP: 8069257c LR: c13ebf50 CTR: 80692540 >>> <6>REGS: aec43ad0 TRAP: 0901 Tainted: P O >>> <6>MSR: 00029002 <CE,EE,ME> CR: 48002444 XER: >>> 00000000 >>> <6> >>> <6>GPR00: c13ea408 aec43b80 ae54ced0 a624690c >>> 00000000 >>> a6271d84 a39a60cc >>> fffffffd >>> <6>GPR08: aeefbbc8 00000001 00000001 00000004 >>> 80692540 >>> <6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70 >>> <6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 >>> [tipc] >>> <6>Call Trace: >>> <6>[aec43b80] [800fa258] check_object+0xc8/0x270 >>> (unreliable) >>> <6>[aec43ba0] [c13ea408] >>> tipc_named_reinit+0xf8/0x820 [tipc] >>> <6>[aec43bb0] [c13ea6c0] >>> tipc_named_reinit+0x3b0/0x820 [tipc] >>> <6>[aec43bd0] [c13f7bbc] >>> tipc_nl_publ_dump+0x50c/0xed0 [tipc] >>> <6>[aec43c00] [c13f865c] >>> tipc_conn_sendmsg+0xdc/0x170 [tipc] >>> <6>[aec43c30] [c13eacbc] >>> tipc_subscrp_report_overlap+0xbc/0xd0 [tipc] >>> <6>[aec43c70] [c13eb27c] >>> tipc_topsrv_stop+0x45c/0x4f0 [tipc] >>> <6>[aec43ca0] [c13eb7a8] >>> tipc_nametbl_remove_publ+0x58/0x110 >>> [tipc] >>> <6>[aec43cd0] [c13ebc68] >>> tipc_nametbl_withdraw+0x68/0x140 [tipc] >>> <6>[aec43d00] [c13f3c34] >>> tipc_nl_node_dump_link+0x1904/0x45d0 [tipc] >>> <6>[aec43d30] [c13f4848] >>> tipc_nl_node_dump_link+0x2518/0x45d0 [tipc] >>> >>> TIPC_CMD_SHOW_LINK_STATS or TIPC_NL_LINK_GET >>> >>> <6>[aec43d70] [804f29e0] sock_release+0x30/0xf0 >>> <6>[aec43d80] [804f2ab4] sock_close+0x14/0x30 >>> <6>[aec43d90] [80105844] __fput+0x94/0x200 >>> <6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100 >>> <6>[aec43dd0] [80023620] do_exit+0x280/0x980 >>> <6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0 >>> <6>[aec43e30] [80030344] get_signal+0x244/0x4f0 >>> <6>[aec43e80] [80007734] do_signal+0x34/0x1c0 >>> <6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80 >>> <6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4 >>> <6>--- interrupt: c00 at 0xf4f3d08 >>> <6> LR = 0xf4f3ce8 >>> <6>Instruction dump: >>> <6>912a0008 39400001 7d201828 2c090000 40820010 >>> 7d40192d >>> 40a2fff0 7c2004ac >>> <6>2f890000 4dbe0020 7c210b78 81230000 <2f890000> >>> 40befff4 >>> 7c421378 >>> 7d201828 >>> <0>Kernel panic - not syncing: softlockup: hung tasks >>> <6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P >>> O L >>> <6>Call Trace: >>> <6>[aec43930] [80694e20] dump_stack+0x84/0xb0 >>> (unreliable) >>> <6>[aec43940] [80692ca8] panic+0xd8/0x214 >>> <6>[aec439a0] [800a0258] >>> watchdog_timer_fn+0x2d8/0x2e0 >>> <6>[aec439f0] [8007ae58] >>> __hrtimer_run_queues+0x118/0x1d0 >>> <6>[aec43a30] [8007b608] hrtimer_interrupt+0xd8/0x270 >>> <6>[aec43a80] [8000983c] __timer_interrupt+0xac/0x1b0 >>> <6>[aec43aa0] [80009b70] timer_interrupt+0xb0/0xe0 >>> <6>[aec43ac0] [8000f450] ret_from_except+0x0/0x18 >>> <6>--- interrupt: 901 at _raw_spin_lock_bh+0x3c/0x70 >>> <6> LR = tipc_nametbl_unsubscribe+0x50/0x120 >>> [tipc] >>> <6>[aec43b80] [800fa258] check_object+0xc8/0x270 >>> (unreliable) >>> <6>[aec43ba0] [c13ea408] >>> tipc_named_reinit+0xf8/0x820 [tipc] >>> <6>[aec43bb0] [c13ea6c0] >>> tipc_named_reinit+0x3b0/0x820 [tipc] >>> <6>[aec43bd0] [c13f7bbc] >>> tipc_nl_publ_dump+0x50c/0xed0 [tipc] >>> <6>[aec43c00] [c13f865c] >>> tipc_conn_sendmsg+0xdc/0x170 [tipc] >>> <6>[aec43c30] [c13eacbc] >>> tipc_subscrp_report_overlap+0xbc/0xd0 [tipc] >>> <6>[aec43c70] [c13eb27c] >>> tipc_topsrv_stop+0x45c/0x4f0 [tipc] >>> <6>[aec43ca0] [c13eb7a8] >>> tipc_nametbl_remove_publ+0x58/0x110 >>> [tipc] >>> <6>[aec43cd0] [c13ebc68] >>> tipc_nametbl_withdraw+0x68/0x140 [tipc] >>> <6>[aec43d00] [c13f3c34] >>> tipc_nl_node_dump_link+0x1904/0x45d0 [tipc] >>> <6>[aec43d30] [c13f4848] >>> tipc_nl_node_dump_link+0x2518/0x45d0 [tipc] >>> <6>[aec43d70] [804f29e0] sock_release+0x30/0xf0 >>> <6>[aec43d80] [804f2ab4] sock_close+0x14/0x30 >>> <6>[aec43d90] [80105844] __fput+0x94/0x200 >>> <6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100 >>> <6>[aec43dd0] [80023620] do_exit+0x280/0x980 >>> <6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0 >>> <6>[aec43e30] [80030344] get_signal+0x244/0x4f0 >>> <6>[aec43e80] [80007734] do_signal+0x34/0x1c0 >>> <6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80 >>> <6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4 >>> <6>--- interrupt: c00 at 0xf4f3d08 >>> <6> LR = 0xf4f3ce8 >>> >>> >>> On Mon, Nov 21, 2016 at 9:59 AM, John Thompson >>> <thompa....@gmail.com <mailto:thompa....@gmail.com> >>> <mailto:thompa....@gmail.com <mailto:thompa....@gmail.com>> >>> <mailto:thompa....@gmail.com >>> <mailto:thompa....@gmail.com> <mailto:thompa....@gmail.com >>> <mailto:thompa....@gmail.com>>>> >>> wrote: >>> >>> Hi Partha, >>> >>> In my testing over the weekend the patch >>> performed well >>> - I didn't >>> see any kernel dumps due to this issue. >>> >>> Thanks for the quick response. >>> JT >>> >>> >>> On Fri, Nov 18, 2016 at 10:34 AM, John Thompson >>> <thompa....@gmail.com >>> <mailto:thompa....@gmail.com> <mailto:thompa....@gmail.com >>> <mailto:thompa....@gmail.com>> >>> <mailto:thompa....@gmail.com >>> <mailto:thompa....@gmail.com> <mailto:thompa....@gmail.com >>> <mailto:thompa....@gmail.com>>>> >>> wrote: >>> >>> Hi, >>> >>> I will be able to have some test results by >>> the >>> start of next >>> week on the first patch. >>> >>> Regards, >>> JT >>> >>> >>> On Thu, Nov 17, 2016 at 11:27 PM, Ying Xue >>> <ying....@windriver.com >>> <mailto:ying....@windriver.com> >>> <mailto:ying....@windriver.com >>> <mailto:ying....@windriver.com>> >>> <mailto:ying....@windriver.com >>> <mailto:ying....@windriver.com> >>> <mailto:ying....@windriver.com >>> <mailto:ying....@windriver.com>>>> wrote: >>> >>> On 11/17/2016 07:04 AM, John Thompson >>> wrote: >>> >>> Hi Partha / Ying, >>> >>> I will try out the patch and let you >>> know >>> how it goes. >>> I also note about providing the >>> other CPU >>> core dumps - >>> in one of my cases I >>> didn't have them but in others I did >>> but >>> they were interleaved and so were >>> difficult >>> to interpret. >>> >>> >>> Thanks, it's unnecessary for us to >>> collect more >>> logs as its >>> soft lockup scenario should be just what >>> Partha >>> described. >>> >>> Regards, >>> Ying >>> >>> >>> >>> Thanks for getting a patch together >>> so quickly. >>> >>> JT >>> >>> On Wed, Nov 16, 2016 at 10:23 PM, >>> Parthasarathy Bhuvaragan < >>> >>> parthasarathy.bhuvara...@ericsson.com >>> <mailto:parthasarathy.bhuvara...@ericsson.com> >>> <mailto:parthasarathy.bhuvara...@ericsson.com >>> <mailto:parthasarathy.bhuvara...@ericsson.com>> >>> >>> <mailto:parthasarathy.bhuvara...@ericsson.com >>> <mailto:parthasarathy.bhuvara...@ericsson.com> >>> >>> <mailto:parthasarathy.bhuvara...@ericsson.com >>> <mailto:parthasarathy.bhuvara...@ericsson.com>>>> wrote: >>> >>> Hi Ying / John, >>> >>> The soft lock is the call chain >>> of >>> tipc_nametbl_withdraw(), when it >>> performs the >>> tipc_conn_kref_release() as >>> it tries to >>> grab nametbl_lock >>> again while holding it already. >>> >>> tipc_nametbl_withdraw >>> >>> spin_lock_bh(&tn->nametbl_lock); >>> tipc_nametbl_remove_publ >>> >>> spin_lock_bh(&seq->lock); >>> tipc_nameseq_remove_publ >>> >>> tipc_subscrp_report_overlap >>> >>> tipc_subscrp_send_event >>> tipc_conn_sendmsg >>> >>> << Here, the >>> (test_bit(CF_CONNECTED, >>> &con->flags)) >>> Fails, leading to the >>> else case where do do a >>> conn_put() and >>> that triggers >>> the cleanup as >>> refcount reached 0. Leading the >>> call >>> chain below : >> >>> tipc_conn_kref_release >>> tipc_sock_release >>> tipc_conn_release >>> tipc_subscrb_delete >>> tipc_subscrp_delete >>> >>> tipc_nametbl_unsubscribe >>> >>> spin_lock_bh(&tn->nametbl_lock); >>> << !! Soft Lockup >> >>> >>> One cause is that >>> tipc_exit_net() calls >>> first calls >>> tipc_topsrv_stop() and >>> then tipc_nametbl_withdraw() in >>> scope of >>> tipc_net_stop(). >>> >>> The above chain will only occur >>> in a >>> narrow window >>> for a given connection: >>> CPU#1: >>> tipc_nametbl_withdraw() manages >>> to perform >>> tipc_conn_lookup() and steps >>> the refcount to 2, while in >>> CPU#2 the >>> following occurs: >>> CPU#2: >>> tipc_server_stop() calls >>> tipc_close_conn(con). This >>> performs a conn_put() >>> decrementing refcount to 1. >>> Now, CPU#1 continues and detects >>> that >>> the connection >>> is not CF_CONNECTED >>> and does a conn_put(), >>> triggering the >>> release callback. >>> >>> Before commit 333f796235a527, >>> the above >>> wont happen. >>> >>> /Partha >>> >>> >>> On 11/15/2016 04:11 PM, Xue, >>> Ying wrote: >>> >>> Hi John, >>> >>> Regarding the stack trace you >>> provided below, I >>> get the two potential >>> call chains: >>> >>> tipc_nametbl_withdraw >>> >>> spin_lock_bh(&tn->nametbl_lock); >>> tipc_nametbl_remove_publ >>> >>> spin_lock_bh(&seq->lock); >>> tipc_nameseq_remove_publ >>> >>> tipc_subscrp_report_overlap >>> >>> tipc_subscrp_send_event >>> tipc_conn_sendmsg >>> >>> spin_lock_bh(&con->outqueue_lock); >>> >>> list_add_tail(&e->list, >>> &con->outqueue); >>> >>> >>> tipc_topsrv_stop >>> tipc_server_stop >>> tipc_close_conn >>> kernel_sock_shutdown >>> tipc_subscrb_delete >>> >>> spin_lock_bh(&subscriber->lock); >>> >>> tipc_nametbl_unsubscribe(sub); >>> >>> spin_lock_bh(&tn->nametbl_lock); >>> >>> Although I suspect this is a >>> revert >>> lock issue >>> leading to the soft >>> lockup, I am still unable to >>> understand which >>> lock together with >>> nametbl_lock is taken >>> reversely on >>> the two >>> different paths above. >>> However, you just gave us >>> the log >>> printed on >>> CPU#2, but the logs >>> outputted by other cores are >>> also >>> important. So >>> if possible, please share >>> them with us. >>> >>> By the way, I agree with >>> you, and it >>> seems that >>> commit 333f796235a527 is >>> related to the soft lockup. >>> >>> Regards, >>> Ying >>> >>> -----Original Message----- >>> From: John Thompson >>> [mailto:thompa....@gmail.com >>> <mailto:thompa....@gmail.com> <mailto:thompa....@gmail.com >>> <mailto:thompa....@gmail.com>> >>> <mailto:thompa....@gmail.com >>> <mailto:thompa....@gmail.com> >>> <mailto:thompa....@gmail.com >>> <mailto:thompa....@gmail.com>>>] >>> Sent: Tuesday, November 15, >>> 2016 8:01 AM >>> To: >>> tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net> >>> <mailto:tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net>> >>> >>> <mailto:tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net> >>> >>> <mailto:tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net>>> >>> Subject: [tipc-discussion] >>> v4.7: >>> soft lockup >>> when releasing a socket >>> >>> Hi, >>> >>> I am seeing an occasional >>> kernel >>> soft lockup. I >>> have TIPC v4.7 and the >>> kernel dump occurs when the >>> system >>> is going down >>> for a reboot. >>> >>> The kernel dump is: >>> >>> <0>NMI watchdog: BUG: soft >>> lockup - >>> CPU#2 stuck >>> for 23s! [exfx:1474] >>> <6>Modules linked in: tipc >>> jitterentropy_rng >>> echainiv drbg >>> platform_driver(O) ipifwd(PO) >>> ... >>> <6> >>> <6>GPR00: c15333e8 a4e0fb80 >>> a4ee3600 >>> a51748ac >>> 00000000 ae475024 a537feec >>> fffffffd >>> <6>GPR08: a2197408 00000001 >>> 00000001 >>> 00000004 >>> 80691c00 <6>NIP [80691c40] >>> _raw_spin_lock_bh+0x40/0x70 >>> <6>LR >>> [c1534f30] >>> >>> tipc_nametbl_unsubscribe+0x50/0x120 >>> [tipc] <6>Call Trace: >>> <6>[a4e0fba0] [c15333e8] >>> tipc_named_reinit+0xf8/0x820 >>> [tipc] >>> <6>[a4e0fbb0] [c15336a0] >>> >>> tipc_named_reinit+0x3b0/0x820 [tipc] >>> <6>[a4e0fbd0] >>> [c1540bac] >>> tipc_nl_publ_dump+0x50c/0xed0 [tipc] >>> <6>[a4e0fc00] [c154164c] >>> tipc_conn_sendmsg+0xdc/0x170 >>> [tipc] >>> <6>[a4e0fc30] [c1533c9c] >>> >>> tipc_subscrp_report_overlap+0xbc/0xd0 [tipc] >>> <6>[a4e0fc70] [c153425c] >>> tipc_topsrv_stop+0x45c/0x4f0 >>> [tipc] >>> <6>[a4e0fca0] [c1534788] >>> >>> tipc_nametbl_remove_publ+0x58/0x110 >>> [tipc] >>> <6>[a4e0fcd0] [c1534c48] >>> >>> tipc_nametbl_withdraw+0x68/0x140 [tipc] >>> <6>[a4e0fd00] [c153cc24] >>> >>> tipc_nl_node_dump_link+0x1904/0x45d0 >>> [tipc] >>> <6>[a4e0fd30] [c153d838] >>> >>> tipc_nl_node_dump_link+0x2518/0x45d0 >>> [tipc] >>> <6>[a4e0fd70] [804f2870] >>> sock_release+0x30/0xf0 >>> <6>[a4e0fd80] >>> [804f2944] >>> sock_close+0x14/0x30 >>> <6>[a4e0fd90] [80105844] >>> __fput+0x94/0x200 >>> <6>[a4e0fdb0] [8003dca4] >>> task_work_run+0xd4/0x100 >>> <6>[a4e0fdd0] >>> [80023620] >>> do_exit+0x280/0x980 >>> <6>[a4e0fe10] [80024c48] >>> do_group_exit+0x48/0xb0 >>> <6>[a4e0fe30] [80030344] >>> get_signal+0x244/0x4f0 >>> <6>[a4e0fe80] >>> [80007734] >>> do_signal+0x34/0x1c0 >>> <6>[a4e0ff30] [800079a8] >>> do_notify_resume+0x68/0x80 >>> <6>[a4e0ff40] >>> [8000fa1c] >>> do_user_signal+0x74/0xc4 >>> >>> >>> From the stack dump it looks >>> like >>> tipc_named_reinit is trying >>> to >>> >>> >>> acquire nametbl_lock. >>> >>> From looking at the call >>> chain I can >>> see that >>> tipc_conn_sendmsg can >>> >>> >>> send up calling conn_put >>> >>> which will go on and call the >>> tipc_named_reinit >>> via tipc_sock_release. >>> >>> As tipc_nametbl_withdraw >>> (from the >>> stack dump) >>> has already acquired the >>> nametbl_lock, >>> tipc_named_reinit >>> >>> cannot get it and so the >>> process hangs. >>> >>> >>> The call to >>> tipc_sock_release (added >>> in Commit >>> 333f796235a527 >>> >>> >>> <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux- >>> <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux-> >>> >>> <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux- >>> <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux->> >>> >>> >>> <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux- >>> <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux-> >>> >>> <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux- >>> <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux->>> >>> >>> >>> stable.git/commit/?id=333f796235a52727db7e0a13888045f3aa3d5335>) >>> seems to have changed the >>> behaviour >>> >>> such that it tries to do a >>> lot more when >>> shutting the connection down. >>> >>> >>> If there is other >>> information I can >>> provide >>> please let me know. >>> >>> Regards, >>> >>> John >>> >>> >>> ------------------------------------------------------------ >>> ------------------ >>> >>> _______________________________________________ >>> tipc-discussion mailing list >>> >>> tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net> >>> <mailto:tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net>> >>> >>> <mailto:tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net> >>> <mailto:tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net>>> >>> >>> >>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion> >>> >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion>> >>> >>> >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion> >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion>>> >>> >>> >>> ------------------------------------------------------------ >>> ------------------ >>> >>> _______________________________________________ >>> tipc-discussion mailing list >>> >>> tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net> >>> <mailto:tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net>> >>> >>> <mailto:tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net> >>> <mailto:tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net>>> >>> >>> >>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion> >>> >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion>> >>> >>> >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion> >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion>>> >>> >>> >>> >>> >>> >>> ------------------------------------------------------------ >>> ------------------ >>> >>> _______________________________________________ >>> tipc-discussion mailing list >>> >>> tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net> >>> <mailto:tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net>> >>> >>> <mailto:tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net> >>> <mailto:tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net>>> >>> >>> >>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion> >>> >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion>> >>> >>> >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion> >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion>>> >>> >>> >>> >>> >>> >>> >>> >>> ------------------------------------------------------------ >>> ------------------ >>> _______________________________________________ >>> tipc-discussion mailing list >>> tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net> >>> <mailto:tipc-discussion@lists.sourceforge.net >>> <mailto:tipc-discussion@lists.sourceforge.net>> >>> >>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion> >>> >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> <https://lists.sourceforge.net/lists/listinfo/tipc-discussion>> >>> >>> >>> >>> > ------------------------------------------------------------------------------ _______________________________________________ tipc-discussion mailing list tipc-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/tipc-discussion