Hi Partha,

I was doing some some more testing today and have still observed the
problem (contrary to what I had emailed earlier).

Here is the kernel dump.

<0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [Pluggable
Serve:2221]
<6>Modules linked in: tipc jitterentropy_rng echainiv drbg
platform_driver(O)
<6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P           O
<6>task: ae54ced0 ti: aec42000 task.ti: aec42000
<6>NIP: 8069257c LR: c13ebf50 CTR: 80692540
<6>REGS: aec43ad0 TRAP: 0901   Tainted: P           O
<6>MSR: 00029002 <CE,EE,ME>  CR: 48002444  XER: 00000000
<6>
<6>GPR00: c13ea408 aec43b80 ae54ced0 a624690c 00000000 a6271d84 a39a60cc
fffffffd
<6>GPR08: aeefbbc8 00000001 00000001 00000004 80692540
<6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
<6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
<6>Call Trace:
<6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
<6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
<6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
<6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
<6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
<6>[aec43c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
<6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
<6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
<6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
<6>[aec43d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[aec43d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
<6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
<6>[aec43d90] [80105844] __fput+0x94/0x200
<6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
<6>[aec43dd0] [80023620] do_exit+0x280/0x980
<6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
<6>[aec43e30] [80030344] get_signal+0x244/0x4f0
<6>[aec43e80] [80007734] do_signal+0x34/0x1c0
<6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
<6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
<6>--- interrupt: c00 at 0xf4f3d08
<6>    LR = 0xf4f3ce8
<6>Instruction dump:
<6>912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0 7c2004ac
<6>2f890000 4dbe0020 7c210b78 81230000 <2f890000> 40befff4 7c421378
7d201828
<0>Kernel panic - not syncing: softlockup: hung tasks
<6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P           O L
<6>Call Trace:
<6>[aec43930] [80694e20] dump_stack+0x84/0xb0 (unreliable)
<6>[aec43940] [80692ca8] panic+0xd8/0x214
<6>[aec439a0] [800a0258] watchdog_timer_fn+0x2d8/0x2e0
<6>[aec439f0] [8007ae58] __hrtimer_run_queues+0x118/0x1d0
<6>[aec43a30] [8007b608] hrtimer_interrupt+0xd8/0x270
<6>[aec43a80] [8000983c] __timer_interrupt+0xac/0x1b0
<6>[aec43aa0] [80009b70] timer_interrupt+0xb0/0xe0
<6>[aec43ac0] [8000f450] ret_from_except+0x0/0x18
<6>--- interrupt: 901 at _raw_spin_lock_bh+0x3c/0x70
<6>    LR = tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
<6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
<6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
<6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
<6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
<6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
<6>[aec43c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
<6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
<6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
<6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
<6>[aec43d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[aec43d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
<6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
<6>[aec43d90] [80105844] __fput+0x94/0x200
<6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
<6>[aec43dd0] [80023620] do_exit+0x280/0x980
<6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
<6>[aec43e30] [80030344] get_signal+0x244/0x4f0
<6>[aec43e80] [80007734] do_signal+0x34/0x1c0
<6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
<6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
<6>--- interrupt: c00 at 0xf4f3d08
<6>    LR = 0xf4f3ce8


On Mon, Nov 21, 2016 at 9:59 AM, John Thompson <thompa....@gmail.com> wrote:

> Hi Partha,
>
> In my testing over the weekend the patch performed well - I didn't see any
> kernel dumps due to this issue.
>
> Thanks for the quick response.
> JT
>
>
> On Fri, Nov 18, 2016 at 10:34 AM, John Thompson <thompa....@gmail.com>
> wrote:
>
>> Hi,
>>
>> I will be able to have some test results by the start of next week on the
>> first patch.
>>
>> Regards,
>> JT
>>
>>
>> On Thu, Nov 17, 2016 at 11:27 PM, Ying Xue <ying....@windriver.com>
>> wrote:
>>
>>> On 11/17/2016 07:04 AM, John Thompson wrote:
>>>
>>>> Hi Partha / Ying,
>>>>
>>>> I will try out the patch and let you know how it goes.
>>>> I also note about providing the other CPU core dumps - in one of my
>>>> cases I
>>>> didn't have them but in others I did but
>>>> they were interleaved and so were difficult to interpret.
>>>>
>>>
>>> Thanks, it's unnecessary for us to collect more logs as its soft lockup
>>> scenario should be just what Partha described.
>>>
>>> Regards,
>>> Ying
>>>
>>>
>>>
>>>> Thanks for getting a patch together so quickly.
>>>>
>>>> JT
>>>>
>>>> On Wed, Nov 16, 2016 at 10:23 PM, Parthasarathy Bhuvaragan <
>>>> parthasarathy.bhuvara...@ericsson.com> wrote:
>>>>
>>>> Hi Ying / John,
>>>>>
>>>>> The soft lock is the call chain of tipc_nametbl_withdraw(), when it
>>>>> performs the tipc_conn_kref_release() as it tries to grab nametbl_lock
>>>>> again while holding it already.
>>>>>
>>>>>> tipc_nametbl_withdraw
>>>>>>   spin_lock_bh(&tn->nametbl_lock);
>>>>>>   tipc_nametbl_remove_publ
>>>>>>      spin_lock_bh(&seq->lock);
>>>>>>      tipc_nameseq_remove_publ
>>>>>>        tipc_subscrp_report_overlap
>>>>>>          tipc_subscrp_send_event
>>>>>>             tipc_conn_sendmsg
>>>>>>
>>>>> << Here, the (test_bit(CF_CONNECTED, &con->flags)) Fails, leading to
>>>>> the
>>>>> else case where do do a conn_put() and that triggers the cleanup as
>>>>> refcount reached 0. Leading the call chain below : >>
>>>>> tipc_conn_kref_release
>>>>>    tipc_sock_release
>>>>>      tipc_conn_release
>>>>>         tipc_subscrb_delete
>>>>>            tipc_subscrp_delete
>>>>>               tipc_nametbl_unsubscribe
>>>>>                  spin_lock_bh(&tn->nametbl_lock);  << !! Soft Lockup
>>>>> >>
>>>>>
>>>>> One cause is that tipc_exit_net() calls first calls tipc_topsrv_stop()
>>>>> and
>>>>> then tipc_nametbl_withdraw() in scope of tipc_net_stop().
>>>>>
>>>>> The above chain will only occur in a narrow window for a given
>>>>> connection:
>>>>> CPU#1:
>>>>> tipc_nametbl_withdraw() manages to perform tipc_conn_lookup() and steps
>>>>> the refcount to 2, while in CPU#2 the following occurs:
>>>>> CPU#2:
>>>>> tipc_server_stop() calls tipc_close_conn(con). This performs a
>>>>> conn_put()
>>>>> decrementing refcount to 1.
>>>>> Now, CPU#1 continues and detects that the connection is not
>>>>> CF_CONNECTED
>>>>> and does a conn_put(), triggering the release callback.
>>>>>
>>>>> Before commit 333f796235a527, the above wont happen.
>>>>>
>>>>> /Partha
>>>>>
>>>>>
>>>>> On 11/15/2016 04:11 PM, Xue, Ying wrote:
>>>>>
>>>>> Hi John,
>>>>>>
>>>>>> Regarding the stack trace you provided below, I get the two potential
>>>>>> call chains:
>>>>>>
>>>>>> tipc_nametbl_withdraw
>>>>>>   spin_lock_bh(&tn->nametbl_lock);
>>>>>>   tipc_nametbl_remove_publ
>>>>>>      spin_lock_bh(&seq->lock);
>>>>>>      tipc_nameseq_remove_publ
>>>>>>        tipc_subscrp_report_overlap
>>>>>>          tipc_subscrp_send_event
>>>>>>             tipc_conn_sendmsg
>>>>>>                spin_lock_bh(&con->outqueue_lock);
>>>>>>                list_add_tail(&e->list, &con->outqueue);
>>>>>>
>>>>>>
>>>>>> tipc_topsrv_stop
>>>>>>   tipc_server_stop
>>>>>>     tipc_close_conn
>>>>>>       kernel_sock_shutdown
>>>>>>         tipc_subscrb_delete
>>>>>>           spin_lock_bh(&subscriber->lock);
>>>>>>           tipc_nametbl_unsubscribe(sub);
>>>>>>            spin_lock_bh(&tn->nametbl_lock);
>>>>>>
>>>>>> Although I suspect this is a revert lock issue leading to the soft
>>>>>> lockup, I am still unable to understand which lock together with
>>>>>> nametbl_lock is taken reversely on the two different paths above.
>>>>>> However, you just gave us the log printed on CPU#2, but the logs
>>>>>> outputted by other cores are also important.  So if possible, please
>>>>>> share
>>>>>> them with us.
>>>>>>
>>>>>> By the way, I agree with you, and it seems that commit 333f796235a527
>>>>>> is
>>>>>> related to the soft lockup.
>>>>>>
>>>>>> Regards,
>>>>>> Ying
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: John Thompson [mailto:thompa....@gmail.com]
>>>>>> Sent: Tuesday, November 15, 2016 8:01 AM
>>>>>> To: tipc-discussion@lists.sourceforge.net
>>>>>> Subject: [tipc-discussion] v4.7: soft lockup when releasing a socket
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am seeing an occasional kernel soft lockup.  I have TIPC v4.7 and
>>>>>> the
>>>>>> kernel dump occurs when the system is going down for a reboot.
>>>>>>
>>>>>> The kernel dump is:
>>>>>>
>>>>>> <0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [exfx:1474]
>>>>>> <6>Modules linked in: tipc jitterentropy_rng echainiv drbg
>>>>>> platform_driver(O) ipifwd(PO)
>>>>>> ...
>>>>>> <6>
>>>>>> <6>GPR00: c15333e8 a4e0fb80 a4ee3600 a51748ac 00000000 ae475024
>>>>>> a537feec
>>>>>> fffffffd
>>>>>> <6>GPR08: a2197408 00000001 00000001 00000004 80691c00 <6>NIP
>>>>>> [80691c40]
>>>>>> _raw_spin_lock_bh+0x40/0x70 <6>LR [c1534f30]
>>>>>> tipc_nametbl_unsubscribe+0x50/0x120
>>>>>> [tipc] <6>Call Trace:
>>>>>> <6>[a4e0fba0] [c15333e8] tipc_named_reinit+0xf8/0x820 [tipc]
>>>>>> <6>[a4e0fbb0] [c15336a0] tipc_named_reinit+0x3b0/0x820 [tipc]
>>>>>> <6>[a4e0fbd0]
>>>>>> [c1540bac] tipc_nl_publ_dump+0x50c/0xed0 [tipc] <6>[a4e0fc00]
>>>>>> [c154164c]
>>>>>> tipc_conn_sendmsg+0xdc/0x170 [tipc] <6>[a4e0fc30] [c1533c9c]
>>>>>> tipc_subscrp_report_overlap+0xbc/0xd0 [tipc] <6>[a4e0fc70] [c153425c]
>>>>>> tipc_topsrv_stop+0x45c/0x4f0 [tipc] <6>[a4e0fca0] [c1534788]
>>>>>> tipc_nametbl_remove_publ+0x58/0x110 [tipc] <6>[a4e0fcd0] [c1534c48]
>>>>>> tipc_nametbl_withdraw+0x68/0x140 [tipc] <6>[a4e0fd00] [c153cc24]
>>>>>> tipc_nl_node_dump_link+0x1904/0x45d0 [tipc] <6>[a4e0fd30] [c153d838]
>>>>>> tipc_nl_node_dump_link+0x2518/0x45d0 [tipc] <6>[a4e0fd70] [804f2870]
>>>>>> sock_release+0x30/0xf0 <6>[a4e0fd80] [804f2944] sock_close+0x14/0x30
>>>>>> <6>[a4e0fd90] [80105844] __fput+0x94/0x200 <6>[a4e0fdb0] [8003dca4]
>>>>>> task_work_run+0xd4/0x100 <6>[a4e0fdd0] [80023620] do_exit+0x280/0x980
>>>>>> <6>[a4e0fe10] [80024c48] do_group_exit+0x48/0xb0 <6>[a4e0fe30]
>>>>>> [80030344]
>>>>>> get_signal+0x244/0x4f0 <6>[a4e0fe80] [80007734] do_signal+0x34/0x1c0
>>>>>> <6>[a4e0ff30] [800079a8] do_notify_resume+0x68/0x80 <6>[a4e0ff40]
>>>>>> [8000fa1c] do_user_signal+0x74/0xc4
>>>>>>
>>>>>>
>>>>>> From the stack dump it looks like tipc_named_reinit is trying to
>>>>>>
>>>>>>>
>>>>>>> acquire nametbl_lock.
>>>>>>
>>>>>> From looking at the call chain I can see that tipc_conn_sendmsg can
>>>>>>
>>>>>>>
>>>>>>> send up calling conn_put
>>>>>>
>>>>>> which will go on and call the tipc_named_reinit via tipc_sock_release.
>>>>>>
>>>>>> As tipc_nametbl_withdraw (from the stack dump) has already acquired
>>>>>> the
>>>>>> nametbl_lock, tipc_named_reinit
>>>>>>
>>>>>> cannot get it and so the process hangs.
>>>>>>
>>>>>>
>>>>>> The call to tipc_sock_release (added in Commit 333f796235a527
>>>>>> <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux-
>>>>>> stable.git/commit/?id=333f796235a52727db7e0a13888045f3aa3d5335>)
>>>>>> seems to have changed the behaviour
>>>>>>
>>>>>> such that it tries to do a lot more when shutting the connection down.
>>>>>>
>>>>>>
>>>>>> If there is other information I can provide please let me know.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> John
>>>>>> ------------------------------------------------------------
>>>>>> ------------------
>>>>>> _______________________________________________
>>>>>> tipc-discussion mailing list
>>>>>> tipc-discussion@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>>>>>> ------------------------------------------------------------
>>>>>> ------------------
>>>>>> _______________________________________________
>>>>>> tipc-discussion mailing list
>>>>>> tipc-discussion@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>>>>>>
>>>>>>
>>>>>>
>>>>> ------------------------------------------------------------
>>>> ------------------
>>>> _______________________________________________
>>>> tipc-discussion mailing list
>>>> tipc-discussion@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>>>>
>>>>
>>>
>>
>
------------------------------------------------------------------------------
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion

Reply via email to