Hi Partha,

In my testing over the weekend the patch performed well - I didn't see any
kernel dumps due to this issue.

Thanks for the quick response.
JT


On Fri, Nov 18, 2016 at 10:34 AM, John Thompson <thompa....@gmail.com>
wrote:

> Hi,
>
> I will be able to have some test results by the start of next week on the
> first patch.
>
> Regards,
> JT
>
>
> On Thu, Nov 17, 2016 at 11:27 PM, Ying Xue <ying....@windriver.com> wrote:
>
>> On 11/17/2016 07:04 AM, John Thompson wrote:
>>
>>> Hi Partha / Ying,
>>>
>>> I will try out the patch and let you know how it goes.
>>> I also note the request for the other CPUs' core dumps: in one of my
>>> cases I didn't have them, and in the others I did, but they were
>>> interleaved and so difficult to interpret.
>>>
>>
>> Thanks. There is no need for us to collect more logs, as the soft lockup
>> scenario should be exactly what Partha described.
>>
>> Regards,
>> Ying
>>
>>
>>
>>> Thanks for getting a patch together so quickly.
>>>
>>> JT
>>>
>>> On Wed, Nov 16, 2016 at 10:23 PM, Parthasarathy Bhuvaragan <
>>> parthasarathy.bhuvara...@ericsson.com> wrote:
>>>
>>> Hi Ying / John,
>>>>
>>>> The soft lockup is in the call chain of tipc_nametbl_withdraw(): when it
>>>> ends up in tipc_conn_kref_release(), it tries to grab nametbl_lock
>>>> again while already holding it.
>>>>
>>>>> tipc_nametbl_withdraw
>>>>>   spin_lock_bh(&tn->nametbl_lock);
>>>>>   tipc_nametbl_remove_publ
>>>>>      spin_lock_bh(&seq->lock);
>>>>>      tipc_nameseq_remove_publ
>>>>>        tipc_subscrp_report_overlap
>>>>>          tipc_subscrp_send_event
>>>>>             tipc_conn_sendmsg
>>>>>
>>>> << Here, the (test_bit(CF_CONNECTED, &con->flags)) check fails, leading
>>>> to the else case, where we do a conn_put(); that triggers the cleanup
>>>> because the refcount has reached 0, leading to the call chain below: >>
>>>> tipc_conn_kref_release
>>>>    tipc_sock_release
>>>>      tipc_conn_release
>>>>         tipc_subscrb_delete
>>>>            tipc_subscrp_delete
>>>>               tipc_nametbl_unsubscribe
>>>>                  spin_lock_bh(&tn->nametbl_lock);  << !! Soft Lockup >>
>>>>
>>>> One cause is that tipc_exit_net() first calls tipc_topsrv_stop() and
>>>> then tipc_nametbl_withdraw() in the scope of tipc_net_stop().
>>>>
>>>> The above chain will only occur in a narrow window for a given
>>>> connection:
>>>> CPU#1:
>>>> tipc_nametbl_withdraw() manages to perform tipc_conn_lookup() and steps
>>>> the refcount to 2, while on CPU#2 the following occurs:
>>>> CPU#2:
>>>> tipc_server_stop() calls tipc_close_conn(con). This performs a
>>>> conn_put(), decrementing the refcount to 1.
>>>> Now, CPU#1 continues, detects that the connection is not CF_CONNECTED,
>>>> and does a conn_put(), triggering the release callback.
>>>>
>>>> Before commit 333f796235a527, the above could not happen.
>>>>
>>>> /Partha
>>>>
>>>>
>>>> On 11/15/2016 04:11 PM, Xue, Ying wrote:
>>>>
>>>> Hi John,
>>>>>
>>>>> Regarding the stack trace you provided below, I get the two potential
>>>>> call chains:
>>>>>
>>>>> tipc_nametbl_withdraw
>>>>>   spin_lock_bh(&tn->nametbl_lock);
>>>>>   tipc_nametbl_remove_publ
>>>>>      spin_lock_bh(&seq->lock);
>>>>>      tipc_nameseq_remove_publ
>>>>>        tipc_subscrp_report_overlap
>>>>>          tipc_subscrp_send_event
>>>>>             tipc_conn_sendmsg
>>>>>                spin_lock_bh(&con->outqueue_lock);
>>>>>                list_add_tail(&e->list, &con->outqueue);
>>>>>
>>>>>
>>>>> tipc_topsrv_stop
>>>>>   tipc_server_stop
>>>>>     tipc_close_conn
>>>>>       kernel_sock_shutdown
>>>>>         tipc_subscrb_delete
>>>>>           spin_lock_bh(&subscriber->lock);
>>>>>           tipc_nametbl_unsubscribe(sub);
>>>>>            spin_lock_bh(&tn->nametbl_lock);
>>>>>
>>>>> Although I suspect a lock-ordering inversion is causing the soft
>>>>> lockup, I still cannot work out which lock is taken in the opposite
>>>>> order relative to nametbl_lock on the two paths above.
>>>>> However, you only gave us the log printed on CPU#2, and the logs
>>>>> from the other cores are also important. So if possible, please share
>>>>> them with us.
>>>>>
>>>>> By the way, I agree with you: commit 333f796235a527 does seem to be
>>>>> related to the soft lockup.
>>>>>
>>>>> Regards,
>>>>> Ying
>>>>>
>>>>> -----Original Message-----
>>>>> From: John Thompson [mailto:thompa....@gmail.com]
>>>>> Sent: Tuesday, November 15, 2016 8:01 AM
>>>>> To: tipc-discussion@lists.sourceforge.net
>>>>> Subject: [tipc-discussion] v4.7: soft lockup when releasing a socket
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am seeing an occasional kernel soft lockup.  I have TIPC v4.7 and the
>>>>> kernel dump occurs when the system is going down for a reboot.
>>>>>
>>>>> The kernel dump is:
>>>>>
>>>>> <0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [exfx:1474]
>>>>> <6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O) ipifwd(PO)
>>>>> ...
>>>>> <6>
>>>>> <6>GPR00: c15333e8 a4e0fb80 a4ee3600 a51748ac 00000000 ae475024 a537feec fffffffd
>>>>> <6>GPR08: a2197408 00000001 00000001 00000004 80691c00
>>>>> <6>NIP [80691c40] _raw_spin_lock_bh+0x40/0x70
>>>>> <6>LR [c1534f30] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>>>>> <6>Call Trace:
>>>>> <6>[a4e0fba0] [c15333e8] tipc_named_reinit+0xf8/0x820 [tipc]
>>>>> <6>[a4e0fbb0] [c15336a0] tipc_named_reinit+0x3b0/0x820 [tipc]
>>>>> <6>[a4e0fbd0] [c1540bac] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>>>>> <6>[a4e0fc00] [c154164c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>>>>> <6>[a4e0fc30] [c1533c9c] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>>>>> <6>[a4e0fc70] [c153425c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>>>>> <6>[a4e0fca0] [c1534788] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
>>>>> <6>[a4e0fcd0] [c1534c48] tipc_nametbl_withdraw+0x68/0x140 [tipc]
>>>>> <6>[a4e0fd00] [c153cc24] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>>>>> <6>[a4e0fd30] [c153d838] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>>>>> <6>[a4e0fd70] [804f2870] sock_release+0x30/0xf0
>>>>> <6>[a4e0fd80] [804f2944] sock_close+0x14/0x30
>>>>> <6>[a4e0fd90] [80105844] __fput+0x94/0x200
>>>>> <6>[a4e0fdb0] [8003dca4] task_work_run+0xd4/0x100
>>>>> <6>[a4e0fdd0] [80023620] do_exit+0x280/0x980
>>>>> <6>[a4e0fe10] [80024c48] do_group_exit+0x48/0xb0
>>>>> <6>[a4e0fe30] [80030344] get_signal+0x244/0x4f0
>>>>> <6>[a4e0fe80] [80007734] do_signal+0x34/0x1c0
>>>>> <6>[a4e0ff30] [800079a8] do_notify_resume+0x68/0x80
>>>>> <6>[a4e0ff40] [8000fa1c] do_user_signal+0x74/0xc4
>>>>>
>>>>>
>>>>> From the stack dump it looks like tipc_named_reinit is trying to
>>>>> acquire nametbl_lock.
>>>>>
>>>>> From looking at the call chain I can see that tipc_conn_sendmsg can
>>>>> end up calling conn_put, which will go on to call tipc_named_reinit
>>>>> via tipc_sock_release.
>>>>>
>>>>> As tipc_nametbl_withdraw (from the stack dump) has already acquired
>>>>> nametbl_lock, tipc_named_reinit cannot get it and so the process hangs.
>>>>>
>>>>>
>>>>> The call to tipc_sock_release (added in commit 333f796235a527
>>>>> <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux-stable.git/commit/?id=333f796235a52727db7e0a13888045f3aa3d5335>)
>>>>> seems to have changed the behaviour such that it tries to do a lot
>>>>> more when shutting the connection down.
>>>>>
>>>>>
>>>>> If there is other information I can provide please let me know.
>>>>>
>>>>> Regards,
>>>>>
>>>>> John
>>>>> ------------------------------------------------------------
>>>>> ------------------
>>>>> _______________________________________________
>>>>> tipc-discussion mailing list
>>>>> tipc-discussion@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>>>>>
>>>>>
>>>>>
>>>
>>>
>>
>