Hi Partha,

My test has 4 nodes, 2 of which reboot alternately. When the rebooted node rejoins, a few minutes pass and then the other node is rebooted. I am not printing out link stats, and I believe the other code is not doing so either when nodes leave or rejoin.

JT
On Tue, Nov 22, 2016 at 2:22 AM, Parthasarathy Bhuvaragan <parthasarathy.bhuvara...@ericsson.com> wrote:

> Hi,
>
> There is another branch where the softlockup on nametbl_lock occurs:
>
> tipc_named_rcv()                          Grabs nametbl_lock
>   tipc_update_nametbl() (publish/withdraw)
>     tipc_node_subscribe()/unsubscribe()
>       tipc_node_write_unlock()
>         << lockup occurs if it needs to process NODE UP/DOWN or
>            LINK UP/DOWN, as it grabs nametbl_lock again >>
>
> /Partha
>
> On 11/21/2016 01:04 PM, Parthasarathy Bhuvaragan wrote:
>
>> Hi,
>>
>> tipc_nametbl_withdraw() triggers the softlockup as it tries to grab
>> nametbl_lock twice if the node triggered a TIPC_NOTIFY_LINK_DOWN
>> event while it is running. The erroneous call chain is:
>>
>> tipc_nametbl_withdraw()                  Grabs nametbl_lock
>>   tipc_named_process_backlog()
>>     tipc_update_nametbl()
>>       if (dtype == WITHDRAWAL) tipc_node_unsubscribe()
>>         tipc_node_write_unlock()
>>           if (flags & TIPC_NOTIFY_LINK_DOWN) tipc_nametbl_withdraw()
>>             spin_lock_bh(&tn->nametbl_lock);   << Soft Lockup >>
>>
>> Three callers can cause this during module exit:
>>
>> Case 1:
>> tipc_exit_net()
>>   tipc_nametbl_withdraw()                Grabs nametbl_lock
>>
>> Case 2:
>> tipc_server_stop()
>>   tipc_conn_kref_release()
>>     tipc_sock_release()
>>       sock_release()
>>         tipc_release()
>>           tipc_sk_withdraw()
>>             tipc_nametbl_withdraw()
>>
>> Case 3:
>> tipc_server_stop()
>>   tipc_conn_kref_release()
>>     kernel_bind()
>>       tipc_bind()
>>         tipc_sk_withdraw()
>>           tipc_nametbl_withdraw()
>>
>> I will work on a solution for this.
>>
>> What kind of test were you performing when this occurred (link
>> up/down)? Do you read link statistics periodically in your tests?
>>
>> /Partha
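To make sure I follow the recursion you describe, here is a minimal userspace sketch of the pattern. This is hypothetical illustration code; the names only mirror the TIPC call chain and it is not the kernel source. A pthread trylock stands in for the kernel's spin_lock_bh() so the program reports the self-deadlock instead of spinning:

#include <pthread.h>
#include <stdio.h>

static pthread_spinlock_t nametbl_lock;

/* In the kernel this frame is tipc_node_write_unlock() reacting to a
 * NODE or LINK UP/DOWN event and calling back into the name table. */
static void node_unsubscribe(void)
{
        if (pthread_spin_trylock(&nametbl_lock) != 0) {
                /* spin_lock_bh() has no such escape: the CPU already
                 * holds nametbl_lock and spins forever, and the NMI
                 * watchdog reports a soft lockup. */
                printf("would deadlock: nametbl_lock already held\n");
                return;
        }
        pthread_spin_unlock(&nametbl_lock);
}

/* Publish and withdraw both funnel through here. */
static void update_nametbl(void)
{
        node_unsubscribe();
}

/* Entry point, like tipc_named_rcv() or tipc_nametbl_withdraw():
 * nametbl_lock is taken and held across the whole update. */
static void named_rcv(void)
{
        pthread_spin_lock(&nametbl_lock);
        update_nametbl();       /* re-enters the lock while held */
        pthread_spin_unlock(&nametbl_lock);
}

int main(void)
{
        pthread_spin_init(&nametbl_lock, PTHREAD_PROCESS_PRIVATE);
        named_rcv();
        pthread_spin_destroy(&nametbl_lock);
        return 0;
}

Built with "gcc -pthread", this prints the diagnostic; the kernel equivalent just spins on the lock, which looks consistent with the 22s stall in the dump below.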
>> On 11/21/2016 05:30 AM, John Thompson wrote:
>>
>>> Hi Partha,
>>>
>>> I was doing some more testing today and have still observed the
>>> problem (contrary to what I had emailed earlier).
>>>
>>> Here is the kernel dump.
>>>
>>> <0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [Pluggable Serve:2221]
>>> <6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
>>> <6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P O
>>> <6>task: ae54ced0 ti: aec42000 task.ti: aec42000
>>> <6>NIP: 8069257c LR: c13ebf50 CTR: 80692540
>>> <6>REGS: aec43ad0 TRAP: 0901 Tainted: P O
>>> <6>MSR: 00029002 <CE,EE,ME> CR: 48002444 XER: 00000000
>>> <6>
>>> <6>GPR00: c13ea408 aec43b80 ae54ced0 a624690c 00000000 a6271d84 a39a60cc fffffffd
>>> <6>GPR08: aeefbbc8 00000001 00000001 00000004 80692540
>>> <6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
>>> <6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>>> <6>Call Trace:
>>> <6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
>>> <6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
>>> <6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
>>> <6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>>> <6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>>> <6>[aec43c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>>> <6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>>> <6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
>>> <6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
>>> <6>[aec43d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>>> <6>[aec43d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>>
>> TIPC_CMD_SHOW_LINK_STATS or TIPC_NL_LINK_GET
>>
>>> <6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
>>> <6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
>>> <6>[aec43d90] [80105844] __fput+0x94/0x200
>>> <6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
>>> <6>[aec43dd0] [80023620] do_exit+0x280/0x980
>>> <6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
>>> <6>[aec43e30] [80030344] get_signal+0x244/0x4f0
>>> <6>[aec43e80] [80007734] do_signal+0x34/0x1c0
>>> <6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
>>> <6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
>>> <6>--- interrupt: c00 at 0xf4f3d08
>>> <6> LR = 0xf4f3ce8
>>> <6>Instruction dump:
>>> <6>912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0 7c2004ac
>>> <6>2f890000 4dbe0020 7c210b78 81230000 <2f890000> 40befff4 7c421378 7d201828
>>> <0>Kernel panic - not syncing: softlockup: hung tasks
>>> <6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P O L
>>> <6>Call Trace:
>>> <6>[aec43930] [80694e20] dump_stack+0x84/0xb0 (unreliable)
>>> <6>[aec43940] [80692ca8] panic+0xd8/0x214
>>> <6>[aec439a0] [800a0258] watchdog_timer_fn+0x2d8/0x2e0
>>> <6>[aec439f0] [8007ae58] __hrtimer_run_queues+0x118/0x1d0
>>> <6>[aec43a30] [8007b608] hrtimer_interrupt+0xd8/0x270
>>> <6>[aec43a80] [8000983c] __timer_interrupt+0xac/0x1b0
>>> <6>[aec43aa0] [80009b70] timer_interrupt+0xb0/0xe0
>>> <6>[aec43ac0] [8000f450] ret_from_except+0x0/0x18
>>> <6>--- interrupt: 901 at _raw_spin_lock_bh+0x3c/0x70
>>> <6> LR = tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>>> <6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
>>> <6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
>>> <6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
>>> <6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>>> <6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>>> <6>[aec43c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>>> <6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>>> <6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
>>> <6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
>>> <6>[aec43d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>>> <6>[aec43d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>>> <6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
>>> <6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
>>> <6>[aec43d90] [80105844] __fput+0x94/0x200
>>> <6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
>>> <6>[aec43dd0] [80023620] do_exit+0x280/0x980
>>> <6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
>>> <6>[aec43e30] [80030344] get_signal+0x244/0x4f0
>>> <6>[aec43e80] [80007734] do_signal+0x34/0x1c0
>>> <6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
>>> <6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
>>> <6>--- interrupt: c00 at 0xf4f3d08
>>> <6> LR = 0xf4f3ce8
>>>
>>> On Mon, Nov 21, 2016 at 9:59 AM, John Thompson <thompa....@gmail.com> wrote:
>>>
>>> Hi Partha,
>>>
>>> In my testing over the weekend the patch performed well - I didn't
>>> see any kernel dumps due to this issue.
>>>
>>> Thanks for the quick response.
>>> JT
>>>
>>> On Fri, Nov 18, 2016 at 10:34 AM, John Thompson <thompa....@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I will be able to have some test results by the start of next week
>>> on the first patch.
>>>
>>> Regards,
>>> JT
>>>
>>> On Thu, Nov 17, 2016 at 11:27 PM, Ying Xue <ying....@windriver.com> wrote:
>>>
>>> On 11/17/2016 07:04 AM, John Thompson wrote:
>>>
>>> Hi Partha / Ying,
>>>
>>> I will try out the patch and let you know how it goes.
>>> I also note your point about providing the other CPU core dumps -
>>> in one of my cases I didn't have them, but in others I did and they
>>> were interleaved and so were difficult to interpret.
>>>
>>> Thanks, it's unnecessary for us to collect more logs, as the soft
>>> lockup scenario should be just what Partha described.
>>>
>>> Regards,
>>> Ying
>>>
>>> Thanks for getting a patch together so quickly.
>>>
>>> JT
>>>
>>> On Wed, Nov 16, 2016 at 10:23 PM, Parthasarathy Bhuvaragan
>>> <parthasarathy.bhuvara...@ericsson.com> wrote:
>>>
>>> Hi Ying / John,
>>>
>>> The soft lockup is in the call chain of tipc_nametbl_withdraw(),
>>> when it performs tipc_conn_kref_release() and tries to grab
>>> nametbl_lock again while already holding it.
>>>
>>> tipc_nametbl_withdraw()
>>>   spin_lock_bh(&tn->nametbl_lock);
>>>   tipc_nametbl_remove_publ()
>>>     spin_lock_bh(&seq->lock);
>>>     tipc_nameseq_remove_publ()
>>>       tipc_subscrp_report_overlap()
>>>         tipc_subscrp_send_event()
>>>           tipc_conn_sendmsg()
>>>
>>> << Here the test_bit(CF_CONNECTED, &con->flags) fails, leading to
>>> the else case where we do a conn_put(), and that triggers the
>>> cleanup as the refcount reaches 0, leading to the call chain
>>> below: >>
>>>
>>> tipc_conn_kref_release()
>>>   tipc_sock_release()
>>>     tipc_conn_release()
>>>       tipc_subscrb_delete()
>>>         tipc_subscrp_delete()
>>>           tipc_nametbl_unsubscribe()
>>>             spin_lock_bh(&tn->nametbl_lock);   << !! Soft Lockup >>
>>>
>>> One cause is that tipc_exit_net() first calls tipc_topsrv_stop()
>>> and then tipc_nametbl_withdraw() in the scope of tipc_net_stop().
>>>
>>> The above chain will only occur in a narrow window for a given
>>> connection:
>>>
>>> CPU#1: tipc_nametbl_withdraw() manages to perform
>>> tipc_conn_lookup() and steps the refcount to 2, while on CPU#2 the
>>> following occurs:
>>> CPU#2: tipc_server_stop() calls tipc_close_conn(con).
>>> This performs a conn_put(), decrementing the refcount to 1.
>>> Now CPU#1 continues and detects that the connection is not
>>> CF_CONNECTED, so it does a conn_put(), triggering the release
>>> callback.
>>>
>>> Before commit 333f796235a527, the above won't happen.
>>>
>>> /Partha
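If I understand the race correctly, the dangerous step is that the final conn_put() runs while nametbl_lock is still held, so the release path re-enters the lock. Here is a small single-threaded userspace sketch of that sequence, with hypothetical names modelled on your chain (not the real TIPC code; a mutex trylock again stands in for spin_lock_bh so it reports the would-be deadlock instead of hanging):

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t nametbl_lock = PTHREAD_MUTEX_INITIALIZER;

struct conn {
        atomic_int kref;
        bool connected;         /* stands in for CF_CONNECTED */
};

/* Release path: like tipc_conn_kref_release() ending up in
 * tipc_nametbl_unsubscribe(), which needs nametbl_lock. */
static void conn_release(struct conn *con)
{
        if (pthread_mutex_trylock(&nametbl_lock) != 0)
                printf("would deadlock: release needs nametbl_lock\n");
        else
                pthread_mutex_unlock(&nametbl_lock);
        free(con);
}

static void conn_put(struct conn *con)
{
        if (atomic_fetch_sub(&con->kref, 1) == 1)  /* refcount hit 0 */
                conn_release(con);
}

int main(void)
{
        struct conn *con = malloc(sizeof(*con));

        atomic_init(&con->kref, 2);     /* CPU#1 stepped it to 2 */
        con->connected = true;

        /* CPU#2: tipc_close_conn() clears CF_CONNECTED and puts. */
        con->connected = false;
        conn_put(con);                  /* refcount 2 -> 1 */

        /* CPU#1: the withdraw path, still holding nametbl_lock,
         * sees !CF_CONNECTED and does the final put. */
        pthread_mutex_lock(&nametbl_lock);
        if (!con->connected)
                conn_put(con);          /* 1 -> 0: release fires here */
        pthread_mutex_unlock(&nametbl_lock);
        return 0;
}

The two conn_put() calls compress into one thread what CPU#2 (tipc_close_conn) and CPU#1 (the withdraw path) do in the narrow window described above.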
>>> On 11/15/2016 04:11 PM, Xue, Ying wrote:
>>>
>>> Hi John,
>>>
>>> Regarding the stack trace you provided below, I get the two
>>> potential call chains:
>>>
>>> tipc_nametbl_withdraw()
>>>   spin_lock_bh(&tn->nametbl_lock);
>>>   tipc_nametbl_remove_publ()
>>>     spin_lock_bh(&seq->lock);
>>>     tipc_nameseq_remove_publ()
>>>       tipc_subscrp_report_overlap()
>>>         tipc_subscrp_send_event()
>>>           tipc_conn_sendmsg()
>>>             spin_lock_bh(&con->outqueue_lock);
>>>             list_add_tail(&e->list, &con->outqueue);
>>>
>>> tipc_topsrv_stop()
>>>   tipc_server_stop()
>>>     tipc_close_conn()
>>>       kernel_sock_shutdown()
>>>       tipc_subscrb_delete()
>>>         spin_lock_bh(&subscriber->lock);
>>>         tipc_nametbl_unsubscribe(sub);
>>>           spin_lock_bh(&tn->nametbl_lock);
>>>
>>> Although I suspect this is a reversed lock order issue leading to
>>> the soft lockup, I am still unable to understand which lock,
>>> together with nametbl_lock, is taken in reverse order on the two
>>> different paths above. However, you gave us only the log printed on
>>> CPU#2; the logs output by the other cores are also important, so if
>>> possible please share them with us.
>>>
>>> By the way, I agree with you: it seems that commit 333f796235a527
>>> is related to the soft lockup.
>>>
>>> Regards,
>>> Ying
>>>
>>> -----Original Message-----
>>> From: John Thompson [mailto:thompa....@gmail.com]
>>> Sent: Tuesday, November 15, 2016 8:01 AM
>>> To: tipc-discussion@lists.sourceforge.net
>>> Subject: [tipc-discussion] v4.7: soft lockup when releasing a socket
>>>
>>> Hi,
>>>
>>> I am seeing an occasional kernel soft lockup. I have TIPC v4.7 and
>>> the kernel dump occurs when the system is going down for a reboot.
>>>
>>> The kernel dump is:
>>>
>>> <0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [exfx:1474]
>>> <6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O) ipifwd(PO)
>>> ...
>>> <6>
>>> <6>GPR00: c15333e8 a4e0fb80 a4ee3600 a51748ac 00000000 ae475024 a537feec fffffffd
>>> <6>GPR08: a2197408 00000001 00000001 00000004 80691c00
>>> <6>NIP [80691c40] _raw_spin_lock_bh+0x40/0x70
>>> <6>LR [c1534f30] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>>> <6>Call Trace:
>>> <6>[a4e0fba0] [c15333e8] tipc_named_reinit+0xf8/0x820 [tipc]
>>> <6>[a4e0fbb0] [c15336a0] tipc_named_reinit+0x3b0/0x820 [tipc]
>>> <6>[a4e0fbd0] [c1540bac] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>>> <6>[a4e0fc00] [c154164c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>>> <6>[a4e0fc30] [c1533c9c] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>>> <6>[a4e0fc70] [c153425c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>>> <6>[a4e0fca0] [c1534788] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
>>> <6>[a4e0fcd0] [c1534c48] tipc_nametbl_withdraw+0x68/0x140 [tipc]
>>> <6>[a4e0fd00] [c153cc24] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>>> <6>[a4e0fd30] [c153d838] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>>> <6>[a4e0fd70] [804f2870] sock_release+0x30/0xf0
>>> <6>[a4e0fd80] [804f2944] sock_close+0x14/0x30
>>> <6>[a4e0fd90] [80105844] __fput+0x94/0x200
>>> <6>[a4e0fdb0] [8003dca4] task_work_run+0xd4/0x100
>>> <6>[a4e0fdd0] [80023620] do_exit+0x280/0x980
>>> <6>[a4e0fe10] [80024c48] do_group_exit+0x48/0xb0
>>> <6>[a4e0fe30] [80030344] get_signal+0x244/0x4f0
>>> <6>[a4e0fe80] [80007734] do_signal+0x34/0x1c0
>>> <6>[a4e0ff30] [800079a8] do_notify_resume+0x68/0x80
>>> <6>[a4e0ff40] [8000fa1c] do_user_signal+0x74/0xc4
>>>
>>> From the stack dump it looks like tipc_named_reinit is trying to
>>> acquire nametbl_lock.
>>>
>>> From looking at the call chain I can see that tipc_conn_sendmsg can
>>> end up calling conn_put, which will go on and call
>>> tipc_named_reinit via tipc_sock_release.
>>>
>>> As tipc_nametbl_withdraw (from the stack dump) has already acquired
>>> the nametbl_lock, tipc_named_reinit cannot get it and so the
>>> process hangs.
>>>
>>> The call to tipc_sock_release (added in commit 333f796235a527
>>> <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux-stable.git/commit/?id=333f796235a52727db7e0a13888045f3aa3d5335>)
>>> seems to have changed the behaviour such that it tries to do a lot
>>> more when shutting the connection down.
>>>
>>> If there is other information I can provide please let me know.
>>>
>>> Regards,
>>>
>>> John
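In case it helps while the solution is being worked on: one conventional way to break this kind of recursion is to queue the work that needs nametbl_lock on a local list while the lock is held and process it only after the lock is dropped. The sketch below shows only that general pattern, with hypothetical types and names; I am not claiming this is what the actual TIPC fix should look like:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t nametbl_lock = PTHREAD_MUTEX_INITIALIZER;

struct publication {
        struct publication *next;
        int key;
};

/* Safe only because the caller guarantees nametbl_lock is NOT held. */
static void nametbl_unsubscribe(struct publication *p)
{
        pthread_mutex_lock(&nametbl_lock);
        printf("unsubscribed publication %d\n", p->key);
        pthread_mutex_unlock(&nametbl_lock);
}

static void withdraw(struct publication *publ)
{
        struct publication *deferred = NULL;

        pthread_mutex_lock(&nametbl_lock);
        /* ... remove publ from the name table ... */
        publ->next = deferred;          /* queue instead of recursing */
        deferred = publ;
        pthread_mutex_unlock(&nametbl_lock);

        /* Process the queued work with the lock released. */
        while (deferred) {
                struct publication *p = deferred;

                deferred = p->next;
                nametbl_unsubscribe(p);
        }
}

int main(void)
{
        struct publication p = { .next = NULL, .key = 42 };

        withdraw(&p);
        return 0;
}

The same idea would apply to the release callback: have it defer the unsubscribe rather than take nametbl_lock directly.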
------------------------------------------------------------------------------
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion