Hi John,

Ok. Can you test with the 2 patches I posted regarding the nametbl_lock?

/Partha

On 11/21/2016 11:32 PM, John Thompson wrote:
> Hi Partha,
>
> My test has 4 nodes, 2 of which are alternately rebooting.  When the
> rebooted node rejoins, a few minutes pass and then the other node is
> rebooted.
> I am not printing out link stats and believe that the other code is
> not doing so either when nodes leave or rejoin.
>
> JT
>
>
> On Tue, Nov 22, 2016 at 2:22 AM, Parthasarathy Bhuvaragan
> <parthasarathy.bhuvara...@ericsson.com> wrote:
>
>     Hi,
>
>     There is another branch where a softlockup on nametbl_lock occurs.
>
>     tipc_named_rcv() Grabs nametbl_lock
>       tipc_update_nametbl() (publish/withdraw)
>         tipc_node_subscribe()/unsubscribe()
>           tipc_node_write_unlock()
>              << lockup occurs if it needs to process a NODE UP/DOWN or
>              LINK UP/DOWN event, as it grabs nametbl_lock again >>
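>
>     To make the failure mode concrete, here is a minimal user-space analogue
>     of that pattern (plain C with pthreads; the function names are made up
>     for illustration, this is not TIPC code). A non-recursive lock is taken,
>     and a callback further down the same call path tries to take it again,
>     so the CPU spins on itself forever - which is what the watchdog later
>     reports as a soft lockup:
>
>       /* deadlock_analogue.c - illustrative only, not TIPC code */
>       #include <pthread.h>
>       #include <stdio.h>
>
>       static pthread_spinlock_t nametbl_lock;
>
>       /* plays the role of the deferred NODE/LINK UP/DOWN processing in
>        * tipc_node_write_unlock(): it calls back into the "name table" */
>       static void notify_link_event(void)
>       {
>               /* second acquisition: undefined per POSIX, in practice the
>                * thread just spins here forever */
>               pthread_spin_lock(&nametbl_lock);
>               printf("never reached\n");
>               pthread_spin_unlock(&nametbl_lock);
>       }
>
>       /* plays the role of tipc_named_rcv()/tipc_update_nametbl(): it already
>        * holds nametbl_lock when the node notification fires */
>       static void named_rcv(void)
>       {
>               pthread_spin_lock(&nametbl_lock);  /* first acquisition */
>               notify_link_event();               /* re-enters the same lock */
>               pthread_spin_unlock(&nametbl_lock);
>       }
>
>       int main(void)
>       {
>               pthread_spin_init(&nametbl_lock, PTHREAD_PROCESS_PRIVATE);
>               named_rcv();                       /* never returns */
>               return 0;
>       }
>
>     In the kernel the spinning CPU makes no progress, so after ~20s the
>     watchdog prints the "soft lockup" splat seen in the logs below.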
>
>     /Partha
>
>
>     On 11/21/2016 01:04 PM, Parthasarathy Bhuvaragan wrote:
>
>         Hi,
>
>         tipc_nametbl_withdraw() triggers the softlockup as it tries to grab
>         nametbl_lock twice if the node triggered a TIPC_NOTIFY_LINK_DOWN event
>         while it is running. The erroneous call chain is:
>
>           tipc_nametbl_withdraw() Grab nametbl_lock
>             tipc_named_process_backlog()
>               tipc_update_nametbl()
>                 if (dtype == WITHDRAWAL) tipc_node_unsubscribe()
>                   tipc_node_write_unlock()
>                     if (flags & TIPC_NOTIFY_LINK_DOWN)
>         tipc_nametbl_withdraw()
>                        spin_lock_bh(&tn->nametbl_lock);  << Soft Lockup >>
>
>         Three callers can cause this during module exit:
>
>         Case1:
>           tipc_exit_net()
>             tipc_nametbl_withdraw() Grab nametbl_lock
>
>         Case2:
>           tipc_server_stop()
>             tipc_conn_kref_release
>               tipc_sock_release
>                 sock_release()
>                   tipc_release()
>                     tipc_sk_withdraw()
>                       tipc_nametbl_withdraw()
>
>         Case3:
>           tipc_server_stop()
>             tipc_conn_kref_release()
>               kernel_bind()
>                 tipc_bind()
>                   tipc_sk_withdraw()
>                     tipc_nametbl_withdraw()
>
>         I will work on a solution for this.
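>
>         One generic way to break this kind of self-deadlock is to avoid calling
>         back into the name table while nametbl_lock is held: record the pending
>         publish/withdraw work under the lock, and only process it after the lock
>         has been released. A rough user-space sketch of that idea (plain C,
>         illustrative only; the names are made up and this is not necessarily
>         what the final patch will look like):
>
>           /* defer_sketch.c - illustrative pattern only */
>           #include <pthread.h>
>           #include <stdio.h>
>
>           static pthread_mutex_t nametbl_lock = PTHREAD_MUTEX_INITIALIZER;
>
>           struct deferred_event {
>                   int link_down;               /* e.g. TIPC_NOTIFY_LINK_DOWN */
>           };
>
>           /* must be called WITHOUT nametbl_lock held; it takes the lock itself */
>           static void nametbl_withdraw(void)
>           {
>                   pthread_mutex_lock(&nametbl_lock);
>                   printf("withdraw processed\n");
>                   pthread_mutex_unlock(&nametbl_lock);
>           }
>
>           static void named_rcv(void)
>           {
>                   struct deferred_event ev = { .link_down = 0 };
>
>                   pthread_mutex_lock(&nametbl_lock);
>                   /* ... update the name table ... */
>                   ev.link_down = 1;            /* only record the event here */
>                   pthread_mutex_unlock(&nametbl_lock);
>
>                   if (ev.link_down)            /* process it after unlock */
>                           nametbl_withdraw();
>           }
>
>           int main(void)
>           {
>                   named_rcv();
>                   return 0;
>           }
>
>         The point is only the ordering: the second nametbl_lock acquisition
>         happens after the first one has been released, never nested inside it.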
>
>         What kind of test were you performing when this occurred
>         (link up/down)?
>         Do you read link statistics periodically in your tests?
>
>         /Partha
>
>         On 11/21/2016 05:30 AM, John Thompson wrote:
>
>             Hi Partha,
>
>             I was doing some more testing today and have still observed the
>             problem (contrary to what I had emailed earlier).
>
>             Here is the kernel dump.
>
>             <0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
>             [Pluggable
>             Serve:2221]
>             <6>Modules linked in: tipc jitterentropy_rng echainiv drbg
>             platform_driver(O)
>             <6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P           O
>             <6>task: ae54ced0 ti: aec42000 task.ti: aec42000
>             <6>NIP: 8069257c LR: c13ebf50 CTR: 80692540
>             <6>REGS: aec43ad0 TRAP: 0901   Tainted: P           O
>             <6>MSR: 00029002 <CE,EE,ME>  CR: 48002444  XER: 00000000
>             <6>
>             <6>GPR00: c13ea408 aec43b80 ae54ced0 a624690c 00000000
>             a6271d84 a39a60cc
>             fffffffd
>             <6>GPR08: aeefbbc8 00000001 00000001 00000004 80692540
>             <6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
>             <6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>             <6>Call Trace:
>             <6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
>             <6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
>             <6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
>             <6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>             <6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>             <6>[aec43c30] [c13eacbc]
>             tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>             <6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>             <6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110
>             [tipc]
>             <6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
>             <6>[aec43d00] [c13f3c34]
>             tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>             <6>[aec43d30] [c13f4848]
>             tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>
>         TIPC_CMD_SHOW_LINK_STATS or TIPC_NL_LINK_GET
>
>             <6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
>             <6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
>             <6>[aec43d90] [80105844] __fput+0x94/0x200
>             <6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
>             <6>[aec43dd0] [80023620] do_exit+0x280/0x980
>             <6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
>             <6>[aec43e30] [80030344] get_signal+0x244/0x4f0
>             <6>[aec43e80] [80007734] do_signal+0x34/0x1c0
>             <6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
>             <6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
>             <6>--- interrupt: c00 at 0xf4f3d08
>             <6>    LR = 0xf4f3ce8
>             <6>Instruction dump:
>             <6>912a0008 39400001 7d201828 2c090000 40820010 7d40192d
>             40a2fff0 7c2004ac
>             <6>2f890000 4dbe0020 7c210b78 81230000 <2f890000> 40befff4
>             7c421378
>             7d201828
>             <0>Kernel panic - not syncing: softlockup: hung tasks
>             <6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P
>                O L
>             <6>Call Trace:
>             <6>[aec43930] [80694e20] dump_stack+0x84/0xb0 (unreliable)
>             <6>[aec43940] [80692ca8] panic+0xd8/0x214
>             <6>[aec439a0] [800a0258] watchdog_timer_fn+0x2d8/0x2e0
>             <6>[aec439f0] [8007ae58] __hrtimer_run_queues+0x118/0x1d0
>             <6>[aec43a30] [8007b608] hrtimer_interrupt+0xd8/0x270
>             <6>[aec43a80] [8000983c] __timer_interrupt+0xac/0x1b0
>             <6>[aec43aa0] [80009b70] timer_interrupt+0xb0/0xe0
>             <6>[aec43ac0] [8000f450] ret_from_except+0x0/0x18
>             <6>--- interrupt: 901 at _raw_spin_lock_bh+0x3c/0x70
>             <6>    LR = tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>             <6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
>             <6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
>             <6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
>             <6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>             <6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>             <6>[aec43c30] [c13eacbc]
>             tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>             <6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>             <6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110
>             [tipc]
>             <6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
>             <6>[aec43d00] [c13f3c34]
>             tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>             <6>[aec43d30] [c13f4848]
>             tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>             <6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
>             <6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
>             <6>[aec43d90] [80105844] __fput+0x94/0x200
>             <6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
>             <6>[aec43dd0] [80023620] do_exit+0x280/0x980
>             <6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
>             <6>[aec43e30] [80030344] get_signal+0x244/0x4f0
>             <6>[aec43e80] [80007734] do_signal+0x34/0x1c0
>             <6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
>             <6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
>             <6>--- interrupt: c00 at 0xf4f3d08
>             <6>    LR = 0xf4f3ce8
>
>
>             On Mon, Nov 21, 2016 at 9:59 AM, John Thompson
>             <thompa....@gmail.com> wrote:
>
>                 Hi Partha,
>
>                 In my testing over the weekend the patch performed well - I
>                 didn't see any kernel dumps due to this issue.
>
>                 Thanks for the quick response.
>                 JT
>
>
>                 On Fri, Nov 18, 2016 at 10:34 AM, John Thompson
>                 <thompa....@gmail.com> wrote:
>
>                     Hi,
>
>                     I will be able to have some test results by the start of
>                     next week on the first patch.
>
>                     Regards,
>                     JT
>
>
>                     On Thu, Nov 17, 2016 at 11:27 PM, Ying Xue
>                     <ying....@windriver.com> wrote:
>
>                         On 11/17/2016 07:04 AM, John Thompson wrote:
>
>                             Hi Partha / Ying,
>
>                             I will try out the patch and let you know how it
>                             goes.
>                             I also note, regarding the other CPUs' core dumps:
>                             in one of my cases I didn't have them, and in the
>                             others I did, but they were interleaved and so were
>                             difficult to interpret.
>
>
>                         Thanks, it's unnecessary for us to collect more logs,
>                         as the soft lockup scenario should be just what Partha
>                         described.
>
>                         Regards,
>                         Ying
>
>
>
>                             Thanks for getting a patch together so quickly.
>
>                             JT
>
>                             On Wed, Nov 16, 2016 at 10:23 PM, Parthasarathy
>                             Bhuvaragan <parthasarathy.bhuvara...@ericsson.com>
>                             wrote:
>
>                                 Hi Ying / John,
>
>                                 The soft lockup is in the call chain of
>                                 tipc_nametbl_withdraw(): when it ends up
>                                 performing tipc_conn_kref_release(), it tries to
>                                 grab nametbl_lock again while already holding it.
>
>                                     tipc_nametbl_withdraw
>                                       spin_lock_bh(&tn->nametbl_lock);
>                                       tipc_nametbl_remove_publ
>                                          spin_lock_bh(&seq->lock);
>                                          tipc_nameseq_remove_publ
>                                            tipc_subscrp_report_overlap
>                                              tipc_subscrp_send_event
>                                                 tipc_conn_sendmsg
>
>                                 << Here the test_bit(CF_CONNECTED, &con->flags)
>                                 check fails, leading to the else case where we
>                                 do a conn_put(), and that triggers the cleanup
>                                 as the refcount reaches 0, leading to the call
>                                 chain below: >>
>                                 tipc_conn_kref_release
>                                    tipc_sock_release
>                                      tipc_conn_release
>                                         tipc_subscrb_delete
>                                            tipc_subscrp_delete
>                                               tipc_nametbl_unsubscribe
>
>              spin_lock_bh(&tn->nametbl_lock);
>                                 << !! Soft Lockup >>
>
>                                 One cause is that tipc_exit_net() first calls
>                                 tipc_topsrv_stop() and then
>                                 tipc_nametbl_withdraw() within the scope of
>                                 tipc_net_stop().
>
>                                 The above chain will only occur in a narrow
>                                 window for a given connection:
>                                 CPU#1:
>                                 tipc_nametbl_withdraw() manages to perform
>                                 tipc_conn_lookup() and bumps the refcount to 2,
>                                 while on CPU#2 the following occurs:
>                                 CPU#2:
>                                 tipc_server_stop() calls tipc_close_conn(con),
>                                 which performs a conn_put() decrementing the
>                                 refcount to 1.
>                                 Now CPU#1 continues, detects that the connection
>                                 is not CF_CONNECTED, and does a conn_put(),
>                                 triggering the release callback.
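>
>                                 A single-threaded re-enactment of that
>                                 interleaving (plain C, illustrative only; the
>                                 struct and helper names are made up, the real
>                                 code uses kref and conn_put()) shows how the
>                                 release callback ends up running while
>                                 nametbl_lock is still held:
>
>                                   /* race_window.c - illustrative only */
>                                   #include <pthread.h>
>                                   #include <stdio.h>
>
>                                   static pthread_mutex_t nametbl_lock =
>                                           PTHREAD_MUTEX_INITIALIZER;
>
>                                   struct conn {
>                                           int refcount;
>                                           int connected;  /* CF_CONNECTED */
>                                   };
>
>                                   static void conn_release(struct conn *con)
>                                   {
>                                           /* tipc_conn_kref_release() -> ... ->
>                                            * tipc_nametbl_unsubscribe() needs
>                                            * nametbl_lock, which CPU#1 still
>                                            * holds => soft lockup */
>                                           printf("release of conn (ref %d) under nametbl_lock\n",
>                                                  con->refcount);
>                                   }
>
>                                   static void conn_put(struct conn *con)
>                                   {
>                                           if (--con->refcount == 0)
>                                                   conn_release(con);
>                                   }
>
>                                   int main(void)
>                                   {
>                                           struct conn con = { 1, 1 };
>
>                                           /* CPU#1: tipc_nametbl_withdraw() */
>                                           pthread_mutex_lock(&nametbl_lock);
>                                           /* CPU#1: tipc_conn_lookup(), ref = 2 */
>                                           con.refcount++;
>
>                                           /* CPU#2: tipc_close_conn(), ref = 1 */
>                                           con.connected = 0;
>                                           conn_put(&con);
>
>                                           /* CPU#1: !CF_CONNECTED, ref = 0 ->
>                                            * release runs under the lock */
>                                           if (!con.connected)
>                                                   conn_put(&con);
>
>                                           pthread_mutex_unlock(&nametbl_lock);
>                                           return 0;
>                                   }
>
>                                 Outside this window the server side would still
>                                 hold a reference, so CPU#1's conn_put() would
>                                 not reach zero and the release (and the second
>                                 nametbl_lock acquisition) would happen later,
>                                 outside the lock.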
>
>                                 Before commit 333f796235a527, the above would
>                                 not happen.
>
>                                 /Partha
>
>
>                                 On 11/15/2016 04:11 PM, Xue, Ying wrote:
>
>                                     Hi John,
>
>                                     Regarding the stack trace you provided
>                                     below, I see two potential call chains:
>
>                                     tipc_nametbl_withdraw
>                                       spin_lock_bh(&tn->nametbl_lock);
>                                       tipc_nametbl_remove_publ
>                                          spin_lock_bh(&seq->lock);
>                                          tipc_nameseq_remove_publ
>                                            tipc_subscrp_report_overlap
>                                              tipc_subscrp_send_event
>                                                 tipc_conn_sendmsg
>
>              spin_lock_bh(&con->outqueue_lock);
>                                                    list_add_tail(&e->list,
>                                     &con->outqueue);
>
>
>                                     tipc_topsrv_stop
>                                       tipc_server_stop
>                                         tipc_close_conn
>                                           kernel_sock_shutdown
>                                             tipc_subscrb_delete
>
>             spin_lock_bh(&subscriber->lock);
>                                               tipc_nametbl_unsubscribe(sub);
>
>              spin_lock_bh(&tn->nametbl_lock);
>
>                                     Although I suspect this is a lock inversion
>                                     issue leading to the soft lockup, I am still
>                                     unable to see which lock, together with
>                                     nametbl_lock, is taken in the reverse order
>                                     on the two different paths above.
>                                     However, you only gave us the log printed on
>                                     CPU#2; the logs output by the other cores
>                                     are also important, so if possible please
>                                     share them with us.
>
>                                     By the way, I agree with you, and it seems
>                                     that commit 333f796235a527 is related to the
>                                     soft lockup.
>
>                                     Regards,
>                                     Ying
>
>                                     -----Original Message-----
>                                     From: John Thompson [mailto:thompa....@gmail.com]
>                                     Sent: Tuesday, November 15, 2016 8:01 AM
>                                     To: tipc-discussion@lists.sourceforge.net
>                                     Subject: [tipc-discussion] v4.7: soft lockup
>                                     when releasing a socket
>
>                                     Hi,
>
>                                     I am seeing an occasional kernel soft
>                                     lockup.  I have TIPC v4.7, and the kernel
>                                     dump occurs when the system is going down
>                                     for a reboot.
>
>                                     The kernel dump is:
>
>                                     <0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [exfx:1474]
>                                     <6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O) ipifwd(PO)
>                                     ...
>                                     <6>
>                                     <6>GPR00: c15333e8 a4e0fb80 a4ee3600 a51748ac 00000000 ae475024 a537feec fffffffd
>                                     <6>GPR08: a2197408 00000001 00000001 00000004 80691c00
>                                     <6>NIP [80691c40] _raw_spin_lock_bh+0x40/0x70
>                                     <6>LR [c1534f30] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>                                     <6>Call Trace:
>                                     <6>[a4e0fba0] [c15333e8] tipc_named_reinit+0xf8/0x820 [tipc]
>                                     <6>[a4e0fbb0] [c15336a0] tipc_named_reinit+0x3b0/0x820 [tipc]
>                                     <6>[a4e0fbd0] [c1540bac] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>                                     <6>[a4e0fc00] [c154164c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>                                     <6>[a4e0fc30] [c1533c9c] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>                                     <6>[a4e0fc70] [c153425c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>                                     <6>[a4e0fca0] [c1534788] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
>                                     <6>[a4e0fcd0] [c1534c48] tipc_nametbl_withdraw+0x68/0x140 [tipc]
>                                     <6>[a4e0fd00] [c153cc24] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>                                     <6>[a4e0fd30] [c153d838] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>                                     <6>[a4e0fd70] [804f2870] sock_release+0x30/0xf0
>                                     <6>[a4e0fd80] [804f2944] sock_close+0x14/0x30
>                                     <6>[a4e0fd90] [80105844] __fput+0x94/0x200
>                                     <6>[a4e0fdb0] [8003dca4] task_work_run+0xd4/0x100
>                                     <6>[a4e0fdd0] [80023620] do_exit+0x280/0x980
>                                     <6>[a4e0fe10] [80024c48] do_group_exit+0x48/0xb0
>                                     <6>[a4e0fe30] [80030344] get_signal+0x244/0x4f0
>                                     <6>[a4e0fe80] [80007734] do_signal+0x34/0x1c0
>                                     <6>[a4e0ff30] [800079a8] do_notify_resume+0x68/0x80
>                                     <6>[a4e0ff40] [8000fa1c] do_user_signal+0x74/0xc4
>
>
>                                     From the stack dump it looks like
>                                     tipc_named_reinit is trying to acquire
>                                     nametbl_lock.
>
>                                     From looking at the call chain I can see that
>                                     tipc_conn_sendmsg can end up calling conn_put,
>                                     which will go on to call tipc_named_reinit via
>                                     tipc_sock_release.
>
>                                     As tipc_nametbl_withdraw (from the stack dump)
>                                     has already acquired the nametbl_lock,
>                                     tipc_named_reinit cannot get it and so the
>                                     process hangs.
>
>
>                                     The call to tipc_sock_release (added in
>                                     commit 333f796235a527
>                                     <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux-stable.git/commit/?id=333f796235a52727db7e0a13888045f3aa3d5335>)
>                                     seems to have changed the behaviour such
>                                     that it tries to do a lot more when shutting
>                                     the connection down.
>
>
>                                     If there is other information I can provide,
>                                     please let me know.
>
>                                     Regards,
>
>                                     John
>

------------------------------------------------------------------------------
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
