Hi Ying,

Do you have any comments on v2 of the 3 patches that fix this issue?


/Partha

________________________________
From: John Thompson <thompa....@gmail.com>
Sent: Monday, November 28, 2016 8:21:57 PM
To: Parthasarathy Bhuvaragan
Cc: Ying Xue; tipc-discussion@lists.sourceforge.net
Subject: Re: [tipc-discussion] v4.7: soft lockup when releasing a socket

Hi Partha,

I tested with the latest 3 patches last night and observed no soft lockups.

Thanks,
John


On Fri, Nov 25, 2016 at 11:50 AM, John Thompson <thompa....@gmail.com> wrote:
Hi Partha,

I rebuilt afresh and retried the test, with the same lockup kernel dumps.
Yes, I have multiple tipc clients subscribed to the topology server, at least
10 clients.
They all use a subscription timeout of TIPC_WAIT_FOREVER.
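
Roughly the standard topology-server subscription pattern, e.g. (a simplified
sketch only; the name type/range and filter below are placeholders, not our
exact client code):

    #include <string.h>
    #include <sys/socket.h>
    #include <linux/tipc.h>

    static int subscribe_forever(__u32 name_type)
    {
            struct sockaddr_tipc topsrv;
            struct tipc_subscr sub;
            int sd = socket(AF_TIPC, SOCK_SEQPACKET, 0);

            if (sd < 0)
                    return -1;

            memset(&topsrv, 0, sizeof(topsrv));
            topsrv.family = AF_TIPC;
            topsrv.addrtype = TIPC_ADDR_NAME;
            topsrv.addr.name.name.type = TIPC_TOP_SRV;
            topsrv.addr.name.name.instance = TIPC_TOP_SRV;
            if (connect(sd, (struct sockaddr *)&topsrv, sizeof(topsrv)) < 0)
                    return -1;

            memset(&sub, 0, sizeof(sub));
            sub.seq.type = name_type;          /* placeholder name type */
            sub.seq.lower = 0;
            sub.seq.upper = ~0U;
            sub.timeout = TIPC_WAIT_FOREVER;   /* subscription never expires */
            sub.filter = TIPC_SUB_SERVICE;     /* placeholder filter */
            if (send(sd, &sub, sizeof(sub), 0) != sizeof(sub))
                    return -1;

            return sd;  /* caller then recv()s struct tipc_event updates */
    }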

I will try the kernel command line parameter next week.
JT


On Fri, Nov 25, 2016 at 3:07 AM, Parthasarathy Bhuvaragan
<parthasarathy.bhuvara...@ericsson.com> wrote:
Hi John,

Do you have several tipc clients subscribed to topology server?
What subscription timeout do they use?

Please enable kernel command line parameter:
softlockup_all_cpu_backtrace=1
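
(This makes the watchdog dump backtraces from all CPUs, not only the CPU that
detected the lockup. If the box is already up, I believe the same knob can be
toggled at run time with "echo 1 > /proc/sys/kernel/softlockup_all_cpu_backtrace",
assuming your kernel exposes that sysctl; otherwise add the parameter to the
boot arguments and reboot.)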

/Partha

On 11/23/2016 11:04 PM, John Thompson wrote:
Hi Partha,

I tested overnight with the 2 patches you provided yesterday.
Testing is still showing problems; here is one of the soft lockups, the
other is the same as the one I sent the other day.
I am going to redo my build, as I expected some change in behaviour with
your patches.

It is possible that I am doing some dumping of nodes or links, as I am
not certain of all the code or paths.
I have found that we do a tipc-config -nt and tipc-config -ls in some
situations, but it shouldn't be initiated in this reboot case.

<0>NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [pimd:1220]

<0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [rpc.13:2419]
<6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
<6>CPU: 0 PID: 2419 Comm: rpc.13 Tainted: P           O
<6>task: aed76d20 ti: ae70c000 task.ti: ae70c000
<6>NIP: 8069257c LR: c13ebc4c CTR: 80692540
<6>REGS: ae70dc20 TRAP: 0901   Tainted: P           O
<6>MSR: 00029002 <CE,EE,ME>  CR: 42002484  XER: 20000000
<6>
<6>GPR00: c13f3c34 ae70dcd0 aed76d20 ae55c8ec 00002711 00000005 8666592a 8666592b
<6>GPR08: ae9dad20 00000001 00000001 00000000 80692540
<6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
<6>LR [c13ebc4c] tipc_nametbl_withdraw+0x4c/0x140 [tipc]
<6>Call Trace:
<6>[ae70dcd0] [a85d99a0] 0xa85d99a0 (unreliable)
<6>[ae70dd00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[ae70dd30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[ae70dd70] [804f29e0] sock_release+0x30/0xf0
<6>[ae70dd80] [804f2ab4] sock_close+0x14/0x30
<6>[ae70dd90] [80105844] __fput+0x94/0x200
<6>[ae70ddb0] [8003dca4] task_work_run+0xd4/0x100
<6>[ae70ddd0] [80023620] do_exit+0x280/0x980
<6>[ae70de10] [80024c48] do_group_exit+0x48/0xb0
<6>[ae70de30] [80030344] get_signal+0x244/0x4f0
<6>[ae70de80] [80007734] do_signal+0x34/0x1c0
<6>[ae70df30] [800079a8] do_notify_resume+0x68/0x80
<6>[ae70df40] [8000fa1c] do_user_signal+0x74/0xc4
<6>--- interrupt: c00 at 0xfe490c4
<6>    LR = 0xfe490a0
<6>Instruction dump:
<6>912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0 7c2004ac
<6>2f890000 4dbe0020 7c210b78 81230000 <2f890000> 40befff4 7c421378 7d201828

<0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [AIS listener:1600]
<6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
<6>CPU: 2 PID: 1600 Comm: AIS listener Tainted: P           O
<6>task: aee3ced0 ti: ae686000 task.ti: ae686000
<6>NIP: 80692578 LR: c13ebf50 CTR: 80692540
<6>REGS: ae687ad0 TRAP: 0901   Tainted: P           O
<6>MSR: 00029002 <CE,EE,ME>  CR: 48002444  XER: 00000000
<6>
<6>GPR00: c13ea408 ae687b80 aee3ced0 ae55c8ec 00000000 a30e7264 ae5e070c fffffffd
<6>GPR08: ae72fbc8 00000001 00000001 00000004 80692540
<6>NIP [80692578] _raw_spin_lock_bh+0x38/0x70
<6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
<6>Call Trace:
<6>[ae687b80] [800fa258] check_object+0xc8/0x270 (unreliable)
<6>[ae687ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
<6>[ae687bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
<6>[ae687bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
<6>[ae687c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
<6>[ae687c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
<6>[ae687c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
<6>[ae687ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
<6>[ae687cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
<6>[ae687d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[ae687d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[ae687d70] [804f29e0] sock_release+0x30/0xf0
<6>[ae687d80] [804f2ab4] sock_close+0x14/0x30
<6>[ae687d90] [80105844] __fput+0x94/0x200
<6>[ae687db0] [8003dca4] task_work_run+0xd4/0x100
<6>[ae687dd0] [80023620] do_exit+0x280/0x980
<6>[ae687e10] [80024c48] do_group_exit+0x48/0xb0
<6>[ae687e30] [80030344] get_signal+0x244/0x4f0
<6>[ae687e80] [80007734] do_signal+0x34/0x1c0
<6>[ae687f30] [800079a8] do_notify_resume+0x68/0x80
<6>[ae687f40] [8000fa1c] do_user_signal+0x74/0xc4
<6>--- interrupt: c00 at 0xf4f3d08
<6>    LR = 0xf4f3ce8
<6>Instruction dump:
<6>39290200 912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0
<6>7c2004ac 2f890000 4dbe0020 7c210b78 <81230000> 2f890000 40befff4 7c421378

<0>Kernel panic - not syncing: softlockup: hung tasks
<6>CPU: 0 PID: 2419 Comm: rpc.13 Tainted: P           O L
<6>Call Trace:
<6>[ae70da80] [80694e20] dump_stack+0x84/0xb0 (unreliable)
<6>[ae70da90] [80692ca8] panic+0xd8/0x214
<6>[ae70daf0] [800a0258] watchdog_timer_fn+0x2d8/0x2e0
<6>[ae70db40] [8007ae58] __hrtimer_run_queues+0x118/0x1d0
<6>[ae70db80] [8007b608] hrtimer_interrupt+0xd8/0x270
<6>[ae70dbd0] [8000983c] __timer_interrupt+0xac/0x1b0
<6>[ae70dbf0] [80009b70] timer_interrupt+0xb0/0xe0
<6>[ae70dc10] [8000f450] ret_from_except+0x0/0x18
<6>--- interrupt: 901 at _raw_spin_lock_bh+0x3c/0x70
<6>    LR = tipc_nametbl_withdraw+0x4c/0x140 [tipc]
<6>[ae70dcd0] [a85d99a0] 0xa85d99a0 (unreliable)
<6>[ae70dd00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[ae70dd30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[ae70dd70] [804f29e0] sock_release+0x30/0xf0
<6>[ae70dd80] [804f2ab4] sock_close+0x14/0x30
<6>[ae70dd90] [80105844] __fput+0x94/0x200
<6>[ae70ddb0] [8003dca4] task_work_run+0xd4/0x100
<6>[ae70ddd0] [80023620] do_exit+0x280/0x980
<6>[ae70de10] [80024c48] do_group_exit+0x48/0xb0
<6>[ae70de30] [80030344] get_signal+0x244/0x4f0
<6>[ae70de80] [80007734] do_signal+0x34/0x1c0
<6>[ae70df30] [800079a8] do_notify_resume+0x68/0x80
<6>[ae70df40] [8000fa1c] do_user_signal+0x74/0xc4
<6>--- interrupt: c00 at 0xfe490c4
<6>    LR = 0xfe490a0



On Wed, Nov 23, 2016 at 9:16 PM, Parthasarathy Bhuvaragan
<parthasarathy.bhuvara...@ericsson.com> wrote:

    Hi John,

    Ok. Can you test with the 2 patches I posted regarding the nametbl_lock?

    /Partha

    On 11/21/2016 11:32 PM, John Thompson wrote:

        Hi Partha,

        My test has 4 nodes, 2 of which are alternately rebooted.  When the
        rebooted node rejoins, a few minutes pass and then the other node is
        rebooted.
        I am not printing out link stats, and believe that the other code is
        not doing so either when nodes leave or rejoin.

        JT


        On Tue, Nov 22, 2016 at 2:22 AM, Parthasarathy Bhuvaragan
        <parthasarathy.bhuvara...@ericsson.com> wrote:

            Hi,

            There is another branch where a softlockup on nametbl_lock
            occurs:

            tipc_named_rcv() grabs nametbl_lock
              tipc_update_nametbl() (publish/withdraw)
                tipc_node_subscribe()/unsubscribe()
                  tipc_node_write_unlock()
                     << lockup occurs if it needs to process NODE UP/DOWN or
                        LINK UP/DOWN, as it grabs nametbl_lock again >>

            /Partha


            On 11/21/2016 01:04 PM, Parthasarathy Bhuvaragan wrote:

                Hi,

                tipc_nametbl_withdraw() triggers the softlockup, as it tries to
                grab nametbl_lock twice if the node triggered a
                TIPC_NOTIFY_LINK_DOWN event while it is running. The erroneous
                call chain is:

                  tipc_nametbl_withdraw() grabs nametbl_lock
                    tipc_named_process_backlog()
                      tipc_update_nametbl()
                        if (dtype == WITHDRAWAL) tipc_node_unsubscribe()
                          tipc_node_write_unlock()
                            if (flags & TIPC_NOTIFY_LINK_DOWN)
                              tipc_nametbl_withdraw()
                                spin_lock_bh(&tn->nametbl_lock);  << Soft Lockup >>
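
                In other words, the same CPU calls spin_lock_bh() on a lock it
                already holds. A minimal sketch of that pattern (illustrative
                only, not the actual TIPC code):

                  /* needs <linux/spinlock.h> */
                  static DEFINE_SPINLOCK(lock);      /* stands in for nametbl_lock */

                  static void notify_link_down(void)
                  {
                          spin_lock_bh(&lock);       /* second acquisition */
                          /* ... */
                          spin_unlock_bh(&lock);
                  }

                  static void withdraw(void)
                  {
                          spin_lock_bh(&lock);       /* first acquisition */
                          notify_link_down();        /* spins here forever; the
                                                        watchdog then reports the
                                                        soft lockup */
                          spin_unlock_bh(&lock);
                  }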

                Three callers can cause this under module exit:

                Case1:
                  tipc_exit_net()
                    tipc_nametbl_withdraw() grabs nametbl_lock

                Case2:
                  tipc_server_stop()
                    tipc_conn_kref_release
                      tipc_sock_release
                        sock_release()
                          tipc_release()
                            tipc_sk_withdraw()
                              tipc_nametbl_withdraw()

                Case3:
                  tipc_server_stop()
                    tipc_conn_kref_release()
                      kernel_bind()
                        tipc_bind()
                          tipc_sk_withdraw()
                            tipc_nametbl_withdraw()

                I will work on a solution for this.

                What kind of test were you performing when this occurred
                (linkup/down)?
                Do you read link statistics periodically in your tests?

                /Partha

                On 11/21/2016 05:30 AM, John Thompson wrote:

                    Hi Partha,

                    I was doing some more testing today and have still observed
                    the problem (contrary to what I had emailed earlier).

                    Here is the kernel dump.

                    <0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [Pluggable Serve:2221]
                    <6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
                    <6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P           O
                    <6>task: ae54ced0 ti: aec42000 task.ti: aec42000
                    <6>NIP: 8069257c LR: c13ebf50 CTR: 80692540
                    <6>REGS: aec43ad0 TRAP: 0901   Tainted: P           O
                    <6>MSR: 00029002 <CE,EE,ME>  CR: 48002444  XER: 00000000
                    <6>
                    <6>GPR00: c13ea408 aec43b80 ae54ced0 a624690c 00000000 a6271d84 a39a60cc fffffffd
                    <6>GPR08: aeefbbc8 00000001 00000001 00000004 80692540
                    <6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
                    <6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
                    <6>Call Trace:
                    <6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
                    <6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
                    <6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
                    <6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
                    <6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
                    <6>[aec43c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
                    <6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
                    <6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
                    <6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
                    <6>[aec43d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
                    <6>[aec43d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]

                TIPC_CMD_SHOW_LINK_STATS or TIPC_NL_LINK_GET

                    <6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
                    <6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
                    <6>[aec43d90] [80105844] __fput+0x94/0x200
                    <6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
                    <6>[aec43dd0] [80023620] do_exit+0x280/0x980
                    <6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
                    <6>[aec43e30] [80030344] get_signal+0x244/0x4f0
                    <6>[aec43e80] [80007734] do_signal+0x34/0x1c0
                    <6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
                    <6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
                    <6>--- interrupt: c00 at 0xf4f3d08
                    <6>    LR = 0xf4f3ce8
                    <6>Instruction dump:
                    <6>912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0 7c2004ac
                    <6>2f890000 4dbe0020 7c210b78 81230000 <2f890000> 40befff4 7c421378 7d201828
                    <0>Kernel panic - not syncing: softlockup: hung tasks
                    <6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P           O L
                    <6>Call Trace:
                    <6>[aec43930] [80694e20] dump_stack+0x84/0xb0 (unreliable)
                    <6>[aec43940] [80692ca8] panic+0xd8/0x214
                    <6>[aec439a0] [800a0258] watchdog_timer_fn+0x2d8/0x2e0
                    <6>[aec439f0] [8007ae58] __hrtimer_run_queues+0x118/0x1d0
                    <6>[aec43a30] [8007b608] hrtimer_interrupt+0xd8/0x270
                    <6>[aec43a80] [8000983c] __timer_interrupt+0xac/0x1b0
                    <6>[aec43aa0] [80009b70] timer_interrupt+0xb0/0xe0
                    <6>[aec43ac0] [8000f450] ret_from_except+0x0/0x18
                    <6>--- interrupt: 901 at _raw_spin_lock_bh+0x3c/0x70
                    <6>    LR = tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
                    <6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
                    <6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
                    <6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
                    <6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
                    <6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
                    <6>[aec43c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
                    <6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
                    <6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
                    <6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
                    <6>[aec43d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
                    <6>[aec43d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
                    <6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
                    <6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
                    <6>[aec43d90] [80105844] __fput+0x94/0x200
                    <6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
                    <6>[aec43dd0] [80023620] do_exit+0x280/0x980
                    <6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
                    <6>[aec43e30] [80030344] get_signal+0x244/0x4f0
                    <6>[aec43e80] [80007734] do_signal+0x34/0x1c0
                    <6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
                    <6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
                    <6>--- interrupt: c00 at 0xf4f3d08
                    <6>    LR = 0xf4f3ce8


                    On Mon, Nov 21, 2016 at 9:59 AM, John Thompson
                    <thompa....@gmail.com> wrote:

                        Hi Partha,

                        In my testing over the weekend the patch performed well
                        - I didn't see any kernel dumps due to this issue.

                        Thanks for the quick response.
                        JT


                        On Fri, Nov 18, 2016 at 10:34 AM, John Thompson
                        <thompa....@gmail.com> wrote:

                            Hi,

                            I will be able to have some test results by the
                            start of next week on the first patch.

                            Regards,
                            JT


                            On Thu, Nov 17, 2016 at 11:27 PM, Ying Xue
                            <ying....@windriver.com> wrote:

                                On 11/17/2016 07:04 AM, John Thompson wrote:

                                    Hi Partha / Ying,

                                    I will try out the patch and let you know how it goes.
                                    Regarding providing the other CPU core dumps: in one of my cases I
                                    didn't have them; in others I did, but they were interleaved and so
                                    were difficult to interpret.


                                Thanks. It's unnecessary for us to collect more logs, as the soft
                                lockup scenario should be just what Partha described.

                                Regards,
                                Ying



                                    Thanks for getting a patch together
        so quickly.

                                    JT

                                    On Wed, Nov 16, 2016 at 10:23 PM, Parthasarathy Bhuvaragan
                                    <parthasarathy.bhuvara...@ericsson.com> wrote:

                                        Hi Ying / John,

                                        The soft lockup is in the call chain of
                                        tipc_nametbl_withdraw(): when it performs tipc_conn_kref_release(),
                                        it tries to grab nametbl_lock again while already holding it.

                                            tipc_nametbl_withdraw
                                              spin_lock_bh(&tn->nametbl_lock);
                                              tipc_nametbl_remove_publ
                                                 spin_lock_bh(&seq->lock);
                                                 tipc_nameseq_remove_publ
                                                   tipc_subscrp_report_overlap
                                                     tipc_subscrp_send_event
                                                        tipc_conn_sendmsg

                                        << Here, the test_bit(CF_CONNECTED, &con->flags) fails, leading to
                                        the else case where we do a conn_put(), and that triggers the
                                        cleanup as the refcount reached 0, leading to the call chain
                                        below: >>

                                            tipc_conn_kref_release
                                               tipc_sock_release
                                                 tipc_conn_release
                                                    tipc_subscrb_delete
                                                       tipc_subscrp_delete
                                                         tipc_nametbl_unsubscribe
                                                           spin_lock_bh(&tn->nametbl_lock);
                                        << !! Soft Lockup >>

                                        One cause is that tipc_exit_net() first calls tipc_topsrv_stop()
                                        and then tipc_nametbl_withdraw() in the scope of tipc_net_stop().

                                        The above chain will only occur in a narrow window for a given
                                        connection:
                                        CPU#1:
                                        tipc_nametbl_withdraw() manages to perform tipc_conn_lookup() and
                                        steps the refcount to 2, while on CPU#2 the following occurs:
                                        CPU#2:
                                        tipc_server_stop() calls tipc_close_conn(con). This performs a
                                        conn_put(), decrementing the refcount to 1.
                                        Now CPU#1 continues, detects that the connection is not
                                        CF_CONNECTED, and does a conn_put(), triggering the release
                                        callback.

                                        Before commit 333f796235a527, the above would not happen.
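
                                        The structural pattern, as a minimal illustrative sketch (not the
                                        actual TIPC code; the names here are made up):

                                          /* needs <linux/spinlock.h> and <linux/kref.h> */
                                          static DEFINE_SPINLOCK(tbl_lock);   /* stands in for nametbl_lock */

                                          static void conn_release(struct kref *kref)
                                          {
                                                  spin_lock_bh(&tbl_lock);    /* deadlocks if the caller of
                                                                                 kref_put() already holds it */
                                                  /* ... unsubscribe, free ... */
                                                  spin_unlock_bh(&tbl_lock);
                                          }

                                          static void withdraw(struct kref *conn_ref)
                                          {
                                                  spin_lock_bh(&tbl_lock);
                                                  /* if this put drops the last reference, conn_release()
                                                     runs right here, under tbl_lock */
                                                  kref_put(conn_ref, conn_release);
                                                  spin_unlock_bh(&tbl_lock);
                                          }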

                                        /Partha


                                        On 11/15/2016 04:11 PM, Xue, Ying wrote:

                                            Hi John,

                                            Regarding the stack trace you provided below, I get the two
                                            potential call chains:

                                            tipc_nametbl_withdraw
                                              spin_lock_bh(&tn->nametbl_lock);
                                              tipc_nametbl_remove_publ
                                                 spin_lock_bh(&seq->lock);
                                                 tipc_nameseq_remove_publ
                                                   tipc_subscrp_report_overlap
                                                     tipc_subscrp_send_event
                                                        tipc_conn_sendmsg
                                                          spin_lock_bh(&con->outqueue_lock);
                                                          list_add_tail(&e->list, &con->outqueue);

                                            tipc_topsrv_stop
                                              tipc_server_stop
                                                tipc_close_conn
                                                  kernel_sock_shutdown
                                                    tipc_subscrb_delete
                                                      spin_lock_bh(&subscriber->lock);
                                                      tipc_nametbl_unsubscribe(sub);
                                                        spin_lock_bh(&tn->nametbl_lock);

                                            Although I suspect a reverse lock ordering issue is leading to
                                            the soft lockup, I am still unable to understand which lock,
                                            together with nametbl_lock, is taken in the reverse order on the
                                            two different paths above.
                                            However, you just gave us the log printed on CPU#2; the logs
                                            output by the other cores are also important. So if possible,
                                            please share them with us.

                                            By the way, I agree with you, and it seems that commit
                                            333f796235a527 is related to the soft lockup.

                                            Regards,
                                            Ying

                                            -----Original Message-----
                                            From: John Thompson [mailto:thompa....@gmail.com]
                                            Sent: Tuesday, November 15, 2016 8:01 AM
                                            To: tipc-discussion@lists.sourceforge.net
                                            Subject: [tipc-discussion] v4.7: soft lockup when releasing a socket

                                            Hi,

                                            I am seeing an occasional kernel soft lockup.  I have TIPC v4.7
                                            and the kernel dump occurs when the system is going down for a
                                            reboot.

                                            The kernel dump is:

                                            <0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [exfx:1474]
                                            <6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O) ipifwd(PO)
                                            ...
                                            <6>
                                            <6>GPR00: c15333e8 a4e0fb80 a4ee3600 a51748ac 00000000 ae475024 a537feec fffffffd
                                            <6>GPR08: a2197408 00000001 00000001 00000004 80691c00
                                            <6>NIP [80691c40] _raw_spin_lock_bh+0x40/0x70
                                            <6>LR [c1534f30] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
                                            <6>Call Trace:
                                            <6>[a4e0fba0] [c15333e8] tipc_named_reinit+0xf8/0x820 [tipc]
                                            <6>[a4e0fbb0] [c15336a0] tipc_named_reinit+0x3b0/0x820 [tipc]
                                            <6>[a4e0fbd0] [c1540bac] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
                                            <6>[a4e0fc00] [c154164c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
                                            <6>[a4e0fc30] [c1533c9c] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
                                            <6>[a4e0fc70] [c153425c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
                                            <6>[a4e0fca0] [c1534788] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
                                            <6>[a4e0fcd0] [c1534c48] tipc_nametbl_withdraw+0x68/0x140 [tipc]
                                            <6>[a4e0fd00] [c153cc24] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
                                            <6>[a4e0fd30] [c153d838] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
                                            <6>[a4e0fd70] [804f2870] sock_release+0x30/0xf0
                                            <6>[a4e0fd80] [804f2944] sock_close+0x14/0x30
                                            <6>[a4e0fd90] [80105844] __fput+0x94/0x200
                                            <6>[a4e0fdb0] [8003dca4] task_work_run+0xd4/0x100
                                            <6>[a4e0fdd0] [80023620] do_exit+0x280/0x980
                                            <6>[a4e0fe10] [80024c48] do_group_exit+0x48/0xb0
                                            <6>[a4e0fe30] [80030344] get_signal+0x244/0x4f0
                                            <6>[a4e0fe80] [80007734] do_signal+0x34/0x1c0
                                            <6>[a4e0ff30] [800079a8] do_notify_resume+0x68/0x80
                                            <6>[a4e0ff40] [8000fa1c] do_user_signal+0x74/0xc4


                                            From the stack dump it looks like tipc_named_reinit is trying to
                                            acquire nametbl_lock.

                                            From looking at the call chain I can see that tipc_conn_sendmsg
                                            can end up calling conn_put, which will go on and call
                                            tipc_named_reinit via tipc_sock_release.

                                            As tipc_nametbl_withdraw (from the stack dump) has already
                                            acquired the nametbl_lock, tipc_named_reinit cannot get it and
                                            so the process hangs.

                                            The call to tipc_sock_release (added in commit 333f796235a527
                                            <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux-stable.git/commit/?id=333f796235a52727db7e0a13888045f3aa3d5335>)
                                            seems to have changed the behaviour such that it tries to do a
                                            lot more when shutting the connection down.

                                            If there is other information I can provide please let me know.

                                            Regards,

                                            John


------------------------------------------------------------------------------
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
