Hi Partha,

I tested overnight with the two patches you provided yesterday.
Testing is still showing problems; here is one of the soft lockups, and the
other is the same as the one I sent the other day.
I am going to redo my build, as I expected some change in behaviour with
your patches.

It is possible that I am doing some dumping of nodes or links, as I am not
certain of all the code paths.
I have found that we do run tipc-config -nt and tipc-config -ls in some
situations, but that shouldn't be initiated in this reboot case.

<0>NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [pimd:1220]

<0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [rpc.13:2419]
<6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
<6>CPU: 0 PID: 2419 Comm: rpc.13 Tainted: P           O
<6>task: aed76d20 ti: ae70c000 task.ti: ae70c000
<6>NIP: 8069257c LR: c13ebc4c CTR: 80692540
<6>REGS: ae70dc20 TRAP: 0901   Tainted: P           O
<6>MSR: 00029002 <CE,EE,ME>  CR: 42002484  XER: 20000000
<6>GPR00: c13f3c34 ae70dcd0 aed76d20 ae55c8ec 00002711 00000005 8666592a 8666592b
<6>GPR08: ae9dad20 00000001 00000001 00000000 80692540
<6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
<6>LR [c13ebc4c] tipc_nametbl_withdraw+0x4c/0x140 [tipc]
<6>Call Trace:
<6>[ae70dcd0] [a85d99a0] 0xa85d99a0 (unreliable)
<6>[ae70dd00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[ae70dd30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[ae70dd70] [804f29e0] sock_release+0x30/0xf0
<6>[ae70dd80] [804f2ab4] sock_close+0x14/0x30
<6>[ae70dd90] [80105844] __fput+0x94/0x200
<6>[ae70ddb0] [8003dca4] task_work_run+0xd4/0x100
<6>[ae70ddd0] [80023620] do_exit+0x280/0x980
<6>[ae70de10] [80024c48] do_group_exit+0x48/0xb0
<6>[ae70de30] [80030344] get_signal+0x244/0x4f0
<6>[ae70de80] [80007734] do_signal+0x34/0x1c0
<6>[ae70df30] [800079a8] do_notify_resume+0x68/0x80
<6>[ae70df40] [8000fa1c] do_user_signal+0x74/0xc4
<6>--- interrupt: c00 at 0xfe490c4
<6>    LR = 0xfe490a0
<6>Instruction dump:
<6>912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0 7c2004ac
<6>2f890000 4dbe0020 7c210b78 81230000 <2f890000> 40befff4 7c421378 7d201828

<0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [AIS listener:1600]
<6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
<6>CPU: 2 PID: 1600 Comm: AIS listener Tainted: P           O
<6>task: aee3ced0 ti: ae686000 task.ti: ae686000
<6>NIP: 80692578 LR: c13ebf50 CTR: 80692540
<6>REGS: ae687ad0 TRAP: 0901   Tainted: P           O
<6>MSR: 00029002 <CE,EE,ME>  CR: 48002444  XER: 00000000
<6>GPR00: c13ea408 ae687b80 aee3ced0 ae55c8ec 00000000 a30e7264 ae5e070c fffffffd
<6>GPR08: ae72fbc8 00000001 00000001 00000004 80692540
<6>NIP [80692578] _raw_spin_lock_bh+0x38/0x70
<6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
<6>Call Trace:
<6>[ae687b80] [800fa258] check_object+0xc8/0x270 (unreliable)
<6>[ae687ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
<6>[ae687bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
<6>[ae687bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
<6>[ae687c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
<6>[ae687c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
<6>[ae687c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
<6>[ae687ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
<6>[ae687cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
<6>[ae687d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[ae687d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[ae687d70] [804f29e0] sock_release+0x30/0xf0
<6>[ae687d80] [804f2ab4] sock_close+0x14/0x30
<6>[ae687d90] [80105844] __fput+0x94/0x200
<6>[ae687db0] [8003dca4] task_work_run+0xd4/0x100
<6>[ae687dd0] [80023620] do_exit+0x280/0x980
<6>[ae687e10] [80024c48] do_group_exit+0x48/0xb0
<6>[ae687e30] [80030344] get_signal+0x244/0x4f0
<6>[ae687e80] [80007734] do_signal+0x34/0x1c0
<6>[ae687f30] [800079a8] do_notify_resume+0x68/0x80
<6>[ae687f40] [8000fa1c] do_user_signal+0x74/0xc4
<6>--- interrupt: c00 at 0xf4f3d08
<6>    LR = 0xf4f3ce8
<6>Instruction dump:
<6>39290200 912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0
<6>7c2004ac 2f890000 4dbe0020 7c210b78 <81230000> 2f890000 40befff4 7c421378

<0>Kernel panic - not syncing: softlockup: hung tasks
<6>CPU: 0 PID: 2419 Comm: rpc.13 Tainted: P           O L
<6>Call Trace:
<6>[ae70da80] [80694e20] dump_stack+0x84/0xb0 (unreliable)
<6>[ae70da90] [80692ca8] panic+0xd8/0x214
<6>[ae70daf0] [800a0258] watchdog_timer_fn+0x2d8/0x2e0
<6>[ae70db40] [8007ae58] __hrtimer_run_queues+0x118/0x1d0
<6>[ae70db80] [8007b608] hrtimer_interrupt+0xd8/0x270
<6>[ae70dbd0] [8000983c] __timer_interrupt+0xac/0x1b0
<6>[ae70dbf0] [80009b70] timer_interrupt+0xb0/0xe0
<6>[ae70dc10] [8000f450] ret_from_except+0x0/0x18
<6>--- interrupt: 901 at _raw_spin_lock_bh+0x3c/0x70
<6>    LR = tipc_nametbl_withdraw+0x4c/0x140 [tipc]
<6>[ae70dcd0] [a85d99a0] 0xa85d99a0 (unreliable)
<6>[ae70dd00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[ae70dd30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[ae70dd70] [804f29e0] sock_release+0x30/0xf0
<6>[ae70dd80] [804f2ab4] sock_close+0x14/0x30
<6>[ae70dd90] [80105844] __fput+0x94/0x200
<6>[ae70ddb0] [8003dca4] task_work_run+0xd4/0x100
<6>[ae70ddd0] [80023620] do_exit+0x280/0x980
<6>[ae70de10] [80024c48] do_group_exit+0x48/0xb0
<6>[ae70de30] [80030344] get_signal+0x244/0x4f0
<6>[ae70de80] [80007734] do_signal+0x34/0x1c0
<6>[ae70df30] [800079a8] do_notify_resume+0x68/0x80
<6>[ae70df40] [8000fa1c] do_user_signal+0x74/0xc4
<6>--- interrupt: c00 at 0xfe490c4
<6>    LR = 0xfe490a0
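
If I am reading these right, CPU#2 is trying to take nametbl_lock again in
tipc_nametbl_unsubscribe() while its own tipc_nametbl_withdraw() further up
the same chain already holds it, and CPU#0 is simply queued behind it on the
same lock. The shape of the problem, as a minimal sketch (illustrative names
only, not the real TIPC code):

    static DEFINE_SPINLOCK(nametbl_lock);

    static void name_table_unsubscribe(void)
    {
            /* Second acquire on the same CPU: a non-recursive spinlock
             * never becomes available to its own holder, so this spins
             * forever and the soft-lockup watchdog eventually fires. */
            spin_lock_bh(&nametbl_lock);
            /* ... drop the subscription ... */
            spin_unlock_bh(&nametbl_lock);
    }

    static void name_table_withdraw(void)
    {
            spin_lock_bh(&nametbl_lock);            /* first acquire */
            /* Removing the publication drops the last reference on a
             * topology server connection, and the cleanup path calls
             * back into the name table with the lock still held. */
            name_table_unsubscribe();
            spin_unlock_bh(&nametbl_lock);          /* never reached */
    }

Any other CPU calling into the name table at that point (CPU#0 here) just
spins waiting for a lock that will never be released.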



On Wed, Nov 23, 2016 at 9:16 PM, Parthasarathy Bhuvaragan <
parthasarathy.bhuvara...@ericsson.com> wrote:

> Hi John,
>
> Ok. Can you test with the 2 patches I posted regarding the nametbl_lock?
>
> /Partha
>
> On 11/21/2016 11:32 PM, John Thompson wrote:
>
>> Hi Partha,
>>
>> My test has 4 nodes, 2 of which are alternately rebooting.  When the
>> rebooted node rejoins a few minutes pass and then the other node is
>> rebooted.
>> I am not printing out link stats and believe that the other code is
>> not doing so either when nodes leave or rejoin.
>>
>> JT
>>
>>
>> On Tue, Nov 22, 2016 at 2:22 AM, Parthasarathy Bhuvaragan
>> <parthasarathy.bhuvara...@ericsson.com> wrote:
>>
>>     Hi,
>>
>>     There is another branch where a softlockup on nametbl_lock occurs.
>>
>>     tipc_named_rcv() Grabs nametbl_lock
>>       tipc_update_nametbl() (publish/withdraw)
>>         tipc_node_subscribe()/unsubscribe()
>>           tipc_node_write_unlock()
>>              << lockup occurs if it needs to process NODE UP/DOWN LINK
>>     UP/DOWN, as it grabs nametbl_lock again >>
>>
>>     /Partha
>>
>>
>>     On 11/21/2016 01:04 PM, Parthasarathy Bhuvaragan wrote:
>>
>>         Hi,
>>
>>         tipc_nametbl_withdraw() triggers the softlockup as it tries to grab
>>         nametbl_lock twice if the node triggered a TIPC_NOTIFY_LINK_DOWN event
>>         while it is running. The erroneous call chain is:
>>
>>           tipc_nametbl_withdraw() Grab nametbl_lock
>>             tipc_named_process_backlog()
>>               tipc_update_nametbl()
>>                 if (dtype == WITHDRAWAL) tipc_node_unsubscribe()
>>                   tipc_node_write_unlock()
>>                     if (flags & TIPC_NOTIFY_LINK_DOWN)
>>         tipc_nametbl_withdraw()
>>                        spin_lock_bh(&tn->nametbl_lock);  << Soft Lockup
>> >>
>>
>>         Three callers which can cause this under module exit:
>>
>>         Case1:
>>           tipc_exit_net()
>>             tipc_nametbl_withdraw() Grab nametbl_lock
>>
>>         Case2:
>>           tipc_server_stop()
>>             tipc_conn_kref_release
>>               tipc_sock_release
>>                 sock_release()
>>                   tipc_release()
>>                     tipc_sk_withdraw()
>>                       tipc_nametbl_withdraw()
>>
>>         Case3:
>>           tipc_server_stop()
>>             tipc_conn_kref_release()
>>               kernel_bind()
>>                 tipc_bind()
>>                   tipc_sk_withdraw()
>>                     tipc_nametbl_withdraw()
>>
>>         I will work on a solution for this.
>>
>>         What kind of test were you performing when this occurred
>>         (linkup/down)?
>>         Do you read link statistics periodically in your tests?
>>
>>         /Partha
>>
>>         On 11/21/2016 05:30 AM, John Thompson wrote:
>>
>>             Hi Partha,
>>
>>             I was doing some more testing today and have still observed the
>>             problem (contrary to what I had emailed earlier).
>>
>>             Here is the kernel dump.
>>
>>             <0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
>>             [Pluggable
>>             Serve:2221]
>>             <6>Modules linked in: tipc jitterentropy_rng echainiv drbg
>>             platform_driver(O)
>>             <6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P
>>  O
>>             <6>task: ae54ced0 ti: aec42000 task.ti: aec42000
>>             <6>NIP: 8069257c LR: c13ebf50 CTR: 80692540
>>             <6>REGS: aec43ad0 TRAP: 0901   Tainted: P           O
>>             <6>MSR: 00029002 <CE,EE,ME>  CR: 48002444  XER: 00000000
>>             <6>
>>             <6>GPR00: c13ea408 aec43b80 ae54ced0 a624690c 00000000
>>             a6271d84 a39a60cc
>>             fffffffd
>>             <6>GPR08: aeefbbc8 00000001 00000001 00000004 80692540
>>             <6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
>>             <6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>>             <6>Call Trace:
>>             <6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
>>             <6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
>>             <6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
>>             <6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>>             <6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>>             <6>[aec43c30] [c13eacbc]
>>             tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>>             <6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>>             <6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110
>>             [tipc]
>>             <6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140
>> [tipc]
>>             <6>[aec43d00] [c13f3c34]
>>             tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>>             <6>[aec43d30] [c13f4848]
>>             tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>>
>>         TIPC_CMD_SHOW_LINK_STATS or TIPC_NL_LINK_GET
>>
>>             <6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
>>             <6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
>>             <6>[aec43d90] [80105844] __fput+0x94/0x200
>>             <6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
>>             <6>[aec43dd0] [80023620] do_exit+0x280/0x980
>>             <6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
>>             <6>[aec43e30] [80030344] get_signal+0x244/0x4f0
>>             <6>[aec43e80] [80007734] do_signal+0x34/0x1c0
>>             <6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
>>             <6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
>>             <6>--- interrupt: c00 at 0xf4f3d08
>>             <6>    LR = 0xf4f3ce8
>>             <6>Instruction dump:
>>             <6>912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0 7c2004ac
>>             <6>2f890000 4dbe0020 7c210b78 81230000 <2f890000> 40befff4 7c421378 7d201828
>>             <0>Kernel panic - not syncing: softlockup: hung tasks
>>             <6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P
>>                O L
>>             <6>Call Trace:
>>             <6>[aec43930] [80694e20] dump_stack+0x84/0xb0 (unreliable)
>>             <6>[aec43940] [80692ca8] panic+0xd8/0x214
>>             <6>[aec439a0] [800a0258] watchdog_timer_fn+0x2d8/0x2e0
>>             <6>[aec439f0] [8007ae58] __hrtimer_run_queues+0x118/0x1d0
>>             <6>[aec43a30] [8007b608] hrtimer_interrupt+0xd8/0x270
>>             <6>[aec43a80] [8000983c] __timer_interrupt+0xac/0x1b0
>>             <6>[aec43aa0] [80009b70] timer_interrupt+0xb0/0xe0
>>             <6>[aec43ac0] [8000f450] ret_from_except+0x0/0x18
>>             <6>--- interrupt: 901 at _raw_spin_lock_bh+0x3c/0x70
>>             <6>    LR = tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>>             <6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
>>             <6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
>>             <6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
>>             <6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>>             <6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>>             <6>[aec43c30] [c13eacbc]
>>             tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>>             <6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>>             <6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110
>>             [tipc]
>>             <6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140
>> [tipc]
>>             <6>[aec43d00] [c13f3c34]
>>             tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>>             <6>[aec43d30] [c13f4848]
>>             tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>>             <6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
>>             <6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
>>             <6>[aec43d90] [80105844] __fput+0x94/0x200
>>             <6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
>>             <6>[aec43dd0] [80023620] do_exit+0x280/0x980
>>             <6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
>>             <6>[aec43e30] [80030344] get_signal+0x244/0x4f0
>>             <6>[aec43e80] [80007734] do_signal+0x34/0x1c0
>>             <6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
>>             <6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
>>             <6>--- interrupt: c00 at 0xf4f3d08
>>             <6>    LR = 0xf4f3ce8
>>
>>
>>             On Mon, Nov 21, 2016 at 9:59 AM, John Thompson
>>             <thompa....@gmail.com> wrote:
>>
>>                 Hi Partha,
>>
>>                 In my testing over the weekend the patch performed well
>>             - I didn't
>>                 see any kernel dumps due to this issue.
>>
>>                 Thanks for the quick response.
>>                 JT
>>
>>
>>                 On Fri, Nov 18, 2016 at 10:34 AM, John Thompson
>>                 <thompa....@gmail.com> wrote:
>>
>>                     Hi,
>>
>>                     I will be able to have some test results by the
>>             start of next
>>                     week on the first patch.
>>
>>                     Regards,
>>                     JT
>>
>>
>>                     On Thu, Nov 17, 2016 at 11:27 PM, Ying Xue
>>                     <ying....@windriver.com> wrote:
>>
>>                         On 11/17/2016 07:04 AM, John Thompson wrote:
>>
>>                             Hi Partha / Ying,
>>
>>                             I will try out the patch and let you know
>>             how it goes.
>>                             I also note about providing the other CPU
>>             core dumps -
>>                             in one of my cases I
>>                             didn't have them but in others I did but
>>                             they were interleaved and so were difficult
>>             to interpret.
>>
>>
>>                         Thanks, it's unnecessary for us to collect more
>>                         logs, as the soft lockup scenario should be just
>>                         what Partha described.
>>
>>                         Regards,
>>                         Ying
>>
>>
>>
>>                             Thanks for getting a patch together so
>> quickly.
>>
>>                             JT
>>
>>                             On Wed, Nov 16, 2016 at 10:23 PM, Parthasarathy Bhuvaragan
>>                             <parthasarathy.bhuvara...@ericsson.com> wrote:
>>
>>                                 Hi Ying / John,
>>
>>                                 The soft lockup is in the call chain of
>>                                 tipc_nametbl_withdraw(): when it performs
>>                                 tipc_conn_kref_release(), it tries to grab
>>                                 nametbl_lock again while already holding it.
>>
>>                                     tipc_nametbl_withdraw
>>                                       spin_lock_bh(&tn->nametbl_lock);
>>                                       tipc_nametbl_remove_publ
>>                                          spin_lock_bh(&seq->lock);
>>                                          tipc_nameseq_remove_publ
>>                                            tipc_subscrp_report_overlap
>>                                              tipc_subscrp_send_event
>>                                                 tipc_conn_sendmsg
>>
>>                                 << Here, the test_bit(CF_CONNECTED, &con->flags)
>>                                 check fails, leading to the else case where we
>>                                 do a conn_put(); that triggers the cleanup as
>>                                 the refcount reaches 0, leading to the call
>>                                 chain below: >>
>>                                 tipc_conn_kref_release
>>                                    tipc_sock_release
>>                                      tipc_conn_release
>>                                         tipc_subscrb_delete
>>                                            tipc_subscrp_delete
>>                                               tipc_nametbl_unsubscribe
>>
>>              spin_lock_bh(&tn->nametbl_lock);
>>                                 << !! Soft Lockup >>
>>
>>                                 One cause is that tipc_exit_net() first calls
>>                                 tipc_topsrv_stop() and then
>>                                 tipc_nametbl_withdraw() in the scope of
>>                                 tipc_net_stop().
>>
>>                                 The above chain will only occur in a
>>             narrow window
>>                                 for a given connection:
>>                                 CPU#1:
>>                                 tipc_nametbl_withdraw() manages to perform
>>                                 tipc_conn_lookup() and steps
>>                                 the refcount to 2, while in CPU#2 the
>>             following occurs:
>>                                 CPU#2:
>>                                 tipc_server_stop() calls
>>             tipc_close_conn(con). This
>>                                 performs a conn_put()
>>                                 decrementing refcount to 1.
>>                                 Now, CPU#1 continues and detects that
>>             the connection
>>                                 is not CF_CONNECTED
>>                                 and does a conn_put(), triggering the
>>             release callback.
>>
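>>                                 Roughly, as a timeline (an illustrative
>>                                 sketch only, not the exact code):
>>
>>                                   /* CPU#1: tipc_nametbl_withdraw(), nametbl_lock held */
>>                                   con = tipc_conn_lookup(...);          /* refcount 1 -> 2 */
>>
>>                                   /* CPU#2: tipc_server_stop() */
>>                                   tipc_close_conn(con);                 /* clears CF_CONNECTED,
>>                                                                            conn_put(): 2 -> 1 */
>>
>>                                   /* CPU#1 continues, nametbl_lock still held */
>>                                   if (!test_bit(CF_CONNECTED, &con->flags))
>>                                           conn_put(con);                /* 1 -> 0 */
>>                                   /* the release callback now runs on CPU#1 and ends in
>>                                      tipc_nametbl_unsubscribe(), which spins on the
>>                                      nametbl_lock CPU#1 already holds: soft lockup */
>>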
>>                                 Before commit 333f796235a527, the above
>>                                 won't happen.
>>
>>                                 /Partha
>>
>>
>>                                 On 11/15/2016 04:11 PM, Xue, Ying wrote:
>>
>>                                     Hi John,
>>
>>                                     Regarding the stack trace you
>>             provided below, I
>>                                     get the two potential
>>                                     call chains:
>>
>>                                     tipc_nametbl_withdraw
>>                                       spin_lock_bh(&tn->nametbl_lock);
>>                                       tipc_nametbl_remove_publ
>>                                          spin_lock_bh(&seq->lock);
>>                                          tipc_nameseq_remove_publ
>>                                            tipc_subscrp_report_overlap
>>                                              tipc_subscrp_send_event
>>                                                 tipc_conn_sendmsg
>>
>>              spin_lock_bh(&con->outqueue_lock);
>>                                                    list_add_tail(&e->list,
>>                                     &con->outqueue);
>>
>>
>>                                     tipc_topsrv_stop
>>                                       tipc_server_stop
>>                                         tipc_close_conn
>>                                           kernel_sock_shutdown
>>                                             tipc_subscrb_delete
>>
>>             spin_lock_bh(&subscriber->lock);
>>
>> tipc_nametbl_unsubscribe(sub);
>>
>>              spin_lock_bh(&tn->nametbl_lock);
>>
>>                                     Although I suspect this is a reverse-locking
>>                                     issue leading to the soft lockup, I am still
>>                                     unable to understand which lock, together
>>                                     with nametbl_lock, is taken in the reverse
>>                                     order on the two different paths above.
>>                                     However, you just gave us the log
>>             printed on
>>                                     CPU#2, but the logs
>>                                     outputted by other cores are also
>>             important.  So
>>                                     if possible, please share
>>                                     them with us.
>>
>>                                     By the way, I agree with you, and it
>>             seems that
>>                                     commit 333f796235a527 is
>>                                     related to the soft lockup.
>>
>>                                     Regards,
>>                                     Ying
>>
>>                                     -----Original Message-----
>>                                     From: John Thompson [mailto:thompa....@gmail.com]
>>                                     Sent: Tuesday, November 15, 2016 8:01
>> AM
>>                                     To: tipc-discussion@lists.sourceforge.net
>>                                     Subject: [tipc-discussion] v4.7: soft lockup when releasing a socket
>>
>>                                     Hi,
>>
>>                                     I am seeing an occasional kernel
>>             soft lockup.  I
>>                                     have TIPC v4.7 and the
>>                                     kernel dump occurs when the system
>>             is going down
>>                                     for a reboot.
>>
>>                                     The kernel dump is:
>>
>>                                     <0>NMI watchdog: BUG: soft lockup -
>>             CPU#2 stuck
>>                                     for 23s! [exfx:1474]
>>                                     <6>Modules linked in: tipc
>>             jitterentropy_rng
>>                                     echainiv drbg
>>                                     platform_driver(O) ipifwd(PO)
>>                                     ...
>>                                     <6>
>>                                     <6>GPR00: c15333e8 a4e0fb80 a4ee3600
>>             a51748ac
>>                                     00000000 ae475024 a537feec
>>                                     fffffffd
>>                                     <6>GPR08: a2197408 00000001 00000001
>>             00000004
>>                                     80691c00 <6>NIP [80691c40]
>>                                     _raw_spin_lock_bh+0x40/0x70 <6>LR
>>             [c1534f30]
>>                                     tipc_nametbl_unsubscribe+0x50/0x120
>>                                     [tipc] <6>Call Trace:
>>                                     <6>[a4e0fba0] [c15333e8]
>>                                     tipc_named_reinit+0xf8/0x820 [tipc]
>>                                     <6>[a4e0fbb0] [c15336a0]
>>                                     tipc_named_reinit+0x3b0/0x820 [tipc]
>>             <6>[a4e0fbd0]
>>                                     [c1540bac]
>>             tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>>                                     <6>[a4e0fc00] [c154164c]
>>                                     tipc_conn_sendmsg+0xdc/0x170 [tipc]
>>                                     <6>[a4e0fc30] [c1533c9c]
>>
>>             tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>>                                     <6>[a4e0fc70] [c153425c]
>>                                     tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>>                                     <6>[a4e0fca0] [c1534788]
>>                                     tipc_nametbl_remove_publ+0x58/0x110
>>             [tipc]
>>                                     <6>[a4e0fcd0] [c1534c48]
>>                                     tipc_nametbl_withdraw+0x68/0x140
>> [tipc]
>>                                     <6>[a4e0fd00] [c153cc24]
>>                                     tipc_nl_node_dump_link+0x1904/0x45d0
>>             [tipc]
>>                                     <6>[a4e0fd30] [c153d838]
>>                                     tipc_nl_node_dump_link+0x2518/0x45d0
>>             [tipc]
>>                                     <6>[a4e0fd70] [804f2870]
>>                                     sock_release+0x30/0xf0 <6>[a4e0fd80]
>>             [804f2944]
>>                                     sock_close+0x14/0x30
>>                                     <6>[a4e0fd90] [80105844]
>>             __fput+0x94/0x200
>>                                     <6>[a4e0fdb0] [8003dca4]
>>                                     task_work_run+0xd4/0x100 <6>[a4e0fdd0]
>>                                     [80023620] do_exit+0x280/0x980
>>                                     <6>[a4e0fe10] [80024c48]
>>             do_group_exit+0x48/0xb0
>>                                     <6>[a4e0fe30] [80030344]
>>                                     get_signal+0x244/0x4f0 <6>[a4e0fe80]
>>             [80007734]
>>                                     do_signal+0x34/0x1c0
>>                                     <6>[a4e0ff30] [800079a8]
>>                                     do_notify_resume+0x68/0x80
>> <6>[a4e0ff40]
>>                                     [8000fa1c] do_user_signal+0x74/0xc4
>>
>>
>>                                     From the stack dump it looks like
>>                                     tipc_named_reinit is trying to acquire
>>                                     nametbl_lock.
>>
>>                                     From looking at the call chain I can see
>>                                     that tipc_conn_sendmsg can end up calling
>>                                     conn_put, which will go on and call
>>                                     tipc_named_reinit via tipc_sock_release.
>>
>>                                     As tipc_nametbl_withdraw (from the
>>             stack dump)
>>                                     has already acquired the
>>                                     nametbl_lock, tipc_named_reinit
>>
>>                                     cannot get it and so the process
>> hangs.
>>
>>
>>                                     The call to tipc_sock_release (added in
>>                                     commit 333f796235a527,
>>                                     http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux-stable.git/commit/?id=333f796235a52727db7e0a13888045f3aa3d5335)
>>                                     seems to have changed the behaviour such
>>                                     that it tries to do a lot more when
>>                                     shutting the connection down.
>>
>>
>>                                     If there is other information I can
>>             provide
>>                                     please let me know.
>>
>>                                     Regards,
>>
>>                                     John
>>
------------------------------------------------------------------------------
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
