Hi Partha,

I tested with the latest 3 patches last night and observed no soft lockups.

Thanks,
John


On Fri, Nov 25, 2016 at 11:50 AM, John Thompson <thompa....@gmail.com>
wrote:

> Hi Partha,
>
> I rebuilt afresh and retried the test, and got the same lockup kernel dumps.
> Yes, I have multiple TIPC clients subscribed to the topology server, at
> least 10 clients.
> They all use a subscription timeout of TIPC_WAIT_FOREVER.
>
> I will try the kernel command line parameter next week.
> JT
>
>
> On Fri, Nov 25, 2016 at 3:07 AM, Parthasarathy Bhuvaragan <
> parthasarathy.bhuvara...@ericsson.com> wrote:
>
>> Hi John,
>>
>> Do you have several tipc clients subscribed to topology server?
>> What subscription timeout do they use?
>>
>> Please enable kernel command line parameter:
>> softlockup_all_cpu_backtrace=1
>>
>> /Partha
>>
>> On 11/23/2016 11:04 PM, John Thompson wrote:
>>
>>> Hi Partha,
>>>
>>> I tested overnight with the 2 patches you provided yesterday.
>>> Testing is still showing problems. Here is one of the soft lockups; the
>>> other is the same as the one I sent the other day.
>>> I am going to redo my build, as I expected some change in behaviour with
>>> your patches.
>>>
>>> It is possible that I am doing some dumping of nodes or links, as I am
>>> not certain of all the code paths.
>>> I have found that we do a tipc-config -nt and tipc-config -ls in some
>>> situations, but neither should be initiated in this reboot case.
>>>
>>> <0>NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [pimd:1220]
>>>
>>> <0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [rpc.13:2419]
>>> <6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
>>> <6>CPU: 0 PID: 2419 Comm: rpc.13 Tainted: P           O
>>> <6>task: aed76d20 ti: ae70c000 task.ti: ae70c000
>>> <6>NIP: 8069257c LR: c13ebc4c CTR: 80692540
>>> <6>REGS: ae70dc20 TRAP: 0901   Tainted: P           O
>>> <6>MSR: 00029002 <CE,EE,ME>  CR: 42002484  XER: 20000000
>>> <6>GPR00: c13f3c34 ae70dcd0 aed76d20 ae55c8ec 00002711 00000005 8666592a 8666592b
>>> <6>GPR08: ae9dad20 00000001 00000001 00000000 80692540
>>> <6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
>>> <6>LR [c13ebc4c] tipc_nametbl_withdraw+0x4c/0x140 [tipc]
>>> <6>Call Trace:
>>> <6>[ae70dcd0] [a85d99a0] 0xa85d99a0 (unreliable)
>>> <6>[ae70dd00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>>> <6>[ae70dd30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>>> <6>[ae70dd70] [804f29e0] sock_release+0x30/0xf0
>>> <6>[ae70dd80] [804f2ab4] sock_close+0x14/0x30
>>> <6>[ae70dd90] [80105844] __fput+0x94/0x200
>>> <6>[ae70ddb0] [8003dca4] task_work_run+0xd4/0x100
>>> <6>[ae70ddd0] [80023620] do_exit+0x280/0x980
>>> <6>[ae70de10] [80024c48] do_group_exit+0x48/0xb0
>>> <6>[ae70de30] [80030344] get_signal+0x244/0x4f0
>>> <6>[ae70de80] [80007734] do_signal+0x34/0x1c0
>>> <6>[ae70df30] [800079a8] do_notify_resume+0x68/0x80
>>> <6>[ae70df40] [8000fa1c] do_user_signal+0x74/0xc4
>>> <6>--- interrupt: c00 at 0xfe490c4
>>> <6>    LR = 0xfe490a0
>>> <6>Instruction dump:
>>> <6>912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0 7c2004ac
>>> <6>2f890000 4dbe0020 7c210b78 81230000 <2f890000> 40befff4 7c421378 39290200
>>>
>>> <0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [AIS listener:1600]
>>> <6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
>>> <6>CPU: 2 PID: 1600 Comm: AIS listener Tainted: P           O
>>> <6>task: aee3ced0 ti: ae686000 task.ti: ae686000
>>> <6>NIP: 80692578 LR: c13ebf50 CTR: 80692540
>>> <6>REGS: ae687ad0 TRAP: 0901   Tainted: P           O
>>> <6>MSR: 00029002 <CE,EE,ME>  CR: 48002444  XER: 00000000
>>> <6>GPR00: c13ea408 ae687b80 aee3ced0 ae55c8ec 00000000 a30e7264 ae5e070c fffffffd
>>> <6>GPR08: ae72fbc8 00000001 00000001 00000004 80692540
>>> <6>NIP [80692578] _raw_spin_lock_bh+0x38/0x70
>>> <6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>>> <6>Call Trace:
>>> <6>[ae687b80] [800fa258] check_object+0xc8/0x270 (unreliable)
>>> <6>[ae687ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
>>> <6>[ae687bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
>>> <6>[ae687bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>>> <6>[ae687c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>>> <6>[ae687c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>>> <6>[ae687c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>>> <6>[ae687ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
>>> <6>[ae687cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
>>> <6>[ae687d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>>> <6>[ae687d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>>> <6>[ae687d70] [804f29e0] sock_release+0x30/0xf0
>>> <6>[ae687d80] [804f2ab4] sock_close+0x14/0x30
>>> <6>[ae687d90] [80105844] __fput+0x94/0x200
>>> <6>[ae687db0] [8003dca4] task_work_run+0xd4/0x100
>>> <6>[ae687dd0] [80023620] do_exit+0x280/0x980
>>> <6>[ae687e10] [80024c48] do_group_exit+0x48/0xb0
>>> <6>[ae687e30] [80030344] get_signal+0x244/0x4f0
>>> <6>[ae687e80] [80007734] do_signal+0x34/0x1c0
>>> <6>[ae687f30] [800079a8] do_notify_resume+0x68/0x80
>>> <6>[ae687f40] [8000fa1c] do_user_signal+0x74/0xc4
>>> <6>--- interrupt: c00 at 0xf4f3d08
>>> <6>    LR = 0xf4f3ce8
>>> <6>Instruction dump:
>>> <6>7d201828 912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0
>>> <6>7c2004ac 2f890000 4dbe0020 7c210b78 <81230000> 2f890000 40befff4 7c421378
>>>
>>> <0>Kernel panic - not syncing: softlockup: hung tasks
>>> <6>CPU: 0 PID: 2419 Comm: rpc.13 Tainted: P           O L
>>> <6>Call Trace:
>>> <6>[ae70da80] [80694e20] dump_stack+0x84/0xb0 (unreliable)
>>> <6>[ae70da90] [80692ca8] panic+0xd8/0x214
>>> <6>[ae70daf0] [800a0258] watchdog_timer_fn+0x2d8/0x2e0
>>> <6>[ae70db40] [8007ae58] __hrtimer_run_queues+0x118/0x1d0
>>> <6>[ae70db80] [8007b608] hrtimer_interrupt+0xd8/0x270
>>> <6>[ae70dbd0] [8000983c] __timer_interrupt+0xac/0x1b0
>>> <6>[ae70dbf0] [80009b70] timer_interrupt+0xb0/0xe0
>>> <6>[ae70dc10] [8000f450] ret_from_except+0x0/0x18
>>> <6>--- interrupt: 901 at _raw_spin_lock_bh+0x3c/0x70
>>> <6>    LR = tipc_nametbl_withdraw+0x4c/0x140 [tipc]
>>> <6>[ae70dcd0] [a85d99a0] 0xa85d99a0 (unreliable)
>>> <6>[ae70dd00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>>> <6>[ae70dd30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>>> <6>[ae70dd70] [804f29e0] sock_release+0x30/0xf0
>>> <6>[ae70dd80] [804f2ab4] sock_close+0x14/0x30
>>> <6>[ae70dd90] [80105844] __fput+0x94/0x200
>>> <6>[ae70ddb0] [8003dca4] task_work_run+0xd4/0x100
>>> <6>[ae70ddd0] [80023620] do_exit+0x280/0x980
>>> <6>[ae70de10] [80024c48] do_group_exit+0x48/0xb0
>>> <6>[ae70de30] [80030344] get_signal+0x244/0x4f0
>>> <6>[ae70de80] [80007734] do_signal+0x34/0x1c0
>>> <6>[ae70df30] [800079a8] do_notify_resume+0x68/0x80
>>> <6>[ae70df40] [8000fa1c] do_user_signal+0x74/0xc4
>>> <6>--- interrupt: c00 at 0xfe490c4
>>> <6>    LR = 0xfe490a0
>>>
>>>
>>>
>>> On Wed, Nov 23, 2016 at 9:16 PM, Parthasarathy Bhuvaragan
>>> <parthasarathy.bhuvara...@ericsson.com
>>> <mailto:parthasarathy.bhuvara...@ericsson.com>> wrote:
>>>
>>>     Hi John,
>>>
>>>     Ok. Can you test with the 2 patches I posted regarding the
>>> nametbl_lock?
>>>
>>>     /Partha
>>>
>>>     On 11/21/2016 11:32 PM, John Thompson wrote:
>>>
>>>         Hi Partha,
>>>
>>>         My test has 4 nodes, 2 of which are alternately rebooting.  When
>>>         the rebooted node rejoins, a few minutes pass and then the other
>>>         node is rebooted.
>>>         I am not printing out link stats and believe that the other code
>>>         is not doing so either, when nodes leave or rejoin.
>>>
>>>         JT
>>>
>>>
>>>         On Tue, Nov 22, 2016 at 2:22 AM, Parthasarathy Bhuvaragan
>>>         <parthasarathy.bhuvara...@ericsson.com
>>>         <mailto:parthasarathy.bhuvara...@ericsson.com>
>>>         <mailto:parthasarathy.bhuvara...@ericsson.com
>>>         <mailto:parthasarathy.bhuvara...@ericsson.com>>> wrote:
>>>
>>>             Hi,
>>>
>>>             There is another branch where a soft lockup on nametbl_lock
>>>         occurs.
>>>
>>>             tipc_named_rcv() Grabs nametbl_lock
>>>               tipc_update_nametbl() (publish/withdraw)
>>>                 tipc_node_subscribe()/unsubscribe()
>>>                   tipc_node_write_unlock()
>>>                      << lockup occurs if it needs to process NODE
>>>         UP/DOWN LINK
>>>             UP/DOWN, as it grabs nametbl_lock again >>
>>>
>>>             /Partha
>>>
>>>
>>>             On 11/21/2016 01:04 PM, Parthasarathy Bhuvaragan wrote:
>>>
>>>                 Hi,
>>>
>>>                 tipc_nametbl_withdraw() triggers the softlockup as it
>>>         tries to grab
>>>                 nametbl_lock twice if the node triggered a
>>>         TIPC_NOTIFY_LINK_DOWN
>>>                 event
>>>         while it is running. The erroneous call chain is:
>>>
>>>                   tipc_nametbl_withdraw() Grab nametbl_lock
>>>                     tipc_named_process_backlog()
>>>                       tipc_update_nametbl()
>>>                         if (dtype == WITHDRAWAL) tipc_node_unsubscribe()
>>>                           tipc_node_write_unlock()
>>>                             if (flags & TIPC_NOTIFY_LINK_DOWN)
>>>                 tipc_nametbl_withdraw()
>>>                                spin_lock_bh(&tn->nametbl_lock);  << Soft
>>>         Lockup >>
>>>
>>>                 Three callers which can cause this under module exit:
>>>
>>>                 Case1:
>>>                   tipc_exit_net()
>>>                     tipc_nametbl_withdraw() Grab nametbl_lock
>>>
>>>                 Case2:
>>>                   tipc_server_stop()
>>>                     tipc_conn_kref_release
>>>                       tipc_sock_release
>>>                         sock_release()
>>>                           tipc_release()
>>>                             tipc_sk_withdraw()
>>>                               tipc_nametbl_withdraw()
>>>
>>>                 Case3:
>>>                   tipc_server_stop()
>>>                     tipc_conn_kref_release()
>>>                       kernel_bind()
>>>                         tipc_bind()
>>>                           tipc_sk_withdraw()
>>>                             tipc_nametbl_withdraw()
>>>
>>>                 I will work on a solution for this.
>>>
>>>                 What kind of test were you performing when this occurred
>>>                 (link up/down)?
>>>                 Do you read link statistics periodically in your tests?
>>>
>>>                 /Partha
>>>
>>>                 On 11/21/2016 05:30 AM, John Thompson wrote:
>>>
>>>                     Hi Partha,
>>>
>>>                     I was doing some more testing today and have
>>> still
>>>                     observed the
>>>                     problem (contrary to what I had emailed earlier).
>>>
>>>                     Here is the kernel dump.
>>>
>>>                     <0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [Pluggable Serve:2221]
>>>                     <6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O)
>>>                     <6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P           O
>>>                     <6>task: ae54ced0 ti: aec42000 task.ti: aec42000
>>>                     <6>NIP: 8069257c LR: c13ebf50 CTR: 80692540
>>>                     <6>REGS: aec43ad0 TRAP: 0901   Tainted: P           O
>>>                     <6>MSR: 00029002 <CE,EE,ME>  CR: 48002444  XER: 00000000
>>>                     <6>GPR00: c13ea408 aec43b80 ae54ced0 a624690c 00000000 a6271d84 a39a60cc fffffffd
>>>                     <6>GPR08: aeefbbc8 00000001 00000001 00000004 80692540
>>>                     <6>NIP [8069257c] _raw_spin_lock_bh+0x3c/0x70
>>>                     <6>LR [c13ebf50] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>>>                     <6>Call Trace:
>>>                     <6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
>>>                     <6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
>>>                     <6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
>>>                     <6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>>>                     <6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>>>                     <6>[aec43c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>>>                     <6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>>>                     <6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
>>>                     <6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
>>>                     <6>[aec43d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>>>                     <6>[aec43d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>>>
>>>                 TIPC_CMD_SHOW_LINK_STATS or TIPC_NL_LINK_GET
>>>
>>>                     <6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
>>>                     <6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
>>>                     <6>[aec43d90] [80105844] __fput+0x94/0x200
>>>                     <6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
>>>                     <6>[aec43dd0] [80023620] do_exit+0x280/0x980
>>>                     <6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
>>>                     <6>[aec43e30] [80030344] get_signal+0x244/0x4f0
>>>                     <6>[aec43e80] [80007734] do_signal+0x34/0x1c0
>>>                     <6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
>>>                     <6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
>>>                     <6>--- interrupt: c00 at 0xf4f3d08
>>>                     <6>    LR = 0xf4f3ce8
>>>                     <6>Instruction dump:
>>>                     <6>912a0008 39400001 7d201828 2c090000 40820010 7d40192d 40a2fff0 7c2004ac
>>>                     <6>2f890000 4dbe0020 7c210b78 81230000 <2f890000> 40befff4 7c421378 7d201828
>>>                     <0>Kernel panic - not syncing: softlockup: hung tasks
>>>                     <6>CPU: 0 PID: 2221 Comm: Pluggable Serve Tainted: P           O L
>>>                     <6>Call Trace:
>>>                     <6>[aec43930] [80694e20] dump_stack+0x84/0xb0 (unreliable)
>>>                     <6>[aec43940] [80692ca8] panic+0xd8/0x214
>>>                     <6>[aec439a0] [800a0258] watchdog_timer_fn+0x2d8/0x2e0
>>>                     <6>[aec439f0] [8007ae58] __hrtimer_run_queues+0x118/0x1d0
>>>                     <6>[aec43a30] [8007b608] hrtimer_interrupt+0xd8/0x270
>>>                     <6>[aec43a80] [8000983c] __timer_interrupt+0xac/0x1b0
>>>                     <6>[aec43aa0] [80009b70] timer_interrupt+0xb0/0xe0
>>>                     <6>[aec43ac0] [8000f450] ret_from_except+0x0/0x18
>>>                     <6>--- interrupt: 901 at _raw_spin_lock_bh+0x3c/0x70
>>>                     <6>    LR = tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>>>                     <6>[aec43b80] [800fa258] check_object+0xc8/0x270 (unreliable)
>>>                     <6>[aec43ba0] [c13ea408] tipc_named_reinit+0xf8/0x820 [tipc]
>>>                     <6>[aec43bb0] [c13ea6c0] tipc_named_reinit+0x3b0/0x820 [tipc]
>>>                     <6>[aec43bd0] [c13f7bbc] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>>>                     <6>[aec43c00] [c13f865c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>>>                     <6>[aec43c30] [c13eacbc] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>>>                     <6>[aec43c70] [c13eb27c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>>>                     <6>[aec43ca0] [c13eb7a8] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
>>>                     <6>[aec43cd0] [c13ebc68] tipc_nametbl_withdraw+0x68/0x140 [tipc]
>>>                     <6>[aec43d00] [c13f3c34] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>>>                     <6>[aec43d30] [c13f4848] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>>>                     <6>[aec43d70] [804f29e0] sock_release+0x30/0xf0
>>>                     <6>[aec43d80] [804f2ab4] sock_close+0x14/0x30
>>>                     <6>[aec43d90] [80105844] __fput+0x94/0x200
>>>                     <6>[aec43db0] [8003dca4] task_work_run+0xd4/0x100
>>>                     <6>[aec43dd0] [80023620] do_exit+0x280/0x980
>>>                     <6>[aec43e10] [80024c48] do_group_exit+0x48/0xb0
>>>                     <6>[aec43e30] [80030344] get_signal+0x244/0x4f0
>>>                     <6>[aec43e80] [80007734] do_signal+0x34/0x1c0
>>>                     <6>[aec43f30] [800079a8] do_notify_resume+0x68/0x80
>>>                     <6>[aec43f40] [8000fa1c] do_user_signal+0x74/0xc4
>>>                     <6>--- interrupt: c00 at 0xf4f3d08
>>>                     <6>    LR = 0xf4f3ce8
>>>
>>>
>>>                     On Mon, Nov 21, 2016 at 9:59 AM, John Thompson
>>>                     <thompa....@gmail.com <mailto:thompa....@gmail.com>
>>>         <mailto:thompa....@gmail.com <mailto:thompa....@gmail.com>>
>>>                     <mailto:thompa....@gmail.com
>>>         <mailto:thompa....@gmail.com> <mailto:thompa....@gmail.com
>>>         <mailto:thompa....@gmail.com>>>>
>>>                     wrote:
>>>
>>>                         Hi Partha,
>>>
>>>                         In my testing over the weekend the patch
>>>         performed well
>>>                     - I didn't
>>>                         see any kernel dumps due to this issue.
>>>
>>>                         Thanks for the quick response.
>>>                         JT
>>>
>>>
>>>                         On Fri, Nov 18, 2016 at 10:34 AM, John Thompson
>>>                         <thompa....@gmail.com
>>>         <mailto:thompa....@gmail.com> <mailto:thompa....@gmail.com
>>>         <mailto:thompa....@gmail.com>>
>>>                     <mailto:thompa....@gmail.com
>>>         <mailto:thompa....@gmail.com> <mailto:thompa....@gmail.com
>>>         <mailto:thompa....@gmail.com>>>>
>>>                     wrote:
>>>
>>>                             Hi,
>>>
>>>                             I will be able to have some test results by
>>> the
>>>                     start of next
>>>                             week on the first patch.
>>>
>>>                             Regards,
>>>                             JT
>>>
>>>
>>>                             On Thu, Nov 17, 2016 at 11:27 PM, Ying Xue
>>>                             <ying....@windriver.com
>>>         <mailto:ying....@windriver.com>
>>>                     <mailto:ying....@windriver.com
>>>         <mailto:ying....@windriver.com>>
>>>                     <mailto:ying....@windriver.com
>>>         <mailto:ying....@windriver.com>
>>>                     <mailto:ying....@windriver.com
>>>         <mailto:ying....@windriver.com>>>> wrote:
>>>
>>>                                 On 11/17/2016 07:04 AM, John Thompson
>>> wrote:
>>>
>>>                                     Hi Partha / Ying,
>>>
>>>                                     I will try out the patch and let you
>>>         know
>>>                     how it goes.
>>>                                     I also note about providing the
>>>         other CPU
>>>                     core dumps -
>>>                                     in one of my cases I
>>>                                     didn't have them, but in others I did;
>>>                                     they were interleaved and so were
>>>                                     difficult to interpret.
>>>
>>>
>>>                                 Thanks, it's unnecessary for us to collect
>>>                                 more logs, as the soft lockup scenario
>>>                                 should be just what Partha described.
>>>
>>>                                 Regards,
>>>                                 Ying
>>>
>>>
>>>
>>>                                     Thanks for getting a patch together
>>>         so quickly.
>>>
>>>                                     JT
>>>
>>>                                     On Wed, Nov 16, 2016 at 10:23 PM,
>>>                     Parthasarathy Bhuvaragan <
>>>
>>>         parthasarathy.bhuvara...@ericsson.com
>>>         <mailto:parthasarathy.bhuvara...@ericsson.com>
>>>                     <mailto:parthasarathy.bhuvara...@ericsson.com
>>>         <mailto:parthasarathy.bhuvara...@ericsson.com>>
>>>
>>>                     <mailto:parthasarathy.bhuvara...@ericsson.com
>>>         <mailto:parthasarathy.bhuvara...@ericsson.com>
>>>
>>>                     <mailto:parthasarathy.bhuvara...@ericsson.com
>>>         <mailto:parthasarathy.bhuvara...@ericsson.com>>>> wrote:
>>>
>>>                                         Hi Ying / John,
>>>
>>>                                         The soft lockup is in the call chain
>>>                                         of tipc_nametbl_withdraw(), when it
>>>                                         performs tipc_conn_kref_release() as
>>>                                         it tries to grab nametbl_lock again
>>>                                         while already holding it.
>>>
>>>                                             tipc_nametbl_withdraw
>>>
>>>         spin_lock_bh(&tn->nametbl_lock);
>>>                                               tipc_nametbl_remove_publ
>>>
>>>  spin_lock_bh(&seq->lock);
>>>                                                  tipc_nameseq_remove_publ
>>>
>>>          tipc_subscrp_report_overlap
>>>
>>>  tipc_subscrp_send_event
>>>                                                         tipc_conn_sendmsg
>>>
>>>                                         << Here, the test_bit(CF_CONNECTED,
>>>                                         &con->flags) check fails, leading to
>>>                                         the else case where we do a
>>>                                         conn_put(), and that triggers the
>>>                                         cleanup as the refcount reached 0,
>>>                                         leading to the call chain below: >>
>>>                                         tipc_conn_kref_release
>>>                                            tipc_sock_release
>>>                                              tipc_conn_release
>>>                                                 tipc_subscrb_delete
>>>                                                    tipc_subscrp_delete
>>>
>>>         tipc_nametbl_unsubscribe
>>>
>>>                      spin_lock_bh(&tn->nametbl_lock);
>>>                                         << !! Soft Lockup >>
>>>
>>>                                         One cause is that tipc_exit_net()
>>>                                         first calls tipc_topsrv_stop() and
>>>                                         then tipc_nametbl_withdraw() in the
>>>                                         scope of tipc_net_stop().
>>>
>>>                                         The above chain will only occur
>>> in a
>>>                     narrow window
>>>                                         for a given connection:
>>>                                         CPU#1:
>>>                                         tipc_nametbl_withdraw() manages
>>>         to perform
>>>                                         tipc_conn_lookup() and steps
>>>                                         the refcount to 2, while in
>>>         CPU#2 the
>>>                     following occurs:
>>>                                         CPU#2:
>>>                                         tipc_server_stop() calls
>>>                     tipc_close_conn(con). This
>>>                                         performs a conn_put()
>>>                                         decrementing refcount to 1.
>>>                                         Now, CPU#1 continues and detects
>>>         that
>>>                     the connection
>>>                                         is not CF_CONNECTED
>>>                                         and does a conn_put(),
>>>         triggering the
>>>                     release callback.
>>>
>>>                                         Before commit 333f796235a527, the
>>>                                         above won't happen.
>>>
>>>                                         /Partha
>>>
>>>
>>>                                         On 11/15/2016 04:11 PM, Xue,
>>>         Ying wrote:
>>>
>>>                                             Hi John,
>>>
>>>                                             Regarding the stack trace you
>>>                     provided below, I
>>>                                             get the two potential
>>>                                             call chains:
>>>
>>>                                             tipc_nametbl_withdraw
>>>
>>>         spin_lock_bh(&tn->nametbl_lock);
>>>                                               tipc_nametbl_remove_publ
>>>
>>>  spin_lock_bh(&seq->lock);
>>>                                                  tipc_nameseq_remove_publ
>>>
>>>          tipc_subscrp_report_overlap
>>>
>>>  tipc_subscrp_send_event
>>>                                                         tipc_conn_sendmsg
>>>
>>>                      spin_lock_bh(&con->outqueue_lock);
>>>
>>>          list_add_tail(&e->list,
>>>                                             &con->outqueue);
>>>
>>>
>>>                                             tipc_topsrv_stop
>>>                                               tipc_server_stop
>>>                                                 tipc_close_conn
>>>                                                   kernel_sock_shutdown
>>>                                                     tipc_subscrb_delete
>>>
>>>                     spin_lock_bh(&subscriber->lock);
>>>
>>>         tipc_nametbl_unsubscribe(sub);
>>>
>>>                      spin_lock_bh(&tn->nametbl_lock);
>>>
>>>                                             Although I suspect a reversed
>>>                                             lock order is leading to the
>>>                                             soft lockup, I am still unable
>>>                                             to understand which lock,
>>>                                             together with nametbl_lock, is
>>>                                             taken in reverse order on the
>>>                                             two different paths above.
>>>                                             However, you just gave us
>>>         the log
>>>                     printed on
>>>                                             CPU#2, but the logs
>>>                                             outputted by other cores are
>>>         also
>>>                     important.  So
>>>                                             if possible, please share
>>>                                             them with us.
>>>
>>>                                             By the way, I agree with
>>>         you, and it
>>>                     seems that
>>>                                             commit 333f796235a527 is
>>>                                             related to the soft lockup.
>>>
>>>                                             Regards,
>>>                                             Ying
>>>
>>>                                             -----Original Message-----
>>>                                             From: John Thompson
>>>                     [mailto:thompa....@gmail.com
>>>         <mailto:thompa....@gmail.com> <mailto:thompa....@gmail.com
>>>         <mailto:thompa....@gmail.com>>
>>>                                             <mailto:thompa....@gmail.com
>>>         <mailto:thompa....@gmail.com>
>>>                     <mailto:thompa....@gmail.com
>>>         <mailto:thompa....@gmail.com>>>]
>>>                                             Sent: Tuesday, November 15,
>>>         2016 8:01 AM
>>>                                             To:
>>>                     tipc-discussion@lists.sourceforge.net
>>>         <mailto:tipc-discussion@lists.sourceforge.net>
>>>                     <mailto:tipc-discussion@lists.sourceforge.net
>>>         <mailto:tipc-discussion@lists.sourceforge.net>>
>>>
>>>                     <mailto:tipc-discussion@lists.sourceforge.net
>>>         <mailto:tipc-discussion@lists.sourceforge.net>
>>>
>>>                     <mailto:tipc-discussion@lists.sourceforge.net
>>>         <mailto:tipc-discussion@lists.sourceforge.net>>>
>>>                                             Subject: [tipc-discussion]
>>> v4.7:
>>>                     soft lockup
>>>                                             when releasing a socket
>>>
>>>                                             Hi,
>>>
>>>                                             I am seeing an occasional
>>> kernel
>>>                     soft lockup.  I
>>>                                             have TIPC v4.7 and the
>>>                                             kernel dump occurs when the
>>>         system
>>>                     is going down
>>>                                             for a reboot.
>>>
>>> The kernel dump is:
>>>
>>> <0>NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [exfx:1474]
>>> <6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O) ipifwd(PO)
>>> ...
>>> <6>
>>> <6>GPR00: c15333e8 a4e0fb80 a4ee3600 a51748ac 00000000 ae475024 a537feec fffffffd
>>> <6>GPR08: a2197408 00000001 00000001 00000004 80691c00
>>> <6>NIP [80691c40] _raw_spin_lock_bh+0x40/0x70
>>> <6>LR [c1534f30] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
>>> <6>Call Trace:
>>> <6>[a4e0fba0] [c15333e8] tipc_named_reinit+0xf8/0x820 [tipc]
>>> <6>[a4e0fbb0] [c15336a0] tipc_named_reinit+0x3b0/0x820 [tipc]
>>> <6>[a4e0fbd0] [c1540bac] tipc_nl_publ_dump+0x50c/0xed0 [tipc]
>>> <6>[a4e0fc00] [c154164c] tipc_conn_sendmsg+0xdc/0x170 [tipc]
>>> <6>[a4e0fc30] [c1533c9c] tipc_subscrp_report_overlap+0xbc/0xd0 [tipc]
>>> <6>[a4e0fc70] [c153425c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
>>> <6>[a4e0fca0] [c1534788] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
>>> <6>[a4e0fcd0] [c1534c48] tipc_nametbl_withdraw+0x68/0x140 [tipc]
>>> <6>[a4e0fd00] [c153cc24] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
>>> <6>[a4e0fd30] [c153d838] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
>>> <6>[a4e0fd70] [804f2870] sock_release+0x30/0xf0
>>> <6>[a4e0fd80] [804f2944] sock_close+0x14/0x30
>>> <6>[a4e0fd90] [80105844] __fput+0x94/0x200
>>> <6>[a4e0fdb0] [8003dca4] task_work_run+0xd4/0x100
>>> <6>[a4e0fdd0] [80023620] do_exit+0x280/0x980
>>> <6>[a4e0fe10] [80024c48] do_group_exit+0x48/0xb0
>>> <6>[a4e0fe30] [80030344] get_signal+0x244/0x4f0
>>> <6>[a4e0fe80] [80007734] do_signal+0x34/0x1c0
>>> <6>[a4e0ff30] [800079a8] do_notify_resume+0x68/0x80
>>> <6>[a4e0ff40] [8000fa1c] do_user_signal+0x74/0xc4
>>>
>>>
>>> From the stack dump it looks like tipc_named_reinit is trying to
>>> acquire nametbl_lock. From the call chain I can see that
>>> tipc_conn_sendmsg can end up calling conn_put, which goes on to call
>>> tipc_named_reinit via tipc_sock_release.
>>>
>>> As tipc_nametbl_withdraw (from the stack dump) has already acquired
>>> nametbl_lock, tipc_named_reinit cannot get it, so the process hangs.
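
[Editorial note: the chain described above is a self-deadlock on a non-recursive lock: the withdraw path takes nametbl_lock and, before releasing it, re-enters a path that tries to take the same lock again. A minimal user-space sketch of that pattern, in Python for illustration only; the function names mirror the TIPC ones, but this is not the kernel code.]

```python
import threading

# Stand-in for TIPC's nametbl_lock: a plain (non-recursive) lock.
# Like a kernel spinlock, it cannot be taken twice on the same path.
nametbl_lock = threading.Lock()

def tipc_nametbl_unsubscribe():
    # Second acquisition on the same call chain. A blocking acquire
    # here would spin forever, which is the soft lockup; we use a
    # non-blocking try so this sketch terminates and reports it.
    if not nametbl_lock.acquire(blocking=False):
        return "deadlock: nametbl_lock already held on this path"
    nametbl_lock.release()
    return "ok"

def tipc_nametbl_withdraw():
    with nametbl_lock:  # first acquisition
        # ... tipc_conn_sendmsg() -> conn_put() -> tipc_sock_release() ...
        return tipc_nametbl_unsubscribe()

print(tipc_nametbl_withdraw())
# -> deadlock: nametbl_lock already held on this path
```

The usual fixes for this pattern are either to drop the lock before invoking the callback chain, or to defer the release work (e.g. to a work queue) so it runs outside the locked region.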
>>>
>>>
>>> The call to tipc_sock_release (added in commit 333f796235a527,
>>> <http://git.atlnz.lc/cgit/cgit.cgi/upstream_imports/linux-stable.git/commit/?id=333f796235a52727db7e0a13888045f3aa3d5335>)
>>> seems to have changed the behaviour such that it tries to do a lot
>>> more when shutting the connection down.
>>>
>>> If there is other information I can provide, please let me know.
>>>
>>> Regards,
>>> John
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> _______________________________________________
>>> tipc-discussion mailing list
>>> tipc-discussion@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>>>
>
------------------------------------------------------------------------------
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
