Hi Jon,

Thanks for your analysis and suggestion below.

I agree with you, except that tipc_nametbl_unsubscribe() should be moved from 
tipc_subscrp_kref_release() to tipc_subscrb_subscrp_delete() rather than to 
tipc_subscrp_delete(). The reason is that tipc_subscrp_delete() is not called 
under subscriber->lock, whereas if we move tipc_nametbl_unsubscribe() into 
tipc_subscrb_subscrp_delete(), the call can be protected by subscriber->lock.

I have also adjusted the places where the subscription refcount is taken, which 
makes the code simpler. I need a bit more time to validate these changes, so 
please bear with me.

Thanks,
Ying

-----Original Message-----
From: Jon Maloy [mailto:jon.ma...@ericsson.com] 
Sent: Thursday, February 16, 2017 1:05 PM
To: Xue, Ying
Cc: thompa....@gmail.com; tipc-discussion@lists.sourceforge.net; Parthasarathy 
Bhuvaragan
Subject: RE: [tipc-discussion] Nametable soft lockup

Hi Ying,
I ran the standard test suite, and found it didn't work even once. So the 
topology subscription is totally broken, and the fact that this was delivered 
to 'net' makes it even more urgent to fix it.
Since Partha will be out of office for the next three weeks, it must be you or 
me who fixes this.

My findings so far:

- Replacing the premature 'return' statements with gotos made the test go a 
little further, but we still get a deadlock.

- I identified the following scenario, but there are certainly others, since 
John is seeing a different, probably related one:

Thread 1:                                Thread 2:
---------                                ---------
nametbl_publish()
  nametbl_lock()
  nametbl_insert()
    subscrp_report_overlap()   // kref == 2
                                         tipc_close_conn()  // rcv error, a normal event
                                           conn_put()
                                             tipc_conn_kref_release()
                                               tipc_sock_release()
                                                 tipc_subscrb_delete()
                                                   tipc_subscrb_subscrp_delete()
                                                     spin_lock_bh(sub_lock)
                                                     subscrp_get()      // kref 2->3
                                                     spin_unlock(sub_lock)
                                                     tipc_subscrp_delete()
                                                       subscrp_put()    // kref 3->2
                                                     spin_lock(sub_lock)
                                                     subscrp_put()      // kref 2->1
                                                     subscrp_put()      // kref 1->0
                                                       tipc_subscrp_kref_release()
                                                         spin_lock(sub_lock)
                                                         nametbl_unsubscribe()
                                                           nametbl_lock()   // DEADLOCK!



Moving the tipc_nametbl_unsubscribe() call from tipc_subscrp_kref_release() to 
tipc_subscrp_delete() as follows:

static void tipc_subscrp_delete(struct tipc_subscription *sub)
{
        u32 timeout = htohl(sub->evt.s.timeout, sub->swap);

        if (timeout == TIPC_WAIT_FOREVER || del_timer(&sub->timer)) {
                tipc_nametbl_unsubscribe(sub);
                tipc_subscrp_put(sub);
        }
}

made the test suite go through, but I need a second opinion on this, and I 
think you know this part of the code better than I do now.
Maybe it could be something for John to test too?

BR
///jon


> -----Original Message-----
> From: Xue, Ying [mailto:ying....@windriver.com]
> Sent: Tuesday, February 14, 2017 05:16 AM
> To: John Thompson <thompa....@gmail.com>; tipc- 
> discuss...@lists.sourceforge.net
> Subject: Re: [tipc-discussion] Nametable soft lockup
> 
> Hi John,
> 
> Thanks for the report and good analysis!
> 
> You are absolutely right, and I will figure out how to fix the 
> potential deadlock issue. The second issue you pointed out below also really 
> exists. Good catch!
> 
> Regards,
> Ying
> 
> -----Original Message-----
> From: John Thompson [mailto:thompa....@gmail.com]
> Sent: Thursday, February 09, 2017 6:37 AM
> To: tipc-discussion@lists.sourceforge.net
> Subject: [tipc-discussion] Nametable soft lockup
> 
> Hi,
> 
> I have been using the patches Partha had provided for the nametable 
> soft lockup, and that I had tested.  This was seen when testing on a SMP 
> system.
> 
> Unfortunately I have come across another nametable soft lockup:
> 
> <0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [AIS listener:1591]
> <6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O) ipifwd(PO)
> <6>CPU: 0 PID: 1591 Comm: AIS listener Tainted: P           O
> <6>task: ae393600 ti: ae286000 task.ti: ae286000
> <6>NIP: 806952bc LR: c160bfe0 CTR: 80695280
> <6>REGS: ae287b40 TRAP: 0901   Tainted: P           O
> <6>MSR: 00029002 <CE,EE,ME>  CR: 48002484  XER: 00000000
> <6>GPR00: c160a64c ae287bf0 ae393600 a20f18ac 00000000 00000000 ae064fbc 00000030
> <6>GPR08: 01001006 00000001 00000001 00000006 80695280
> <6>NIP [806952bc] _raw_spin_lock_bh+0x3c/0x70
> <6>LR [c160bfe0] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
> <6>Call Trace:
> <6>[ae287c10] [c160a64c] tipc_named_reinit+0x33c/0x8a0 [tipc]
> <6>[ae287c30] [c160ad44] tipc_subscrp_report_overlap+0xc4/0xe0 [tipc]
> <6>[ae287c70] [c160b30c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
> <6>[ae287ca0] [c160b838] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
> <6>[ae287cd0] [c160bcf8] tipc_nametbl_withdraw+0x68/0x140 [tipc]
> <6>[ae287d00] [c1613cd4] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
> <6>[ae287d30] [c16148e8] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
> <6>[ae287d70] [804f5a40] sock_release+0x30/0xf0
> <6>[ae287d80] [804f5b14] sock_close+0x14/0x30
> <6>[ae287d90] [80105844] __fput+0x94/0x200
> <6>[ae287db0] [8003dca4] task_work_run+0xd4/0x100
> <6>[ae287dd0] [80023620] do_exit+0x280/0x980
> <6>[ae287e10] [80024c48] do_group_exit+0x48/0xb0
> <6>[ae287e30] [80030344] get_signal+0x244/0x4f0
> <6>[ae287e80] [80007734] do_signal+0x34/0x1c0
> <6>[ae287f30] [800079a8] do_notify_resume+0x68/0x80
> <6>[ae287f40] [8000fa1c] do_user_signal+0x74/0xc4
> 
> 
> I have gone through the code and I think I have found a place where 
> there is a potential soft lockup.
> The call chain is:
> tipc_nametbl_stop()                     Grabs nametbl_lock
>    tipc_purge_publications()
>       tipc_nameseq_remove_publ()
>          tipc_subscrp_report_overlap()
>             tipc_subscrp_put()          Calls kref_put; the last ref may
>                                         have been put by a different CPU
>                tipc_subscrp_kref_release()
>                   tipc_nametbl_unsubscribe()
>                      << lockup occurs as it grabs the
>                          nametbl_lock again >>
> 
> 
> Another possible issue is in tipc_subscrp_report_overlap(): there are 
> two early returns after tipc_subscrp_get() but before the matching 
> tipc_subscrp_put(). Could this leave the kref incorrectly elevated?
> 
> JT
> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most 
> engaging tech sites, SlashDot.org! http://sdm.link/slashdot
> _______________________________________________
> tipc-discussion mailing list
> tipc-discussion@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
> 
