Hi Parth, Sorry for late response.
As I could not find your v3 version, I just give comments based on the version. On 11/22/2016 12:27 AM, Parthasarathy Bhuvaragan wrote: > Commit 333f796235a527 ("tipc: fix a race condition leading to > subscriber refcnt bug") reveals a soft lockup while acquiring > nametbl_lock. > > Before commit 333f796235a527, we call tipc_conn_shutdown() from > tipc_close_conn() in the context of tipc_topsrv_stop(). In that > context, we are allowed to grab the nametbl_lock. > > In commit 333f796235a527, i moved the tipc_conn_release (renamed from > tipc_conn_shutdown) to the connection refcount cleanup. Can you please confirm whether the soft lockup doesn't happen any more if we don't adjust the sequence of tipc_conn_release? If yes, I think we can propose other method to fix the issue solved by commit 333f796235a527, but in the new method it's unnecessary to adjust the order of tipc_conn_release. In fact the order of tipc_conn_shutdown, tipc_unregister_callbacks and tipc_conn_release is very important. When I wrote that code, I spent much time considering how to carefully close connection. In my opinion, the ideal order is still as belows: 1, Close connection; 2. Call tipc_unregister_callbacks to let sk->sk_user_data. As long as sk->sk_user_data is 0, no more data will be submitted to con->rwork/on->swork works. 3. Release socket. Regards, Ying This allows > either tipc_nametbl_withdraw() or tipc_topsrv_stop() to perform > tipc_sock_release(). > > Since tipc_exit_net() first calls tipc_topsrv_stop() and then > tipc_nametble_withdraw() increases the chances for the later to > perform the connection cleanup. > > The soft lockup occurs in the call chain of tipc_nametbl_withdraw(), > when it performs the tipc_conn_kref_release() as it tries to grab > nametbl_lock again while holding it already. > tipc_nametbl_withdraw() grabs nametbl_lock > tipc_nametbl_remove_publ() > tipc_subscrp_report_overlap() > tipc_subscrp_send_event() > tipc_conn_sendmsg() > << if (con->flags != CF_CONNECTED) we do conn_put(), > triggering the cleanup as refcount=0. >> > tipc_conn_kref_release > tipc_sock_release > tipc_conn_release > tipc_subscrb_delete > tipc_subscrp_delete > tipc_nametbl_unsubscribe << Soft Lockup >> > > Until now, tipc_server_stop() grabs and releases the idr_lock twice > for every connection. Once to find the connection and second to unset > connection flag, remove it from list and decrement the refcount. > The above lockup occurs when tipc_nametbl_withdraw() grabs the > connection in between the two, thus owning the connection > and triggering the cleanup while decrementing the refcount later. > > In this commit, we perform: > - call tipc_nametbl_withdraw() before tipc_topsrv_stop() to avoid: > 1. soft lockup > 2. its too late to actually notify the subscribers, as the topology > server might already have started shutting down. > - In tipc_server_stop(), we remove all the connections from connection > list in the scope of the idr_lock to prevent any other thread finding > a connection which has unset CF_CONNECTED in its flags. > > Fixes: 333f796235a52727 ("tipc: fix a race condition leading to subscriber > refcnt bug") > Signed-off-by: Parthasarathy Bhuvaragan > <parthasarathy.bhuvara...@ericsson.com> > --- > v2: commit message update. > cleanup tipc_server_stop() as per Ying Xue. > --- > net/tipc/core.c | 4 ++++ > net/tipc/net.c | 2 -- > net/tipc/server.c | 13 ++++++++----- > 3 files changed, 12 insertions(+), 7 deletions(-) > > diff --git a/net/tipc/core.c b/net/tipc/core.c > index 236b043a4156..3ea1452e2f06 100644 > --- a/net/tipc/core.c > +++ b/net/tipc/core.c > @@ -93,6 +93,10 @@ static int __net_init tipc_init_net(struct net *net) > > static void __net_exit tipc_exit_net(struct net *net) > { > + struct tipc_net *tn = net_generic(net, tipc_net_id); > + > + tipc_nametbl_withdraw(net, TIPC_CFG_SRV, tn->own_addr, 0, > + tn->own_addr); > tipc_topsrv_stop(net); > tipc_net_stop(net); > tipc_bcast_stop(net); > diff --git a/net/tipc/net.c b/net/tipc/net.c > index 28bf4feeb81c..92696cc6e763 100644 > --- a/net/tipc/net.c > +++ b/net/tipc/net.c > @@ -130,8 +130,6 @@ void tipc_net_stop(struct net *net) > if (!tn->own_addr) > return; > > - tipc_nametbl_withdraw(net, TIPC_CFG_SRV, tn->own_addr, 0, > - tn->own_addr); > rtnl_lock(); > tipc_bearer_stop(net); > tipc_node_stop(net); > diff --git a/net/tipc/server.c b/net/tipc/server.c > index 215849ce453d..54c23ecccd26 100644 > --- a/net/tipc/server.c > +++ b/net/tipc/server.c > @@ -623,16 +623,19 @@ int tipc_server_start(struct tipc_server *s) > void tipc_server_stop(struct tipc_server *s) > { > struct tipc_conn *con; > - int total = 0; > int id; > > spin_lock_bh(&s->idr_lock); > - for (id = 0; total < s->idr_in_use; id++) { > + for (id = 0; s->idr_in_use; id++) { > con = idr_find(&s->conn_idr, id); > - if (con) { > - total++; > + if (con && test_and_clear_bit(CF_CONNECTED, &con->flags)) { > + idr_remove(&s->conn_idr, con->conid); > + s->idr_in_use--; > + /* release spinlock as kernel_sock_shutdown can sleep. > + */ > spin_unlock_bh(&s->idr_lock); > - tipc_close_conn(con); > + kernel_sock_shutdown(con->sock, SHUT_RDWR); > + conn_put(con); > spin_lock_bh(&s->idr_lock); > } > } > ------------------------------------------------------------------------------ Developer Access Program for Intel Xeon Phi Processors Access to Intel Xeon Phi processor-based developer platforms. With one year of Intel Parallel Studio XE. Training and support from Colfax. Order your platform today.http://sdm.link/xeonphi _______________________________________________ tipc-discussion mailing list tipc-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/tipc-discussion