> -----Original Message-----
> From: Ying Xue [mailto:[email protected]]
> Sent: Wednesday, December 06, 2017 03:29
> To: Jon Maloy <[email protected]>; [email protected]
> Cc: [email protected]
> Subject: Re: [net-next PATCH 1/2] tipc: prevent swork item from queuing
> workqueue after connection is closed
>
>
>
> On 12/06/2017 03:31 AM, Jon Maloy wrote:
> >> If you remove the connection item from the idr list *before* you
> >> flush the queues and remove the socket, there is no risk that this
> >> will happen.
> >> If tipc_conn_sendmsg() obtains a pointer, we can be certain that the
> >> queue has not been flushed yet, and that this will not happen until
> >> conn_put() is called.
> >> And, you won't need to use the callback lock or touch the socket at all
> here.
> > It's even better: if you introduce the !kref_get_unless_zero(&con->kref)
> > test, there is no way tipc_conn_sendmsg() can obtain a pointer while
> > tipc_conn_kref_release() is being executed, i.e., after the queue is flushed.
> > I suggest you make both these changes.
>
>
> Please let me use below call chains to show what the race condition is:
>
> CPU 1:                                       CPU 2:
> ==========                                   ===============
> T1:  tipc_conn_sendmsg()
> T2:    con = tipc_conn_lookup()
> T3:    test_bit(CF_CONNECTED, &con->flags)
> T4:    =====> interrupted by another thread
> T5:                                          tipc_close_conn()
> T6:                                            test_and_clear_bit(CF_CONNECTED, &con->flags)
> T7:                                            tipc_unregister_callbacks()
> T8:                                            flush_work(&con->swork)
> T9:                                            flush_work(&con->rwork)
> T10:   queue_work(s->send_wq, &con->swork)
>
>
> The key point is that the whole sequence of checking whether the connection
> is closed (or being closed), either with test_bit(CF_CONNECTED, &con->flags)
> or via sk->sk_user_data, and then queuing the swork item onto the workqueue,
> is not atomic.
That is true. But looking at the kref value is atomic, because if that value
is zero nobody else has a pointer to the connection item. And vice versa, if
it is non-zero the queues cannot have been flushed yet.
I am not even sure that the CF_CONNECTED flag is needed anymore, but I need to
think more about that.
What I am suggesting is that we move most of the functionality that is
currently in tipc_close_conn() to tipc_conn_kref_release(). Something like
this:
static void tipc_close_conn(struct tipc_conn *con, bool flush)
{
	if (test_and_clear_bit(CF_CONNECTED, &con->flags))
		conn_put(con);
}

static void tipc_conn_kref_release(struct kref *kref)
{
	struct tipc_conn *con = container_of(kref, struct tipc_conn, kref);
	struct tipc_server *s = con->server;
	struct sockaddr_tipc *saddr = s->saddr;
	struct socket *sock = con->sock;
	struct sock *sk;

	spin_lock_bh(&s->idr_lock);
	idr_remove(&s->conn_idr, con->conid);
	s->idr_in_use--;
	spin_unlock_bh(&s->idr_lock);

	if (sock) {
		sk = sock->sk;
		if (test_bit(CF_SERVER, &con->flags)) {
			__module_get(sock->ops->owner);
			__module_get(sk->sk_prot_creator->owner);
		}
		flush_work(&con->swork);
		flush_work(&con->rwork);
		saddr->scope = -TIPC_NODE_SCOPE;
		kernel_bind(sock, (struct sockaddr *)saddr, sizeof(*saddr));
		if (con->conid)
			s->tipc_conn_release(con->conid, con->usr_data);
		sock_release(sock);
		con->sock = NULL;
	}

	tipc_clean_outqueues(con);
	kfree(con);
}
///jon
> Even if all the connection status checks pass and the connection is still
> connected before the swork item is queued, the thread running
> tipc_conn_sendmsg() may be interrupted at T4 and resume at T10, so the swork
> item might be queued onto the workqueue after that queue has already been
> flushed by tipc_close_conn() on CPU 2.
>
> That's why we need to use the sk_callback_lock read lock to guarantee that
> the entire process of checking the connection status and queuing the swork
> item cannot be interrupted.
>
> >
> > ///jon
> >