Re: netns refcnt leak for kernel accept sock
On (07/27/15 11:37), Cong Wang wrote: > > dlm uses a kernel TCP socket too, but it allocates a new socket and calls > ->accept() by itself. ;) sure, and rds does this in rds_tcp_accept_one() too. But the newsk being created in sk_clone_lock is the one on an incoming syn, i.e., the one that is saved up as part of listen backlog, to be returned later on the accept. I dont know the details of dlm- can you have one dlm instance per network namespace? That's where I'm running into this issue- when we try to have one rds listen socket per netns, and want to be able to do both - dynamically build/tear down new network namepsaces, without unloading rds_tcp globally - unload rds_tcp globally withouth tearing down individual netns. But perhaps we digress. Fundamental issue remains: newsk is the syn_recv version of the listen socket. If the listen socket is a "kernel" socket (kern == 1 for sk_alloc, and the listen socket thus has no sk_net_refcnt), the syn_recv socket must also have that behavior, so that it is cleaned up in the same way. --Sowmini -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: netns refcnt leak for kernel accept sock
On Mon, Jul 27, 2015 at 11:19 AM, Sowmini Varadhan wrote: > On (07/27/15 11:13), Cong Wang wrote: >> >> That refcnt should be released in sock destructor too, when the tcp >> connection is terminated. > > yes, but in my case, the listen socket is opened as part of > the ->init indirection in pernet_operations (thus it is a kernel socket) > and the expectation is that this listen socket, and any accept sockets > derived from it, will be closed in ->exit. > > But if the accept socket is treated as a uspace socket (thus holds a > get_net()) > then it will block cleanup_net() and the associated ->exit cleanup operations. > > This is probably not a problem for other systems like vxlan/gue/geneve etc > because they all use udp sockets, thus dont have the "accept" equivalent. > dlm uses a kernel TCP socket too, but it allocates a new socket and calls ->accept() by itself. ;) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: netns refcnt leak for kernel accept sock
On (07/27/15 11:13), Cong Wang wrote: > > That refcnt should be released in sock destructor too, when the tcp > connection is terminated. yes, but in my case, the listen socket is opened as part of the ->init indirection in pernet_operations (thus it is a kernel socket) and the expectation is that this listen socket, and any accept sockets derived from it, will be closed in ->exit. But if the accept socket is treated as a uspace socket (thus holds a get_net()) then it will block cleanup_net() and the associated ->exit cleanup operations. This is probably not a problem for other systems like vxlan/gue/geneve etc because they all use udp sockets, thus dont have the "accept" equivalent. But fundamentally, its wrong for a kspace listen socket to result in a "uspace" accept socket. > Given the fact that sk_destruct() checks for sk_net_refcnt, your > patch makes sense to me. But I am not sure how a TCP kernel > socket is supposed to use. Thanks for the confirmation - I think RDS is a bit of a maverick here in that it uses tcp sockets unlike vxlan etc. For those curious about RDS-TCP, I've actually updated the documentation at https://oss.oracle.com/projects/rds/dist/documentation/rds-3.1-spec.html recently. I hope that helps. --Sowmini -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: netns refcnt leak for kernel accept sock
On Mon, Jul 27, 2015 at 7:21 AM, Sowmini Varadhan wrote: > > I'm running into a netns refcnt issue, and I suspect that > eeb1bd5c has something to do with it (perhaps we need an > additional change in sk_clone_lock() after eeb1bd5c). > Here's the problem: > > When we create an syn_recv sock based on a kernel listen sock, we > take a get_net() ref with a stack similar to the one shown below. > Note that the parent (kernel, listen) sock itself has not taken > a get_net() ref, because it explicitly calls sock_create_kern(). > > get_net /* for the newsk */ > sk_clone_lock > inet_csk_clone_lock > tcp_create_openreq_child > tcp_v4_syn_recv_sock > tcp_check_req > tcp_v4_do_rcv > tcp_v4_rcv >: > > But it's not clear to me where this refcnt will be released: > in my case, I expect to create/cleanup kernel sockets as part > of ->init/->exit for my module, but because the accept socket > has a netns refcnt, it blocks cleanup_net(), thus my ->exit > pernet_subsys op cannot run and clean this up, and we have a leak. That refcnt should be released in sock destructor too, when the tcp connection is terminated. > > I think that sk_clone_lock() should only do a get_net() if the parent > is not a kernel socket (making this similar to sk_alloc()), i.e., > Given the fact that sk_destruct() checks for sk_net_refcnt, your patch makes sense to me. But I am not sure how a TCP kernel socket is supposed to use. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: netns refcnt leak for kernel accept sock
On (07/27/15 12:40), ebied...@xmission.com wrote: > sock_create_kern and friends are specialied interfaces for special > purposes. At a quick read through I don't think we have a single in > tree user doing with them what you are trying to do. That doesnt change the fact that the architecture is questionable. and my description should make it quite clear why this is so. > > Without seeing code using the interfaces in the way are trying to use > them I do not have enough information to comment intelligently. Ok, here you go. I'm still testing it, but there's enough there for you to see the bug quite clearly. Enjoy. I think my other mail had better "information to comment intelligently" but ymmv. --Sowmini diff --git a/net/core/sock.c b/net/core/sock.c index 08f16db..371d1b7 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1497,7 +1497,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority) sock_copy(newsk, sk); /* SANITY */ - get_net(sock_net(newsk)); + if (likely(newsk->sk_net_refcnt)) + get_net(sock_net(newsk)); sk_node_init(&newsk->sk_node); sock_lock_init(newsk); bh_lock_sock(newsk); diff --git a/net/rds/bind.c b/net/rds/bind.c index 4ebd29c..dd666fb 100644 --- a/net/rds/bind.c +++ b/net/rds/bind.c @@ -185,7 +185,8 @@ int rds_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) ret = 0; goto out; } - trans = rds_trans_get_preferred(sin->sin_addr.s_addr); + trans = rds_trans_get_preferred(sock_net(sock->sk), + sin->sin_addr.s_addr); if (!trans) { ret = -EADDRNOTAVAIL; rds_remove_bound(rs); diff --git a/net/rds/connection.c b/net/rds/connection.c index da6da57..273fa6c 100644 --- a/net/rds/connection.c +++ b/net/rds/connection.c @@ -117,7 +117,8 @@ static void rds_conn_reset(struct rds_connection *conn) * For now they are not garbage collected once they're created. They * are torn down as the module is removed, if ever. */ -static struct rds_connection *__rds_conn_create(__be32 laddr, __be32 faddr, +static struct rds_connection *__rds_conn_create(struct net *net, + __be32 laddr, __be32 faddr, struct rds_transport *trans, gfp_t gfp, int is_outgoing) { @@ -157,7 +158,7 @@ new_conn: conn->c_faddr = faddr; spin_lock_init(&conn->c_lock); conn->c_next_tx_seq = 1; - + write_pnet(&conn->c_net, net); init_waitqueue_head(&conn->c_waitq); INIT_LIST_HEAD(&conn->c_send_queue); INIT_LIST_HEAD(&conn->c_retrans); @@ -174,7 +175,7 @@ new_conn: * can bind to the destination address then we'd rather the messages * flow through loopback rather than either transport. */ - loop_trans = rds_trans_get_preferred(faddr); + loop_trans = rds_trans_get_preferred(net, faddr); if (loop_trans) { rds_trans_put(loop_trans); conn->c_loopback = 1; @@ -260,17 +261,19 @@ out: return conn; } -struct rds_connection *rds_conn_create(__be32 laddr, __be32 faddr, +struct rds_connection *rds_conn_create(struct net *net, + __be32 laddr, __be32 faddr, struct rds_transport *trans, gfp_t gfp) { - return __rds_conn_create(laddr, faddr, trans, gfp, 0); + return __rds_conn_create(net, laddr, faddr, trans, gfp, 0); } EXPORT_SYMBOL_GPL(rds_conn_create); -struct rds_connection *rds_conn_create_outgoing(__be32 laddr, __be32 faddr, +struct rds_connection *rds_conn_create_outgoing(struct net *net, + __be32 laddr, __be32 faddr, struct rds_transport *trans, gfp_t gfp) { - return __rds_conn_create(laddr, faddr, trans, gfp, 1); + return __rds_conn_create(net, laddr, faddr, trans, gfp, 1); } EXPORT_SYMBOL_GPL(rds_conn_create_outgoing); diff --git a/net/rds/ib.c b/net/rds/ib.c index ba2dffe..1381422 100644 --- a/net/rds/ib.c +++ b/net/rds/ib.c @@ -317,7 +317,7 @@ static void rds_ib_ic_info(struct socket *sock, unsigned int len, * allowed to influence which paths have priority. We could call userspace * asserting this policy "routing". */ -static int rds_ib_laddr_check(__be32 addr) +static int rds_ib_laddr_check(struct net *net, __be32 addr) { int ret; struct rdma_cm_id *cm_id; diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c index 0da2a45..c38d8a0 100644 --- a/net/rds/ib_cm.c +++ b/net/rds/ib_cm.c @@ -448,8 +448,8 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id, (unsigned long long)be64_to_cpu(lguid), (unsigned long long)be64_to_cpu(fguid)); - co
Re: netns refcnt leak for kernel accept sock
sock_create_kern and friends are specialied interfaces for special purposes. At a quick read through I don't think we have a single in tree user doing with them what you are trying to do. Without seeing code using the interfaces in the way are trying to use them I do not have enough information to comment intelligently. Eric Sowmini Varadhan writes: > I'm running into a netns refcnt issue, and I suspect that > eeb1bd5c has something to do with it (perhaps we need an > additional change in sk_clone_lock() after eeb1bd5c). > Here's the problem: > > When we create an syn_recv sock based on a kernel listen sock, we > take a get_net() ref with a stack similar to the one shown below. > Note that the parent (kernel, listen) sock itself has not taken > a get_net() ref, because it explicitly calls sock_create_kern(). > > get_net /* for the newsk */ > sk_clone_lock > inet_csk_clone_lock > tcp_create_openreq_child > tcp_v4_syn_recv_sock > tcp_check_req > tcp_v4_do_rcv > tcp_v4_rcv >: > > But it's not clear to me where this refcnt will be released: > in my case, I expect to create/cleanup kernel sockets as part > of ->init/->exit for my module, but because the accept socket > has a netns refcnt, it blocks cleanup_net(), thus my ->exit > pernet_subsys op cannot run and clean this up, and we have a leak. > > I think that sk_clone_lock() should only do a get_net() if the parent > is not a kernel socket (making this similar to sk_alloc()), i.e., > > diff --git a/net/core/sock.c b/net/core/sock.c > index 08f16db..371d1b7 100644 > --- a/net/core/sock.c > +++ b/net/core/sock.c > @@ -1497,7 +1497,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const > gf > sock_copy(newsk, sk); > > /* SANITY */ > - get_net(sock_net(newsk)); > + if (likely(newsk->sk_net_refcnt)) > + get_net(sock_net(newsk)); > sk_node_init(&newsk->sk_node); > sock_lock_init(newsk); > bh_lock_sock(newsk); > > Does this sound right? > > --Sowmini -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
netns refcnt leak for kernel accept sock
I'm running into a netns refcnt issue, and I suspect that eeb1bd5c has something to do with it (perhaps we need an additional change in sk_clone_lock() after eeb1bd5c). Here's the problem: When we create an syn_recv sock based on a kernel listen sock, we take a get_net() ref with a stack similar to the one shown below. Note that the parent (kernel, listen) sock itself has not taken a get_net() ref, because it explicitly calls sock_create_kern(). get_net /* for the newsk */ sk_clone_lock inet_csk_clone_lock tcp_create_openreq_child tcp_v4_syn_recv_sock tcp_check_req tcp_v4_do_rcv tcp_v4_rcv : But it's not clear to me where this refcnt will be released: in my case, I expect to create/cleanup kernel sockets as part of ->init/->exit for my module, but because the accept socket has a netns refcnt, it blocks cleanup_net(), thus my ->exit pernet_subsys op cannot run and clean this up, and we have a leak. I think that sk_clone_lock() should only do a get_net() if the parent is not a kernel socket (making this similar to sk_alloc()), i.e., diff --git a/net/core/sock.c b/net/core/sock.c index 08f16db..371d1b7 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1497,7 +1497,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gf sock_copy(newsk, sk); /* SANITY */ - get_net(sock_net(newsk)); + if (likely(newsk->sk_net_refcnt)) + get_net(sock_net(newsk)); sk_node_init(&newsk->sk_node); sock_lock_init(newsk); bh_lock_sock(newsk); Does this sound right? --Sowmini -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html