Re: netns refcnt leak for kernel accept sock

2015-07-27 Thread Sowmini Varadhan
On (07/27/15 11:37), Cong Wang wrote:
> 
> dlm uses a kernel TCP socket too, but it allocates a new socket and calls
> ->accept() by itself. ;)

sure, and rds does this in rds_tcp_accept_one() too.

But the newsk being created in sk_clone_lock  is the one on an 
incoming syn, i.e., the one that is saved up as part of listen backlog, 
to be returned later on the accept.

I dont know the details of dlm- can you have one dlm instance per
network namespace? That's where I'm running into this issue- when we
try to have one rds listen socket per netns, and want to be able to
do both
- dynamically build/tear down new network namepsaces, without 
  unloading rds_tcp globally
- unload rds_tcp globally withouth tearing down individual netns.

But perhaps we digress.

Fundamental issue remains: newsk is the syn_recv version of the
listen socket. If the listen socket is a "kernel" socket (kern == 1
for sk_alloc, and the listen socket thus has no sk_net_refcnt),
the syn_recv socket must also have that behavior, so that it is
cleaned up in the same way.

--Sowmini

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: netns refcnt leak for kernel accept sock

2015-07-27 Thread Cong Wang
On Mon, Jul 27, 2015 at 11:19 AM, Sowmini Varadhan
 wrote:
> On (07/27/15 11:13), Cong Wang wrote:
>>
>> That refcnt should be released in sock destructor too, when the tcp
>> connection is terminated.
>
> yes, but in my case, the listen socket is opened as part of
> the ->init indirection in pernet_operations (thus it is a kernel socket)
> and the expectation is that this listen socket, and any accept sockets
> derived from it, will be closed in ->exit.
>
> But if the accept socket is treated as a uspace socket (thus holds a 
> get_net())
> then it will block cleanup_net() and the associated ->exit cleanup operations.
>
> This is probably not a problem for other systems like vxlan/gue/geneve etc
> because they all use udp sockets, thus dont have the "accept" equivalent.
>

dlm uses a kernel TCP socket too, but it allocates a new socket and calls
->accept() by itself. ;)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: netns refcnt leak for kernel accept sock

2015-07-27 Thread Sowmini Varadhan
On (07/27/15 11:13), Cong Wang wrote:
> 
> That refcnt should be released in sock destructor too, when the tcp
> connection is terminated.

yes, but in my case, the listen socket is opened as part of
the ->init indirection in pernet_operations (thus it is a kernel socket)
and the expectation is that this listen socket, and any accept sockets
derived from it, will be closed in ->exit.

But if the accept socket is treated as a uspace socket (thus holds a get_net())
then it will block cleanup_net() and the associated ->exit cleanup operations.

This is probably not a problem for other systems like vxlan/gue/geneve etc
because they all use udp sockets, thus dont have the "accept" equivalent.

But fundamentally, its wrong for a kspace listen socket to result in a
"uspace" accept socket.

> Given the fact that sk_destruct() checks for sk_net_refcnt, your
> patch makes sense to me. But I am not sure how a TCP kernel
> socket is supposed to use.

Thanks for the confirmation - I think RDS is a bit of a maverick here in
that it uses tcp sockets unlike vxlan etc.
For those curious about RDS-TCP, I've actually updated the documentation at
https://oss.oracle.com/projects/rds/dist/documentation/rds-3.1-spec.html
recently. I hope that helps.

--Sowmini
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: netns refcnt leak for kernel accept sock

2015-07-27 Thread Cong Wang
On Mon, Jul 27, 2015 at 7:21 AM, Sowmini Varadhan
 wrote:
>
> I'm running into a netns refcnt issue, and I suspect that
> eeb1bd5c has something to do with it (perhaps we need an
> additional change in sk_clone_lock() after eeb1bd5c).
> Here's the problem:
>
> When we create an syn_recv sock based on a kernel listen sock, we
> take a get_net() ref  with a stack similar to the one shown below.
> Note that the parent (kernel, listen) sock itself has not taken
> a get_net() ref, because it explicitly calls sock_create_kern().
>
>   get_net /* for the newsk */
>   sk_clone_lock
>   inet_csk_clone_lock
>   tcp_create_openreq_child
>   tcp_v4_syn_recv_sock
>   tcp_check_req
>   tcp_v4_do_rcv
>   tcp_v4_rcv
>:
>
> But it's not clear to me where this refcnt will be released:
> in my case, I expect to create/cleanup kernel sockets as part
> of ->init/->exit for my module, but because the accept socket
> has a netns refcnt, it blocks cleanup_net(), thus my ->exit
> pernet_subsys op cannot run and clean this up, and we have a leak.


That refcnt should be released in sock destructor too, when the tcp
connection is terminated.

>
> I think that sk_clone_lock() should only do a get_net() if the parent
> is not a kernel socket (making this similar to sk_alloc()), i.e.,
>

Given the fact that sk_destruct() checks for sk_net_refcnt, your
patch makes sense to me. But I am not sure how a TCP kernel
socket is supposed to use.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: netns refcnt leak for kernel accept sock

2015-07-27 Thread Sowmini Varadhan
On (07/27/15 12:40), ebied...@xmission.com wrote:
> sock_create_kern and friends are specialied interfaces for special
> purposes.  At a quick read through I don't think we have a single in
> tree user doing with them what you are trying to do.

That doesnt change the fact that the architecture is questionable.
and my description should make it quite clear why this is so.

> 
> Without seeing code using the interfaces in the way are trying to use
> them I do not have enough information to comment intelligently.

Ok, here you go.

I'm still testing it, but there's enough there for you to see the bug
quite clearly.

Enjoy. I think my other mail had better "information to comment intelligently"
but ymmv.

--Sowmini

diff --git a/net/core/sock.c b/net/core/sock.c
index 08f16db..371d1b7 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1497,7 +1497,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const 
gfp_t priority)
sock_copy(newsk, sk);
 
/* SANITY */
-   get_net(sock_net(newsk));
+   if (likely(newsk->sk_net_refcnt))
+   get_net(sock_net(newsk));
sk_node_init(&newsk->sk_node);
sock_lock_init(newsk);
bh_lock_sock(newsk);
diff --git a/net/rds/bind.c b/net/rds/bind.c
index 4ebd29c..dd666fb 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -185,7 +185,8 @@ int rds_bind(struct socket *sock, struct sockaddr *uaddr, 
int addr_len)
ret = 0;
goto out;
}
-   trans = rds_trans_get_preferred(sin->sin_addr.s_addr);
+   trans = rds_trans_get_preferred(sock_net(sock->sk),
+   sin->sin_addr.s_addr);
if (!trans) {
ret = -EADDRNOTAVAIL;
rds_remove_bound(rs);
diff --git a/net/rds/connection.c b/net/rds/connection.c
index da6da57..273fa6c 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -117,7 +117,8 @@ static void rds_conn_reset(struct rds_connection *conn)
  * For now they are not garbage collected once they're created.  They
  * are torn down as the module is removed, if ever.
  */
-static struct rds_connection *__rds_conn_create(__be32 laddr, __be32 faddr,
+static struct rds_connection *__rds_conn_create(struct net *net,
+  __be32 laddr, __be32 faddr,
   struct rds_transport *trans, gfp_t gfp,
   int is_outgoing)
 {
@@ -157,7 +158,7 @@ new_conn:
conn->c_faddr = faddr;
spin_lock_init(&conn->c_lock);
conn->c_next_tx_seq = 1;
-
+   write_pnet(&conn->c_net, net);
init_waitqueue_head(&conn->c_waitq);
INIT_LIST_HEAD(&conn->c_send_queue);
INIT_LIST_HEAD(&conn->c_retrans);
@@ -174,7 +175,7 @@ new_conn:
 * can bind to the destination address then we'd rather the messages
 * flow through loopback rather than either transport.
 */
-   loop_trans = rds_trans_get_preferred(faddr);
+   loop_trans = rds_trans_get_preferred(net, faddr);
if (loop_trans) {
rds_trans_put(loop_trans);
conn->c_loopback = 1;
@@ -260,17 +261,19 @@ out:
return conn;
 }
 
-struct rds_connection *rds_conn_create(__be32 laddr, __be32 faddr,
+struct rds_connection *rds_conn_create(struct net *net,
+  __be32 laddr, __be32 faddr,
   struct rds_transport *trans, gfp_t gfp)
 {
-   return __rds_conn_create(laddr, faddr, trans, gfp, 0);
+   return __rds_conn_create(net, laddr, faddr, trans, gfp, 0);
 }
 EXPORT_SYMBOL_GPL(rds_conn_create);
 
-struct rds_connection *rds_conn_create_outgoing(__be32 laddr, __be32 faddr,
+struct rds_connection *rds_conn_create_outgoing(struct net *net,
+  __be32 laddr, __be32 faddr,
   struct rds_transport *trans, gfp_t gfp)
 {
-   return __rds_conn_create(laddr, faddr, trans, gfp, 1);
+   return __rds_conn_create(net, laddr, faddr, trans, gfp, 1);
 }
 EXPORT_SYMBOL_GPL(rds_conn_create_outgoing);
 
diff --git a/net/rds/ib.c b/net/rds/ib.c
index ba2dffe..1381422 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -317,7 +317,7 @@ static void rds_ib_ic_info(struct socket *sock, unsigned 
int len,
  * allowed to influence which paths have priority.  We could call userspace
  * asserting this policy "routing".
  */
-static int rds_ib_laddr_check(__be32 addr)
+static int rds_ib_laddr_check(struct net *net, __be32 addr)
 {
int ret;
struct rdma_cm_id *cm_id;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 0da2a45..c38d8a0 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -448,8 +448,8 @@ int rds_ib_cm_handle_connect(struct rdma_cm_id *cm_id,
 (unsigned long long)be64_to_cpu(lguid),
 (unsigned long long)be64_to_cpu(fguid));
 
-   co

Re: netns refcnt leak for kernel accept sock

2015-07-27 Thread Eric W. Biederman

sock_create_kern and friends are specialied interfaces for special
purposes.  At a quick read through I don't think we have a single in
tree user doing with them what you are trying to do.

Without seeing code using the interfaces in the way are trying to use
them I do not have enough information to comment intelligently.

Eric

Sowmini Varadhan  writes:
> I'm running into a netns refcnt issue, and I suspect that 
> eeb1bd5c has something to do with it (perhaps we need an 
> additional change in sk_clone_lock() after eeb1bd5c). 
> Here's the problem:
>
> When we create an syn_recv sock based on a kernel listen sock, we
> take a get_net() ref  with a stack similar to the one shown below.
> Note that the parent (kernel, listen) sock itself has not taken
> a get_net() ref, because it explicitly calls sock_create_kern().
>
>   get_net /* for the newsk */
>   sk_clone_lock
>   inet_csk_clone_lock
>   tcp_create_openreq_child
>   tcp_v4_syn_recv_sock
>   tcp_check_req
>   tcp_v4_do_rcv
>   tcp_v4_rcv 
>:
>
> But it's not clear to me where this refcnt will be released: 
> in my case, I expect to create/cleanup kernel sockets as part 
> of ->init/->exit for my module, but because the accept socket 
> has a netns refcnt, it blocks cleanup_net(), thus my ->exit 
> pernet_subsys op cannot run and clean this up, and we have a leak.
>
> I think that sk_clone_lock() should only do a get_net() if the parent
> is not a kernel socket (making this similar to sk_alloc()), i.e.,
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 08f16db..371d1b7 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1497,7 +1497,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const 
> gf
> sock_copy(newsk, sk);
>
> /* SANITY */
> -   get_net(sock_net(newsk));
> +   if (likely(newsk->sk_net_refcnt))
> +   get_net(sock_net(newsk));
> sk_node_init(&newsk->sk_node);
> sock_lock_init(newsk);
> bh_lock_sock(newsk);
>
> Does this sound right?
>
> --Sowmini
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


netns refcnt leak for kernel accept sock

2015-07-27 Thread Sowmini Varadhan

I'm running into a netns refcnt issue, and I suspect that 
eeb1bd5c has something to do with it (perhaps we need an 
additional change in sk_clone_lock() after eeb1bd5c). 
Here's the problem:

When we create an syn_recv sock based on a kernel listen sock, we
take a get_net() ref  with a stack similar to the one shown below.
Note that the parent (kernel, listen) sock itself has not taken
a get_net() ref, because it explicitly calls sock_create_kern().

  get_net /* for the newsk */
  sk_clone_lock
  inet_csk_clone_lock
  tcp_create_openreq_child
  tcp_v4_syn_recv_sock
  tcp_check_req
  tcp_v4_do_rcv
  tcp_v4_rcv 
   :

But it's not clear to me where this refcnt will be released: 
in my case, I expect to create/cleanup kernel sockets as part 
of ->init/->exit for my module, but because the accept socket 
has a netns refcnt, it blocks cleanup_net(), thus my ->exit 
pernet_subsys op cannot run and clean this up, and we have a leak.

I think that sk_clone_lock() should only do a get_net() if the parent
is not a kernel socket (making this similar to sk_alloc()), i.e.,

diff --git a/net/core/sock.c b/net/core/sock.c
index 08f16db..371d1b7 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1497,7 +1497,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gf
sock_copy(newsk, sk);

/* SANITY */
-   get_net(sock_net(newsk));
+   if (likely(newsk->sk_net_refcnt))
+   get_net(sock_net(newsk));
sk_node_init(&newsk->sk_node);
sock_lock_init(newsk);
bh_lock_sock(newsk);

Does this sound right?

--Sowmini

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html