Re: [Cluster-devel] remove kernel_setsockopt and kernel_getsockopt v2

2020-05-21 Thread 'Christoph Hellwig'
On Thu, May 21, 2020 at 08:01:33AM +, David Laight wrote:
> How much does this increase the kernel code by?

 44 files changed, 660 insertions(+), 843 deletions(-)


> You are also replicating a lot of code making it more
> difficult to maintain.

No, I specifically don't.

> I don't think the performance of a socket option code
> really matters - it is usually done once when a socket
> is initialised and the other costs of establishing a
> connection will dominate.
> 
> Pulling the user copies outside the [gs]etsockopt switch
> statement not only reduces the code size (source and object)
> but also trivially allows kernel_[sg]etsockopt() to be added
> to the list of socket calls.
> 
> It probably isn't possible to pull the usercopies right
> out into the syscall wrapper because of some broken
> requests.

Please read through the previous discussion of the rationale and the
options.  We've been there before.

> I worried about whether getsockopt() should read the entire
> user buffer first. SCTP needs some of it often (including a
> sockaddr_storage in one case), TCP needs it once.
> However the cost of reading a few words is small, and a big
> buffer probably needs to be cleared to avoid leaking kernel
> memory if the structure has holes or fields that don't get set.
> Reading from userspace solves both issues.

As mentioned in the thread on the last series: that was my first idea, but
we have way too many sockopts, especially in obscure protocols that just
hard code the size.  The chance of breaking userspace in a way that can't
be fixed without going back to passing user pointers to get/setsockopt
is way too high to commit to such a change, unfortunately.



Re: [Cluster-devel] [PATCH 31/33] sctp: add sctp_sock_set_nodelay

2020-05-21 Thread Christoph Hellwig
On Thu, May 21, 2020 at 10:33:48AM -0300, Marcelo Ricardo Leitner wrote:
> With the patch there are now two ways of enabling nodelay.

There is exactly one way to do it for user applications, and one way
for kernel drivers.  (Actually they could just set the field directly,
which no one does for sctp, but for ipv4 a few do just that.)



Re: [Cluster-devel] remove kernel_setsockopt and kernel_getsockopt v2

2020-05-23 Thread Christoph Hellwig
On Wed, May 20, 2020 at 09:54:36PM +0200, Christoph Hellwig wrote:
> Hi Dave,
> 
> this series removes the kernel_setsockopt and kernel_getsockopt
> functions, and instead switches their users to small functions that
> implement setting (or in one case getting) a sockopt directly using
> a normal kernel function call with type safety and all the other
> benefits of not having a function call.
> 
> In some cases these functions seem pretty heavy handed as they do
> a lock_sock even for just setting a single variable, but this mirrors
> the real setsockopt implementation unlike a few drivers that just
> set the fields directly.

Hi Dave and other maintainers,

can you take a look at and potentially merge patches 1-30 while we
discuss the sctp refactoring?  It would get a nice headstart by removing
kernel_getsockopt and most kernel_setsockopt users, and for the next
follow on I wouldn't need to spam lots of lists with 30+ patches again.



[Cluster-devel] remove kernel_getsockopt

2020-05-27 Thread Christoph Hellwig
Hi dear maintainers,

this series reduces scope from the last round and just removes
kernel_getsockopt to avoid conflicting with the sctp cleanup series.



[Cluster-devel] [PATCH 1/2] dlm: use the tcp version of accept_from_sock for sctp as well

2020-05-27 Thread Christoph Hellwig
Apart from a few missing fixes that were only applied to the SCTP
version, the sole difference is that TCP uses ->getpeername to get the
remote address, while SCTP uses kernel_getsockopt(.. SCTP_PRIMARY_ADDR).
But given that getpeername is defined to return the primary address for
sctp, there doesn't seem to be any reason for the different way of
querying the peername, or for all the code duplication.

Signed-off-by: Christoph Hellwig 
---
 fs/dlm/lowcomms.c | 123 ++
 1 file changed, 3 insertions(+), 120 deletions(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index cdfaf4f0e11a0..f13dad0fd9ef3 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -724,7 +724,7 @@ static int receive_from_sock(struct connection *con)
 }
 
 /* Listening socket is busy, accept a connection */
-static int tcp_accept_from_sock(struct connection *con)
+static int accept_from_sock(struct connection *con)
 {
int result;
struct sockaddr_storage peeraddr;
@@ -852,123 +852,6 @@ static int tcp_accept_from_sock(struct connection *con)
return result;
 }
 
-static int sctp_accept_from_sock(struct connection *con)
-{
-   /* Check that the new node is in the lockspace */
-   struct sctp_prim prim;
-   int nodeid;
-   int prim_len, ret;
-   int addr_len;
-   struct connection *newcon;
-   struct connection *addcon;
-   struct socket *newsock;
-
-   mutex_lock(&connections_lock);
-   if (!dlm_allow_conn) {
-   mutex_unlock(&connections_lock);
-   return -1;
-   }
-   mutex_unlock(&connections_lock);
-
-   mutex_lock_nested(&con->sock_mutex, 0);
-
-   ret = kernel_accept(con->sock, &newsock, O_NONBLOCK);
-   if (ret < 0)
-   goto accept_err;
-
-   memset(&prim, 0, sizeof(struct sctp_prim));
-   prim_len = sizeof(struct sctp_prim);
-
-   ret = kernel_getsockopt(newsock, IPPROTO_SCTP, SCTP_PRIMARY_ADDR,
-   (char *)&prim, &prim_len);
-   if (ret < 0) {
-   log_print("getsockopt/sctp_primary_addr failed: %d", ret);
-   goto accept_err;
-   }
-
-   make_sockaddr(&prim.ssp_addr, 0, &addr_len);
-   ret = addr_to_nodeid(&prim.ssp_addr, &nodeid);
-   if (ret) {
-   unsigned char *b = (unsigned char *)&prim.ssp_addr;
-
-   log_print("reject connect from unknown addr");
-   print_hex_dump_bytes("ss: ", DUMP_PREFIX_NONE,
-b, sizeof(struct sockaddr_storage));
-   goto accept_err;
-   }
-
-   newcon = nodeid2con(nodeid, GFP_NOFS);
-   if (!newcon) {
-   ret = -ENOMEM;
-   goto accept_err;
-   }
-
-   mutex_lock_nested(&newcon->sock_mutex, 1);
-
-   if (newcon->sock) {
-   struct connection *othercon = newcon->othercon;
-
-   if (!othercon) {
-   othercon = kmem_cache_zalloc(con_cache, GFP_NOFS);
-   if (!othercon) {
-   log_print("failed to allocate incoming socket");
-   mutex_unlock(&newcon->sock_mutex);
-   ret = -ENOMEM;
-   goto accept_err;
-   }
-   othercon->nodeid = nodeid;
-   othercon->rx_action = receive_from_sock;
-   mutex_init(&othercon->sock_mutex);
-   INIT_LIST_HEAD(&othercon->writequeue);
-   spin_lock_init(&othercon->writequeue_lock);
-   INIT_WORK(&othercon->swork, process_send_sockets);
-   INIT_WORK(&othercon->rwork, process_recv_sockets);
-   set_bit(CF_IS_OTHERCON, &othercon->flags);
-   }
-   mutex_lock_nested(&othercon->sock_mutex, 2);
-   if (!othercon->sock) {
-   newcon->othercon = othercon;
-   add_sock(newsock, othercon);
-   addcon = othercon;
-   mutex_unlock(&othercon->sock_mutex);
-   } else {
-   printk("Extra connection from node %d attempted\n", nodeid);
-   ret = -EAGAIN;
-   mutex_unlock(&othercon->sock_mutex);
-   mutex_unlock(&newcon->sock_mutex);
-   goto accept_err;
-   }
-   } else {
-   newcon->rx_action = receive_from_sock;
-   add_sock(newsock, newcon);
-   addcon = newcon;
-   }
-
-   log_print("connected to %d", nodeid);
-
-   mutex_unlock(&newcon->sock_mut

[Cluster-devel] [PATCH 2/2] net: remove kernel_getsockopt

2020-05-27 Thread Christoph Hellwig
No users left.

Signed-off-by: Christoph Hellwig 
---
 include/linux/net.h |  2 --
 net/socket.c| 34 --
 2 files changed, 36 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 6451425e828f5..74ef5d7315f70 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -303,8 +303,6 @@ int kernel_connect(struct socket *sock, struct sockaddr *addr, int addrlen,
   int flags);
 int kernel_getsockname(struct socket *sock, struct sockaddr *addr);
 int kernel_getpeername(struct socket *sock, struct sockaddr *addr);
-int kernel_getsockopt(struct socket *sock, int level, int optname, char *optval,
- int *optlen);
 int kernel_setsockopt(struct socket *sock, int level, int optname, char *optval,
  unsigned int optlen);
 int kernel_sendpage(struct socket *sock, struct page *page, int offset,
diff --git a/net/socket.c b/net/socket.c
index 80422fc3c836e..81a98b6cbd087 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -3624,40 +3624,6 @@ int kernel_getpeername(struct socket *sock, struct sockaddr *addr)
 }
 EXPORT_SYMBOL(kernel_getpeername);
 
-/**
- * kernel_getsockopt - get a socket option (kernel space)
- * @sock: socket
- * @level: API level (SOL_SOCKET, ...)
- * @optname: option tag
- * @optval: option value
- * @optlen: option length
- *
- * Assigns the option length to @optlen.
- * Returns 0 or an error.
- */
-
-int kernel_getsockopt(struct socket *sock, int level, int optname,
-   char *optval, int *optlen)
-{
-   mm_segment_t oldfs = get_fs();
-   char __user *uoptval;
-   int __user *uoptlen;
-   int err;
-
-   uoptval = (char __user __force *) optval;
-   uoptlen = (int __user __force *) optlen;
-
-   set_fs(KERNEL_DS);
-   if (level == SOL_SOCKET)
-   err = sock_getsockopt(sock, level, optname, uoptval, uoptlen);
-   else
-   err = sock->ops->getsockopt(sock, level, optname, uoptval,
-   uoptlen);
-   set_fs(oldfs);
-   return err;
-}
-EXPORT_SYMBOL(kernel_getsockopt);
-
 /**
  * kernel_setsockopt - set a socket option (kernel space)
  * @sock: socket
-- 
2.26.2



[Cluster-devel] [PATCH 23/28] ipv6: add ip6_sock_set_v6only

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the IPV6_V6ONLY sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
---
 include/net/ipv6.h| 11 +++
 net/ipv6/ip6_udp_tunnel.c |  5 +
 net/sunrpc/svcsock.c  |  6 +-
 3 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 39a00d3ef5e22..9b91188c9a74c 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1177,4 +1177,15 @@ int ipv6_sock_mc_join_ssm(struct sock *sk, int ifindex,
  const struct in6_addr *addr, unsigned int mode);
 int ipv6_sock_mc_drop(struct sock *sk, int ifindex,
  const struct in6_addr *addr);
+
+static inline int ip6_sock_set_v6only(struct sock *sk)
+{
+   if (inet_sk(sk)->inet_num)
+   return -EINVAL;
+   lock_sock(sk);
+   sk->sk_ipv6only = true;
+   release_sock(sk);
+   return 0;
+}
+
 #endif /* _NET_IPV6_H */
diff --git a/net/ipv6/ip6_udp_tunnel.c b/net/ipv6/ip6_udp_tunnel.c
index 6523609516d25..2e0ad1bc84a83 100644
--- a/net/ipv6/ip6_udp_tunnel.c
+++ b/net/ipv6/ip6_udp_tunnel.c
@@ -25,10 +25,7 @@ int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
goto error;
 
if (cfg->ipv6_v6only) {
-   int val = 1;
-
-   err = kernel_setsockopt(sock, IPPROTO_IPV6, IPV6_V6ONLY,
-   (char *) &val, sizeof(val));
+   err = ip6_sock_set_v6only(sock->sk);
if (err < 0)
goto error;
}
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 7a805d165689c..a391892977cd2 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1328,7 +1328,6 @@ static struct svc_xprt *svc_create_socket(struct svc_serv *serv,
struct sockaddr *newsin = (struct sockaddr *)&addr;
int newlen;
int family;
-   int val;
RPC_IFDEBUG(char buf[RPC_MAX_ADDRBUFLEN]);
 
dprintk("svc: svc_create_socket(%s, %d, %s)\n",
@@ -1364,11 +1363,8 @@ static struct svc_xprt *svc_create_socket(struct svc_serv *serv,
 * getting requests from IPv4 remotes.  Those should
 * be shunted to a PF_INET listener via rpcbind.
 */
-   val = 1;
if (family == PF_INET6)
-   kernel_setsockopt(sock, SOL_IPV6, IPV6_V6ONLY,
-   (char *)&val, sizeof(val));
-
+   ip6_sock_set_v6only(sock->sk);
if (type == SOCK_STREAM)
sock->sk->sk_reuse = SK_CAN_REUSE; /* allow address reuse */
error = kernel_bind(sock, sin, len);
-- 
2.26.2



[Cluster-devel] [PATCH 28/28] tipc: call tsk_set_importance from tipc_topsrv_create_listener

2020-05-27 Thread Christoph Hellwig
Avoid using kernel_setsockopt for the TIPC_IMPORTANCE option when we can
just use the internal helper.  The only change needed is to pass a struct
sock instead of tipc_sock, which is private to socket.c

Signed-off-by: Christoph Hellwig 
---
 net/tipc/socket.c | 18 +-
 net/tipc/socket.h |  2 ++
 net/tipc/topsrv.c |  6 +++---
 3 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index d6b67d07d22ec..3734cdbedc9cc 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -196,17 +196,17 @@ static int tsk_importance(struct tipc_sock *tsk)
return msg_importance(&tsk->phdr);
 }
 
-static int tsk_set_importance(struct tipc_sock *tsk, int imp)
+static struct tipc_sock *tipc_sk(const struct sock *sk)
 {
-   if (imp > TIPC_CRITICAL_IMPORTANCE)
-   return -EINVAL;
-   msg_set_importance(&tsk->phdr, (u32)imp);
-   return 0;
+   return container_of(sk, struct tipc_sock, sk);
 }
 
-static struct tipc_sock *tipc_sk(const struct sock *sk)
+int tsk_set_importance(struct sock *sk, int imp)
 {
-   return container_of(sk, struct tipc_sock, sk);
+   if (imp > TIPC_CRITICAL_IMPORTANCE)
+   return -EINVAL;
+   msg_set_importance(&tipc_sk(sk)->phdr, (u32)imp);
+   return 0;
 }
 
 static bool tsk_conn_cong(struct tipc_sock *tsk)
@@ -2721,7 +2721,7 @@ static int tipc_accept(struct socket *sock, struct socket *new_sock, int flags,
/* Connect new socket to it's peer */
tipc_sk_finish_conn(new_tsock, msg_origport(msg), msg_orignode(msg));
 
-   tsk_set_importance(new_tsock, msg_importance(msg));
+   tsk_set_importance(new_sk, msg_importance(msg));
if (msg_named(msg)) {
new_tsock->conn_type = msg_nametype(msg);
new_tsock->conn_instance = msg_nameinst(msg);
@@ -3139,7 +3139,7 @@ static int tipc_setsockopt(struct socket *sock, int lvl, int opt,
 
switch (opt) {
case TIPC_IMPORTANCE:
-   res = tsk_set_importance(tsk, value);
+   res = tsk_set_importance(sk, value);
break;
case TIPC_SRC_DROPPABLE:
if (sock->type != SOCK_STREAM)
diff --git a/net/tipc/socket.h b/net/tipc/socket.h
index 235b9679acee4..b11575afc66fe 100644
--- a/net/tipc/socket.h
+++ b/net/tipc/socket.h
@@ -75,4 +75,6 @@ u32 tipc_sock_get_portid(struct sock *sk);
 bool tipc_sk_overlimit1(struct sock *sk, struct sk_buff *skb);
 bool tipc_sk_overlimit2(struct sock *sk, struct sk_buff *skb);
 
+int tsk_set_importance(struct sock *sk, int imp);
+
 #endif
diff --git a/net/tipc/topsrv.c b/net/tipc/topsrv.c
index 446af7bbd13e6..1489cfb941d8e 100644
--- a/net/tipc/topsrv.c
+++ b/net/tipc/topsrv.c
@@ -497,7 +497,6 @@ static void tipc_topsrv_listener_data_ready(struct sock *sk)
 
 static int tipc_topsrv_create_listener(struct tipc_topsrv *srv)
 {
-   int imp = TIPC_CRITICAL_IMPORTANCE;
struct socket *lsock = NULL;
struct sockaddr_tipc saddr;
struct sock *sk;
@@ -514,8 +513,9 @@ static int tipc_topsrv_create_listener(struct tipc_topsrv *srv)
sk->sk_user_data = srv;
write_unlock_bh(&sk->sk_callback_lock);
 
-   rc = kernel_setsockopt(lsock, SOL_TIPC, TIPC_IMPORTANCE,
-  (char *)&imp, sizeof(imp));
+   lock_sock(sk);
+   rc = tsk_set_importance(sk, TIPC_CRITICAL_IMPORTANCE);
+   release_sock(sk);
if (rc < 0)
goto err;
 
-- 
2.26.2



[Cluster-devel] [PATCH 19/28] ipv4: add ip_sock_set_freebind

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the IP_FREEBIND sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
---
 drivers/target/iscsi/iscsi_target_login.c | 13 +++--
 include/net/ip.h  |  1 +
 net/ipv4/ip_sockglue.c|  8 
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target_login.c b/drivers/target/iscsi/iscsi_target_login.c
index b561b07a869a0..85748e3388582 100644
--- a/drivers/target/iscsi/iscsi_target_login.c
+++ b/drivers/target/iscsi/iscsi_target_login.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include /* TCP_NODELAY */
+#include 
 #include  /* ipv6_addr_v4mapped() */
 #include 
 #include 
@@ -855,7 +856,7 @@ int iscsit_setup_np(
struct sockaddr_storage *sockaddr)
 {
struct socket *sock = NULL;
-   int backlog = ISCSIT_TCP_BACKLOG, ret, opt = 0, len;
+   int backlog = ISCSIT_TCP_BACKLOG, ret, len;
 
switch (np->np_network_transport) {
case ISCSI_TCP:
@@ -900,15 +901,7 @@ int iscsit_setup_np(
if (np->np_network_transport == ISCSI_TCP)
tcp_sock_set_nodelay(sock->sk);
sock_set_reuseaddr(sock->sk);
-
-   opt = 1;
-   ret = kernel_setsockopt(sock, IPPROTO_IP, IP_FREEBIND,
-   (char *)&opt, sizeof(opt));
-   if (ret < 0) {
-   pr_err("kernel_setsockopt() for IP_FREEBIND"
-   " failed\n");
-   goto fail;
-   }
+   ip_sock_set_freebind(sock->sk);
 
ret = kernel_bind(sock, (struct sockaddr *)&np->np_sockaddr, len);
if (ret < 0) {
diff --git a/include/net/ip.h b/include/net/ip.h
index 2fc52e26fa88b..5f5d8226b6abc 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -765,6 +765,7 @@ static inline bool inetdev_valid_mtu(unsigned int mtu)
return likely(mtu >= IPV4_MIN_MTU);
 }
 
+void ip_sock_set_freebind(struct sock *sk);
 void ip_sock_set_tos(struct sock *sk, int val);
 
 #endif /* _IP_H */
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index b43a29e11f4a5..767838d2030d8 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -581,6 +581,14 @@ void ip_sock_set_tos(struct sock *sk, int val)
 }
 EXPORT_SYMBOL(ip_sock_set_tos);
 
+void ip_sock_set_freebind(struct sock *sk)
+{
+   lock_sock(sk);
+   inet_sk(sk)->freebind = true;
+   release_sock(sk);
+}
+EXPORT_SYMBOL(ip_sock_set_freebind);
+
 /*
  * Socket option code for IP. This is the end of the line after any
  * TCP,UDP etc options on an IP socket.
-- 
2.26.2



[Cluster-devel] [PATCH 15/28] tcp: add tcp_sock_set_keepidle

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the TCP_KEEPIDLE sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
---
 include/linux/tcp.h   |  1 +
 net/ipv4/tcp.c| 49 ++-
 net/rds/tcp_listen.c  |  5 +
 net/sunrpc/xprtsock.c |  3 +--
 4 files changed, 37 insertions(+), 21 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index de682143efe4d..5724dd84a85ed 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -498,6 +498,7 @@ int tcp_skb_shift(struct sk_buff *to, struct sk_buff *from, int pcount,
  int shiftlen);
 
 void tcp_sock_set_cork(struct sock *sk, bool on);
+int tcp_sock_set_keepidle(struct sock *sk, int val);
 void tcp_sock_set_nodelay(struct sock *sk);
 void tcp_sock_set_quickack(struct sock *sk, int val);
 int tcp_sock_set_syncnt(struct sock *sk, int val);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0004bd9ae7b0a..bdf0ff9333514 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2901,6 +2901,39 @@ void tcp_sock_set_user_timeout(struct sock *sk, u32 val)
 }
 EXPORT_SYMBOL(tcp_sock_set_user_timeout);
 
+static int __tcp_sock_set_keepidle(struct sock *sk, int val)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+
+   if (val < 1 || val > MAX_TCP_KEEPIDLE)
+   return -EINVAL;
+
+   tp->keepalive_time = val * HZ;
+   if (sock_flag(sk, SOCK_KEEPOPEN) &&
+   !((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))) {
+   u32 elapsed = keepalive_time_elapsed(tp);
+
+   if (tp->keepalive_time > elapsed)
+   elapsed = tp->keepalive_time - elapsed;
+   else
+   elapsed = 0;
+   inet_csk_reset_keepalive_timer(sk, elapsed);
+   }
+
+   return 0;
+}
+
+int tcp_sock_set_keepidle(struct sock *sk, int val)
+{
+   int err;
+
+   lock_sock(sk);
+   err = __tcp_sock_set_keepidle(sk, val);
+   release_sock(sk);
+   return err;
+}
+EXPORT_SYMBOL(tcp_sock_set_keepidle);
+
 /*
  * Socket option code for TCP.
  */
@@ -3070,21 +3103,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
break;
 
case TCP_KEEPIDLE:
-   if (val < 1 || val > MAX_TCP_KEEPIDLE)
-   err = -EINVAL;
-   else {
-   tp->keepalive_time = val * HZ;
-   if (sock_flag(sk, SOCK_KEEPOPEN) &&
-   !((1 << sk->sk_state) &
- (TCPF_CLOSE | TCPF_LISTEN))) {
-   u32 elapsed = keepalive_time_elapsed(tp);
-   if (tp->keepalive_time > elapsed)
-   elapsed = tp->keepalive_time - elapsed;
-   else
-   elapsed = 0;
-   inet_csk_reset_keepalive_timer(sk, elapsed);
-   }
-   }
+   err = __tcp_sock_set_keepidle(sk, val);
break;
case TCP_KEEPINTVL:
if (val < 1 || val > MAX_TCP_KEEPINTVL)
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 6f90ea077adcd..79f9adc008114 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -52,10 +52,7 @@ int rds_tcp_keepalive(struct socket *sock)
if (ret < 0)
goto bail;
 
-   ret = kernel_setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE,
-   (char *)&keepidle, sizeof(keepidle));
-   if (ret < 0)
-   goto bail;
+   tcp_sock_set_keepidle(sock->sk, keepidle);
 
/* KEEPINTVL is the interval between successive probes. We follow
 * the model in xs_tcp_finish_connecting() and re-use keepidle.
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 231fd6162f68d..473290f7c5c0a 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2107,8 +2107,7 @@ static void xs_tcp_set_socket_timeouts(struct rpc_xprt *xprt,
 
/* TCP Keepalive options */
sock_set_keepalive(sock->sk);
-   kernel_setsockopt(sock, SOL_TCP, TCP_KEEPIDLE,
-   (char *)&keepidle, sizeof(keepidle));
+   tcp_sock_set_keepidle(sock->sk, keepidle);
kernel_setsockopt(sock, SOL_TCP, TCP_KEEPINTVL,
(char *)&keepidle, sizeof(keepidle));
kernel_setsockopt(sock, SOL_TCP, TCP_KEEPCNT,
-- 
2.26.2



[Cluster-devel] [PATCH 27/28] rxrpc: add rxrpc_sock_set_min_security_level

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the RXRPC_MIN_SECURITY_LEVEL sockopt from
kernel space without going through a fake uaccess.

Thanks to David Howells for the documentation updates.

Signed-off-by: Christoph Hellwig 
Acked-by: David Howells 
---
 Documentation/networking/rxrpc.rst | 13 +++--
 fs/afs/rxrpc.c |  6 ++
 include/net/af_rxrpc.h |  2 ++
 net/rxrpc/af_rxrpc.c   | 13 +
 4 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/Documentation/networking/rxrpc.rst b/Documentation/networking/rxrpc.rst
index 5ad35113d0f46..68552b92dc442 100644
--- a/Documentation/networking/rxrpc.rst
+++ b/Documentation/networking/rxrpc.rst
@@ -477,7 +477,7 @@ AF_RXRPC sockets support a few socket options at the SOL_RXRPC level:
 Encrypted checksum plus packet padded and first eight bytes of packet
 encrypted - which includes the actual packet length.
 
- (c) RXRPC_SECURITY_ENCRYPTED
+ (c) RXRPC_SECURITY_ENCRYPT
 
 Encrypted checksum plus entire packet padded and encrypted, including
 actual packet length.
@@ -578,7 +578,7 @@ A client would issue an operation by:
  This issues a request_key() to get the key representing the security
  context.  The minimum security level can be set::
 
-   unsigned int sec = RXRPC_SECURITY_ENCRYPTED;
+   unsigned int sec = RXRPC_SECURITY_ENCRYPT;
setsockopt(client, SOL_RXRPC, RXRPC_MIN_SECURITY_LEVEL,
   &sec, sizeof(sec));
 
@@ -1090,6 +1090,15 @@ The kernel interface functions are as follows:
  jiffies).  In the event of the timeout occurring, the call will be
  aborted and -ETIME or -ETIMEDOUT will be returned.
 
+ (#) Apply the RXRPC_MIN_SECURITY_LEVEL sockopt to a socket from within the
+ kernel::
+
+   int rxrpc_sock_set_min_security_level(struct sock *sk,
+unsigned int val);
+
+ This specifies the minimum security level required for calls on this
+ socket.
+
 
 Configurable Parameters
 ===
diff --git a/fs/afs/rxrpc.c b/fs/afs/rxrpc.c
index 1ecc67da6c1a4..e313dae01674f 100644
--- a/fs/afs/rxrpc.c
+++ b/fs/afs/rxrpc.c
@@ -37,7 +37,6 @@ int afs_open_socket(struct afs_net *net)
 {
struct sockaddr_rxrpc srx;
struct socket *socket;
-   unsigned int min_level;
int ret;
 
_enter("");
@@ -57,9 +56,8 @@ int afs_open_socket(struct afs_net *net)
srx.transport.sin6.sin6_family  = AF_INET6;
srx.transport.sin6.sin6_port= htons(AFS_CM_PORT);
 
-   min_level = RXRPC_SECURITY_ENCRYPT;
-   ret = kernel_setsockopt(socket, SOL_RXRPC, RXRPC_MIN_SECURITY_LEVEL,
-   (void *)&min_level, sizeof(min_level));
+   ret = rxrpc_sock_set_min_security_level(socket->sk,
+   RXRPC_SECURITY_ENCRYPT);
if (ret < 0)
goto error_2;
 
diff --git a/include/net/af_rxrpc.h b/include/net/af_rxrpc.h
index ab988940bf045..91eacbdcf33d2 100644
--- a/include/net/af_rxrpc.h
+++ b/include/net/af_rxrpc.h
@@ -72,4 +72,6 @@ bool rxrpc_kernel_call_is_complete(struct rxrpc_call *);
 void rxrpc_kernel_set_max_life(struct socket *, struct rxrpc_call *,
   unsigned long);
 
+int rxrpc_sock_set_min_security_level(struct sock *sk, unsigned int val);
+
 #endif /* _NET_RXRPC_H */
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index 15ee92d795815..394189b81849f 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -571,6 +571,19 @@ static int rxrpc_sendmsg(struct socket *sock, struct msghdr *m, size_t len)
return ret;
 }
 
+int rxrpc_sock_set_min_security_level(struct sock *sk, unsigned int val)
+{
+   if (sk->sk_state != RXRPC_UNBOUND)
+   return -EISCONN;
+   if (val > RXRPC_SECURITY_MAX)
+   return -EINVAL;
+   lock_sock(sk);
+   rxrpc_sk(sk)->min_sec_level = val;
+   release_sock(sk);
+   return 0;
+}
+EXPORT_SYMBOL(rxrpc_sock_set_min_security_level);
+
 /*
  * set RxRPC socket options
  */
-- 
2.26.2



[Cluster-devel] [PATCH 22/28] ipv4: add ip_sock_set_pktinfo

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the IP_PKTINFO sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
---
 include/net/ip.h   | 1 +
 net/ipv4/ip_sockglue.c | 8 
 net/sunrpc/svcsock.c   | 5 ++---
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index d3649c49dd333..04ebe7bf54c6a 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -767,6 +767,7 @@ static inline bool inetdev_valid_mtu(unsigned int mtu)
 
 void ip_sock_set_freebind(struct sock *sk);
 int ip_sock_set_mtu_discover(struct sock *sk, int val);
+void ip_sock_set_pktinfo(struct sock *sk);
 void ip_sock_set_recverr(struct sock *sk);
 void ip_sock_set_tos(struct sock *sk, int val);
 
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index aa115be11dcfb..84ec3703c9091 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -608,6 +608,14 @@ int ip_sock_set_mtu_discover(struct sock *sk, int val)
 }
 EXPORT_SYMBOL(ip_sock_set_mtu_discover);
 
+void ip_sock_set_pktinfo(struct sock *sk)
+{
+   lock_sock(sk);
+   inet_sk(sk)->cmsg_flags |= IP_CMSG_PKTINFO;
+   release_sock(sk);
+}
+EXPORT_SYMBOL(ip_sock_set_pktinfo);
+
 /*
  * Socket option code for IP. This is the end of the line after any
  * TCP,UDP etc options on an IP socket.
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 6773dacc64d8e..7a805d165689c 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -616,9 +616,8 @@ static void svc_udp_init(struct svc_sock *svsk, struct svc_serv *serv)
/* make sure we get destination address info */
switch (svsk->sk_sk->sk_family) {
case AF_INET:
-   level = SOL_IP;
-   optname = IP_PKTINFO;
-   break;
+   ip_sock_set_pktinfo(svsk->sk_sock->sk);
+   return;
case AF_INET6:
level = SOL_IPV6;
optname = IPV6_RECVPKTINFO;
-- 
2.26.2



[Cluster-devel] [PATCH 20/28] ipv4: add ip_sock_set_recverr

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the IP_RECVERR sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
Reviewed-by: David Howells 
---
 include/net/ip.h | 1 +
 net/ipv4/ip_sockglue.c   | 8 
 net/rxrpc/local_object.c | 8 +---
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 5f5d8226b6abc..f063a491b9063 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -766,6 +766,7 @@ static inline bool inetdev_valid_mtu(unsigned int mtu)
 }
 
 void ip_sock_set_freebind(struct sock *sk);
+void ip_sock_set_recverr(struct sock *sk);
 void ip_sock_set_tos(struct sock *sk, int val);
 
 #endif /* _IP_H */
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 767838d2030d8..aca6b81da9bae 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -589,6 +589,14 @@ void ip_sock_set_freebind(struct sock *sk)
 }
 EXPORT_SYMBOL(ip_sock_set_freebind);
 
+void ip_sock_set_recverr(struct sock *sk)
+{
+   lock_sock(sk);
+   inet_sk(sk)->recverr = true;
+   release_sock(sk);
+}
+EXPORT_SYMBOL(ip_sock_set_recverr);
+
 /*
  * Socket option code for IP. This is the end of the line after any
  * TCP,UDP etc options on an IP socket.
diff --git a/net/rxrpc/local_object.c b/net/rxrpc/local_object.c
index 5ea2bd01fdd59..4c0e8fe5ec1fb 100644
--- a/net/rxrpc/local_object.c
+++ b/net/rxrpc/local_object.c
@@ -171,13 +171,7 @@ static int rxrpc_open_socket(struct rxrpc_local *local, struct net *net)
/* Fall through */
case AF_INET:
/* we want to receive ICMP errors */
-   opt = 1;
-   ret = kernel_setsockopt(local->socket, SOL_IP, IP_RECVERR,
-   (char *) &opt, sizeof(opt));
-   if (ret < 0) {
-   _debug("setsockopt failed");
-   goto error;
-   }
+   ip_sock_set_recverr(local->socket->sk);
 
/* we want to set the don't fragment bit */
opt = IP_PMTUDISC_DO;
-- 
2.26.2



[Cluster-devel] [PATCH 21/28] ipv4: add ip_sock_set_mtu_discover

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the IP_MTU_DISCOVER sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
Reviewed-by: David Howells  [rxrpc bits]
---
 include/net/ip.h |  1 +
 net/ipv4/ip_sockglue.c   | 11 +++
 net/rxrpc/local_object.c |  8 +---
 net/rxrpc/output.c   | 14 +-
 4 files changed, 18 insertions(+), 16 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index f063a491b9063..d3649c49dd333 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -766,6 +766,7 @@ static inline bool inetdev_valid_mtu(unsigned int mtu)
 }
 
 void ip_sock_set_freebind(struct sock *sk);
+int ip_sock_set_mtu_discover(struct sock *sk, int val);
 void ip_sock_set_recverr(struct sock *sk);
 void ip_sock_set_tos(struct sock *sk, int val);
 
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index aca6b81da9bae..aa115be11dcfb 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -597,6 +597,17 @@ void ip_sock_set_recverr(struct sock *sk)
 }
 EXPORT_SYMBOL(ip_sock_set_recverr);
 
+int ip_sock_set_mtu_discover(struct sock *sk, int val)
+{
+   if (val < IP_PMTUDISC_DONT || val > IP_PMTUDISC_OMIT)
+   return -EINVAL;
+   lock_sock(sk);
+   inet_sk(sk)->pmtudisc = val;
+   release_sock(sk);
+   return 0;
+}
+EXPORT_SYMBOL(ip_sock_set_mtu_discover);
+
 /*
  * Socket option code for IP. This is the end of the line after any
  * TCP,UDP etc options on an IP socket.
diff --git a/net/rxrpc/local_object.c b/net/rxrpc/local_object.c
index 4c0e8fe5ec1fb..6f4e6b4817cf2 100644
--- a/net/rxrpc/local_object.c
+++ b/net/rxrpc/local_object.c
@@ -174,13 +174,7 @@ static int rxrpc_open_socket(struct rxrpc_local *local, struct net *net)
ip_sock_set_recverr(local->socket->sk);
 
/* we want to set the don't fragment bit */
-   opt = IP_PMTUDISC_DO;
-   ret = kernel_setsockopt(local->socket, SOL_IP, IP_MTU_DISCOVER,
-   (char *) &opt, sizeof(opt));
-   if (ret < 0) {
-   _debug("setsockopt failed");
-   goto error;
-   }
+   ip_sock_set_mtu_discover(local->socket->sk, IP_PMTUDISC_DO);
 
/* We want receive timestamps. */
sock_enable_timestamps(local->socket->sk);
diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c
index f8b632a5c6197..1ba43c3df4adb 100644
--- a/net/rxrpc/output.c
+++ b/net/rxrpc/output.c
@@ -321,7 +321,7 @@ int rxrpc_send_data_packet(struct rxrpc_call *call, struct sk_buff *skb,
struct kvec iov[2];
rxrpc_serial_t serial;
size_t len;
-   int ret, opt;
+   int ret;
 
_enter(",{%d}", skb->len);
 
@@ -473,18 +473,14 @@ int rxrpc_send_data_packet(struct rxrpc_call *call, struct sk_buff *skb,
switch (conn->params.local->srx.transport.family) {
case AF_INET6:
case AF_INET:
-   opt = IP_PMTUDISC_DONT;
-   kernel_setsockopt(conn->params.local->socket,
- SOL_IP, IP_MTU_DISCOVER,
- (char *)&opt, sizeof(opt));
+   ip_sock_set_mtu_discover(conn->params.local->socket->sk,
+   IP_PMTUDISC_DONT);
ret = kernel_sendmsg(conn->params.local->socket, &msg,
 iov, 2, len);
conn->params.peer->last_tx_at = ktime_get_seconds();
 
-   opt = IP_PMTUDISC_DO;
-   kernel_setsockopt(conn->params.local->socket,
- SOL_IP, IP_MTU_DISCOVER,
- (char *)&opt, sizeof(opt));
+   ip_sock_set_mtu_discover(conn->params.local->socket->sk,
+   IP_PMTUDISC_DO);
break;
 
default:
-- 
2.26.2



[Cluster-devel] [PATCH 25/28] ipv6: add ip6_sock_set_addr_preferences

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the IPV6_ADDR_PREFERENCES sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
---
 include/net/ipv6.h   | 67 
 net/ipv6/ipv6_sockglue.c | 59 +--
 net/sunrpc/xprtsock.c|  7 +++--
 3 files changed, 72 insertions(+), 61 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 49c4abf991489..9a90759830162 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1195,4 +1195,71 @@ static inline void ip6_sock_set_recverr(struct sock *sk)
release_sock(sk);
 }
 
+static inline int __ip6_sock_set_addr_preferences(struct sock *sk, int val)
+{
+   unsigned int pref = 0;
+   unsigned int prefmask = ~0;
+
+   /* check PUBLIC/TMP/PUBTMP_DEFAULT conflicts */
+   switch (val & (IPV6_PREFER_SRC_PUBLIC |
+  IPV6_PREFER_SRC_TMP |
+  IPV6_PREFER_SRC_PUBTMP_DEFAULT)) {
+   case IPV6_PREFER_SRC_PUBLIC:
+   pref |= IPV6_PREFER_SRC_PUBLIC;
+   prefmask &= ~(IPV6_PREFER_SRC_PUBLIC |
+ IPV6_PREFER_SRC_TMP);
+   break;
+   case IPV6_PREFER_SRC_TMP:
+   pref |= IPV6_PREFER_SRC_TMP;
+   prefmask &= ~(IPV6_PREFER_SRC_PUBLIC |
+ IPV6_PREFER_SRC_TMP);
+   break;
+   case IPV6_PREFER_SRC_PUBTMP_DEFAULT:
+   prefmask &= ~(IPV6_PREFER_SRC_PUBLIC |
+ IPV6_PREFER_SRC_TMP);
+   break;
+   case 0:
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   /* check HOME/COA conflicts */
+   switch (val & (IPV6_PREFER_SRC_HOME | IPV6_PREFER_SRC_COA)) {
+   case IPV6_PREFER_SRC_HOME:
+   prefmask &= ~IPV6_PREFER_SRC_COA;
+   break;
+   case IPV6_PREFER_SRC_COA:
+   pref |= IPV6_PREFER_SRC_COA;
+   break;
+   case 0:
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   /* check CGA/NONCGA conflicts */
+   switch (val & (IPV6_PREFER_SRC_CGA|IPV6_PREFER_SRC_NONCGA)) {
+   case IPV6_PREFER_SRC_CGA:
+   case IPV6_PREFER_SRC_NONCGA:
+   case 0:
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   inet6_sk(sk)->srcprefs = (inet6_sk(sk)->srcprefs & prefmask) | pref;
+   return 0;
+}
+
+static inline int ip6_sock_set_addr_preferences(struct sock *sk, int val)
+{
+   int ret;
+
+   lock_sock(sk);
+   ret = __ip6_sock_set_addr_preferences(sk, val);
+   release_sock(sk);
+   return ret;
+}
+
 #endif /* _NET_IPV6_H */
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index e10258c2210e8..adbfed6adf11c 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -845,67 +845,10 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
break;
 
case IPV6_ADDR_PREFERENCES:
-   {
-   unsigned int pref = 0;
-   unsigned int prefmask = ~0;
-
if (optlen < sizeof(int))
goto e_inval;
-
-   retv = -EINVAL;
-
-   /* check PUBLIC/TMP/PUBTMP_DEFAULT conflicts */
-   switch (val & (IPV6_PREFER_SRC_PUBLIC|
-  IPV6_PREFER_SRC_TMP|
-  IPV6_PREFER_SRC_PUBTMP_DEFAULT)) {
-   case IPV6_PREFER_SRC_PUBLIC:
-   pref |= IPV6_PREFER_SRC_PUBLIC;
-   break;
-   case IPV6_PREFER_SRC_TMP:
-   pref |= IPV6_PREFER_SRC_TMP;
-   break;
-   case IPV6_PREFER_SRC_PUBTMP_DEFAULT:
-   break;
-   case 0:
-   goto pref_skip_pubtmp;
-   default:
-   goto e_inval;
-   }
-
-   prefmask &= ~(IPV6_PREFER_SRC_PUBLIC|
- IPV6_PREFER_SRC_TMP);
-pref_skip_pubtmp:
-
-   /* check HOME/COA conflicts */
-   switch (val & (IPV6_PREFER_SRC_HOME|IPV6_PREFER_SRC_COA)) {
-   case IPV6_PREFER_SRC_HOME:
-   break;
-   case IPV6_PREFER_SRC_COA:
-   pref |= IPV6_PREFER_SRC_COA;
-   case 0:
-   goto pref_skip_coa;
-   default:
-   goto e_inval;
-   }
-
-   prefmask &= ~IPV6_PREFER_SRC_COA;
-pref_skip_coa:
-
-   /* check CGA/NONCGA conflicts */
-   switch (val & (IPV6_PREFER_SRC_CGA|IPV6_PREFER_SRC_NONCGA)) {
-   case IPV6_PREFER_SRC_CGA:
-   case IPV6_PREFER_SRC_NONCGA:
-   case 0:
-   break;

[Cluster-devel] [PATCH 18/28] ipv4: add ip_sock_set_tos

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the IP_TOS sockopt from kernel space without
going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
Acked-by: Sagi Grimberg 
---
 drivers/nvme/host/tcp.c   | 14 +++---
 drivers/nvme/target/tcp.c | 10 ++
 include/net/ip.h  |  2 ++
 net/ipv4/ip_sockglue.c| 30 +-
 4 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 2872584f52f63..4c972d8abf317 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1313,7 +1313,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 {
struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
struct nvme_tcp_queue *queue = &ctrl->queues[qid];
-   int ret, opt, rcv_pdu_size;
+   int ret, rcv_pdu_size;
 
queue->ctrl = ctrl;
INIT_LIST_HEAD(&queue->send_list);
@@ -1352,16 +1352,8 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
sock_set_priority(queue->sock->sk, so_priority);
 
/* Set socket type of service */
-   if (nctrl->opts->tos >= 0) {
-   opt = nctrl->opts->tos;
-   ret = kernel_setsockopt(queue->sock, SOL_IP, IP_TOS,
-   (char *)&opt, sizeof(opt));
-   if (ret) {
-   dev_err(nctrl->device,
-   "failed to set IP_TOS sock opt %d\n", ret);
-   goto err_sock;
-   }
-   }
+   if (nctrl->opts->tos >= 0)
+   ip_sock_set_tos(queue->sock->sk, nctrl->opts->tos);
 
queue->sock->sk->sk_allocation = GFP_ATOMIC;
nvme_tcp_set_queue_io_cpu(queue);
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 55bc4c3c0a74a..4546049a96b37 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1452,14 +1452,8 @@ static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
sock_set_priority(sock->sk, so_priority);
 
/* Set socket type of service */
-   if (inet->rcv_tos > 0) {
-   int tos = inet->rcv_tos;
-
-   ret = kernel_setsockopt(sock, SOL_IP, IP_TOS,
-   (char *)&tos, sizeof(tos));
-   if (ret)
-   return ret;
-   }
+   if (inet->rcv_tos > 0)
+   ip_sock_set_tos(sock->sk, inet->rcv_tos);
 
write_lock_bh(&sock->sk->sk_callback_lock);
sock->sk->sk_user_data = queue;
diff --git a/include/net/ip.h b/include/net/ip.h
index 5b317c9f4470a..2fc52e26fa88b 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -765,4 +765,6 @@ static inline bool inetdev_valid_mtu(unsigned int mtu)
return likely(mtu >= IPV4_MIN_MTU);
 }
 
+void ip_sock_set_tos(struct sock *sk, int val);
+
 #endif /* _IP_H */
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index f43d5f12aa86a..b43a29e11f4a5 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -560,6 +560,26 @@ int ip_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
return err;
 }
 
+static void __ip_sock_set_tos(struct sock *sk, int val)
+{
+   if (sk->sk_type == SOCK_STREAM) {
+   val &= ~INET_ECN_MASK;
+   val |= inet_sk(sk)->tos & INET_ECN_MASK;
+   }
+   if (inet_sk(sk)->tos != val) {
+   inet_sk(sk)->tos = val;
+   sk->sk_priority = rt_tos2priority(val);
+   sk_dst_reset(sk);
+   }
+}
+
+void ip_sock_set_tos(struct sock *sk, int val)
+{
+   lock_sock(sk);
+   __ip_sock_set_tos(sk, val);
+   release_sock(sk);
+}
+EXPORT_SYMBOL(ip_sock_set_tos);
 
 /*
  * Socket option code for IP. This is the end of the line after any
@@ -823,15 +843,7 @@ static int do_ip_setsockopt(struct sock *sk, int level,
inet->cmsg_flags &= ~IP_CMSG_RECVFRAGSIZE;
break;
case IP_TOS:/* This sets both TOS and Precedence */
-   if (sk->sk_type == SOCK_STREAM) {
-   val &= ~INET_ECN_MASK;
-   val |= inet->tos & INET_ECN_MASK;
-   }
-   if (inet->tos != val) {
-   inet->tos = val;
-   sk->sk_priority = rt_tos2priority(val);
-   sk_dst_reset(sk);
-   }
+   __ip_sock_set_tos(sk, val);
break;
case IP_TTL:
if (optlen < 1)
-- 
2.26.2



[Cluster-devel] [PATCH 26/28] ipv6: add ip6_sock_set_recvpktinfo

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the IPV6_RECVPKTINFO sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
---
 include/net/ipv6.h   |  7 +++
 net/sunrpc/svcsock.c | 10 ++
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 9a90759830162..5e65bf2fd32d0 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1262,4 +1262,11 @@ static inline int ip6_sock_set_addr_preferences(struct sock *sk, int val)
return ret;
 }
 
+static inline void ip6_sock_set_recvpktinfo(struct sock *sk)
+{
+   lock_sock(sk);
+   inet6_sk(sk)->rxopt.bits.rxinfo = true;
+   release_sock(sk);
+}
+
 #endif /* _NET_IPV6_H */
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index a391892977cd2..e7a0037d9b56c 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -595,8 +595,6 @@ static struct svc_xprt_class svc_udp_class = {
 
 static void svc_udp_init(struct svc_sock *svsk, struct svc_serv *serv)
 {
-   int err, level, optname, one = 1;
-
svc_xprt_init(sock_net(svsk->sk_sock->sk), &svc_udp_class,
  &svsk->sk_xprt, serv);
clear_bit(XPT_CACHE_AUTH, &svsk->sk_xprt.xpt_flags);
@@ -617,17 +615,13 @@ static void svc_udp_init(struct svc_sock *svsk, struct svc_serv *serv)
switch (svsk->sk_sk->sk_family) {
case AF_INET:
ip_sock_set_pktinfo(svsk->sk_sock->sk);
-   return;
+   break;
case AF_INET6:
-   level = SOL_IPV6;
-   optname = IPV6_RECVPKTINFO;
+   ip6_sock_set_recvpktinfo(svsk->sk_sock->sk);
break;
default:
BUG();
}
-   err = kernel_setsockopt(svsk->sk_sock, level, optname,
-   (char *)&one, sizeof(one));
-   dprintk("svc: kernel_setsockopt returned %d\n", err);
 }
 
 /*
-- 
2.26.2



[Cluster-devel] [PATCH 24/28] ipv6: add ip6_sock_set_recverr

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the IPV6_RECVERR sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
Reviewed-by: David Howells 
---
 include/net/ipv6.h   |  7 +++
 net/rxrpc/local_object.c | 10 ++
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 9b91188c9a74c..49c4abf991489 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1188,4 +1188,11 @@ static inline int ip6_sock_set_v6only(struct sock *sk)
return 0;
 }
 
+static inline void ip6_sock_set_recverr(struct sock *sk)
+{
+   lock_sock(sk);
+   inet6_sk(sk)->recverr = true;
+   release_sock(sk);
+}
+
 #endif /* _NET_IPV6_H */
diff --git a/net/rxrpc/local_object.c b/net/rxrpc/local_object.c
index 6f4e6b4817cf2..c8b2097f499c0 100644
--- a/net/rxrpc/local_object.c
+++ b/net/rxrpc/local_object.c
@@ -107,7 +107,7 @@ static struct rxrpc_local *rxrpc_alloc_local(struct rxrpc_net *rxnet,
 static int rxrpc_open_socket(struct rxrpc_local *local, struct net *net)
 {
struct sock *usk;
-   int ret, opt;
+   int ret;
 
_enter("%p{%d,%d}",
   local, local->srx.transport_type, local->srx.transport.family);
@@ -157,13 +157,7 @@ static int rxrpc_open_socket(struct rxrpc_local *local, struct net *net)
switch (local->srx.transport.family) {
case AF_INET6:
/* we want to receive ICMPv6 errors */
-   opt = 1;
-   ret = kernel_setsockopt(local->socket, SOL_IPV6, IPV6_RECVERR,
-   (char *) &opt, sizeof(opt));
-   if (ret < 0) {
-   _debug("setsockopt failed");
-   goto error;
-   }
+   ip6_sock_set_recverr(local->socket->sk);
 
/* Fall through and set IPv4 options too otherwise we don't get
 * errors from IPv4 packets sent through the IPv6 socket.
-- 
2.26.2



[Cluster-devel] [PATCH 01/28] net: add sock_set_reuseaddr

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the SO_REUSEADDR sockopt from kernel space
without going through a fake uaccess.

For this the iscsi target now has to formally depend on inet to avoid
a mostly theoretical compile failure.  For actual operation it already
did depend on having ipv4 or ipv6 support.

Signed-off-by: Christoph Hellwig 
Acked-by: Sagi Grimberg 
---
 drivers/infiniband/sw/siw/siw_cm.c| 18 +-
 drivers/nvme/target/tcp.c |  8 +---
 drivers/target/iscsi/Kconfig  |  2 +-
 drivers/target/iscsi/iscsi_target_login.c |  9 +
 fs/dlm/lowcomms.c |  6 +-
 include/net/sock.h|  2 ++
 net/core/sock.c   |  8 
 7 files changed, 19 insertions(+), 34 deletions(-)

diff --git a/drivers/infiniband/sw/siw/siw_cm.c b/drivers/infiniband/sw/siw/siw_cm.c
index 559e5fd3bad8b..d1860f3e87401 100644
--- a/drivers/infiniband/sw/siw/siw_cm.c
+++ b/drivers/infiniband/sw/siw/siw_cm.c
@@ -1312,17 +1312,14 @@ static void siw_cm_llp_state_change(struct sock *sk)
 static int kernel_bindconnect(struct socket *s, struct sockaddr *laddr,
  struct sockaddr *raddr)
 {
-   int rv, flags = 0, s_val = 1;
+   int rv, flags = 0;
size_t size = laddr->sa_family == AF_INET ?
sizeof(struct sockaddr_in) : sizeof(struct sockaddr_in6);
 
/*
 * Make address available again asap.
 */
-   rv = kernel_setsockopt(s, SOL_SOCKET, SO_REUSEADDR, (char *)&s_val,
-  sizeof(s_val));
-   if (rv < 0)
-   return rv;
+   sock_set_reuseaddr(s->sk);
 
rv = s->ops->bind(s, laddr, size);
if (rv < 0)
@@ -1781,7 +1778,7 @@ int siw_create_listen(struct iw_cm_id *id, int backlog)
struct siw_cep *cep = NULL;
struct siw_device *sdev = to_siw_dev(id->device);
int addr_family = id->local_addr.ss_family;
-   int rv = 0, s_val;
+   int rv = 0;
 
if (addr_family != AF_INET && addr_family != AF_INET6)
return -EAFNOSUPPORT;
@@ -1793,13 +1790,8 @@ int siw_create_listen(struct iw_cm_id *id, int backlog)
/*
 * Allow binding local port when still in TIME_WAIT from last close.
 */
-   s_val = 1;
-   rv = kernel_setsockopt(s, SOL_SOCKET, SO_REUSEADDR, (char *)&s_val,
-  sizeof(s_val));
-   if (rv) {
-   siw_dbg(id->device, "setsockopt error: %d\n", rv);
-   goto error;
-   }
+   sock_set_reuseaddr(s->sk);
+
if (addr_family == AF_INET) {
struct sockaddr_in *laddr = &to_sockaddr_in(id->local_addr);
 
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index f0da04e960f40..40757a63f4553 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1632,6 +1632,7 @@ static int nvmet_tcp_add_port(struct nvmet_port *nport)
port->sock->sk->sk_user_data = port;
port->data_ready = port->sock->sk->sk_data_ready;
port->sock->sk->sk_data_ready = nvmet_tcp_listen_data_ready;
+   sock_set_reuseaddr(port->sock->sk);
 
opt = 1;
ret = kernel_setsockopt(port->sock, IPPROTO_TCP,
@@ -1641,13 +1642,6 @@ static int nvmet_tcp_add_port(struct nvmet_port *nport)
goto err_sock;
}
 
-   ret = kernel_setsockopt(port->sock, SOL_SOCKET, SO_REUSEADDR,
-   (char *)&opt, sizeof(opt));
-   if (ret) {
-   pr_err("failed to set SO_REUSEADDR sock opt %d\n", ret);
-   goto err_sock;
-   }
-
if (so_priority > 0) {
ret = kernel_setsockopt(port->sock, SOL_SOCKET, SO_PRIORITY,
(char *)&so_priority, sizeof(so_priority));
diff --git a/drivers/target/iscsi/Kconfig b/drivers/target/iscsi/Kconfig
index 1f93ea3813536..922484ea4e304 100644
--- a/drivers/target/iscsi/Kconfig
+++ b/drivers/target/iscsi/Kconfig
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0-only
 config ISCSI_TARGET
tristate "Linux-iSCSI.org iSCSI Target Mode Stack"
-   depends on NET
+   depends on INET
select CRYPTO
select CRYPTO_CRC32C
select CRYPTO_CRC32C_INTEL if X86
diff --git a/drivers/target/iscsi/iscsi_target_login.c b/drivers/target/iscsi/iscsi_target_login.c
index 731ee67fe914b..91acb3f07b4cc 100644
--- a/drivers/target/iscsi/iscsi_target_login.c
+++ b/drivers/target/iscsi/iscsi_target_login.c
@@ -909,14 +909,7 @@ int iscsit_setup_np(
}
}
 
-   /* FIXME: Someone please explain why this is endian-safe */
-   ret = kernel_setsockopt(sock, SOL_SOCKET, SO_REUSEADDR,
-   (char *)&opt, sizeof(opt));
-   if (ret < 0) {
-   pr_err(&qu

[Cluster-devel] [PATCH 05/28] net: add sock_bindtoindex

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the SO_BINDTOIFINDEX sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
---
 include/net/sock.h|  1 +
 net/core/sock.c   | 21 +++--
 net/ipv4/udp_tunnel.c |  4 +---
 net/ipv6/ip6_udp_tunnel.c |  4 +---
 4 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 9a7b9e98685ac..cdec7bc055d5b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2688,6 +2688,7 @@ static inline bool sk_dev_equal_l3scope(struct sock *sk, int dif)
 
 void sock_def_readable(struct sock *sk);
 
+int sock_bindtoindex(struct sock *sk, int ifindex);
 void sock_no_linger(struct sock *sk);
 void sock_set_priority(struct sock *sk, u32 priority);
 void sock_set_reuseaddr(struct sock *sk);
diff --git a/net/core/sock.c b/net/core/sock.c
index d3b1d61e4f768..23f80880fbb2c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -566,7 +566,7 @@ struct dst_entry *sk_dst_check(struct sock *sk, u32 cookie)
 }
 EXPORT_SYMBOL(sk_dst_check);
 
-static int sock_setbindtodevice_locked(struct sock *sk, int ifindex)
+static int sock_bindtoindex_locked(struct sock *sk, int ifindex)
 {
int ret = -ENOPROTOOPT;
 #ifdef CONFIG_NETDEVICES
@@ -594,6 +594,18 @@ static int sock_setbindtodevice_locked(struct sock *sk, int ifindex)
return ret;
 }
 
+int sock_bindtoindex(struct sock *sk, int ifindex)
+{
+   int ret;
+
+   lock_sock(sk);
+   ret = sock_bindtoindex_locked(sk, ifindex);
+   release_sock(sk);
+
+   return ret;
+}
+EXPORT_SYMBOL(sock_bindtoindex);
+
 static int sock_setbindtodevice(struct sock *sk, char __user *optval,
int optlen)
 {
@@ -634,10 +646,7 @@ static int sock_setbindtodevice(struct sock *sk, char __user *optval,
goto out;
}
 
-   lock_sock(sk);
-   ret = sock_setbindtodevice_locked(sk, index);
-   release_sock(sk);
-
+   return sock_bindtoindex(sk, index);
 out:
 #endif
 
@@ -1216,7 +1225,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
break;
 
case SO_BINDTOIFINDEX:
-   ret = sock_setbindtodevice_locked(sk, val);
+   ret = sock_bindtoindex_locked(sk, val);
break;
 
default:
diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
index 150e6f0fdbf59..2158e8bddf41c 100644
--- a/net/ipv4/udp_tunnel.c
+++ b/net/ipv4/udp_tunnel.c
@@ -22,9 +22,7 @@ int udp_sock_create4(struct net *net, struct udp_port_cfg *cfg,
goto error;
 
if (cfg->bind_ifindex) {
-   err = kernel_setsockopt(sock, SOL_SOCKET, SO_BINDTOIFINDEX,
-   (void *)&cfg->bind_ifindex,
-   sizeof(cfg->bind_ifindex));
+   err = sock_bindtoindex(sock->sk, cfg->bind_ifindex);
if (err < 0)
goto error;
}
diff --git a/net/ipv6/ip6_udp_tunnel.c b/net/ipv6/ip6_udp_tunnel.c
index 58956a6b66a21..6523609516d25 100644
--- a/net/ipv6/ip6_udp_tunnel.c
+++ b/net/ipv6/ip6_udp_tunnel.c
@@ -33,9 +33,7 @@ int udp_sock_create6(struct net *net, struct udp_port_cfg *cfg,
goto error;
}
if (cfg->bind_ifindex) {
-   err = kernel_setsockopt(sock, SOL_SOCKET, SO_BINDTOIFINDEX,
-   (void *)&cfg->bind_ifindex,
-   sizeof(cfg->bind_ifindex));
+   err = sock_bindtoindex(sock->sk, cfg->bind_ifindex);
if (err < 0)
goto error;
}
-- 
2.26.2



[Cluster-devel] [PATCH 11/28] tcp: add tcp_sock_set_nodelay

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the TCP_NODELAY sockopt from kernel space
without going through a fake uaccess.  Clean up the callers to avoid
pointless wrappers now that this is a simple function call.

Signed-off-by: Christoph Hellwig 
Acked-by: Sagi Grimberg 
Acked-by: Jason Gunthorpe 
---
 drivers/block/drbd/drbd_int.h |  7 
 drivers/block/drbd/drbd_main.c|  2 +-
 drivers/block/drbd/drbd_receiver.c|  4 +--
 drivers/infiniband/sw/siw/siw_cm.c| 24 +++---
 drivers/nvme/host/tcp.c   |  9 +-
 drivers/nvme/target/tcp.c | 12 ++-
 drivers/target/iscsi/iscsi_target_login.c | 15 ++---
 fs/cifs/connect.c | 10 ++
 fs/dlm/lowcomms.c |  8 ++---
 fs/ocfs2/cluster/tcp.c| 20 ++--
 include/linux/tcp.h   |  1 +
 net/ceph/messenger.c  | 11 ++-
 net/ipv4/tcp.c| 39 +++
 net/rds/tcp.c | 11 +--
 net/rds/tcp.h |  1 -
 net/rds/tcp_listen.c  |  2 +-
 16 files changed, 49 insertions(+), 127 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index 3550adc93c68b..e24bba87c8e02 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1570,13 +1570,6 @@ extern void drbd_set_recv_tcq(struct drbd_device *device, int tcq_enabled);
 extern void _drbd_clear_done_ee(struct drbd_device *device, struct list_head *to_be_freed);
 extern int drbd_connected(struct drbd_peer_device *);
 
-static inline void drbd_tcp_nodelay(struct socket *sock)
-{
-   int val = 1;
-   (void) kernel_setsockopt(sock, SOL_TCP, TCP_NODELAY,
-   (char*)&val, sizeof(val));
-}
-
 static inline void drbd_tcp_quickack(struct socket *sock)
 {
int val = 2;
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index c094c3c2c5d4d..45fbd526c453b 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -660,7 +660,7 @@ static int __send_command(struct drbd_connection *connection, int vnr,
/* DRBD protocol "pings" are latency critical.
 * This is supposed to trigger tcp_push_pending_frames() */
if (!err && (cmd == P_PING || cmd == P_PING_ACK))
-   drbd_tcp_nodelay(sock->socket);
+   tcp_sock_set_nodelay(sock->socket->sk);
 
return err;
 }
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 55ea907ad33cb..20a5e94494acd 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1051,8 +1051,8 @@ static int conn_connect(struct drbd_connection *connection)
 
/* we don't want delays.
 * we use TCP_CORK where appropriate, though */
-   drbd_tcp_nodelay(sock.socket);
-   drbd_tcp_nodelay(msock.socket);
+   tcp_sock_set_nodelay(sock.socket->sk);
+   tcp_sock_set_nodelay(msock.socket->sk);
 
connection->data.socket = sock.socket;
connection->meta.socket = msock.socket;
diff --git a/drivers/infiniband/sw/siw/siw_cm.c b/drivers/infiniband/sw/siw/siw_cm.c
index d1860f3e87401..1662216be66df 100644
--- a/drivers/infiniband/sw/siw/siw_cm.c
+++ b/drivers/infiniband/sw/siw/siw_cm.c
@@ -947,16 +947,8 @@ static void siw_accept_newconn(struct siw_cep *cep)
siw_cep_get(new_cep);
new_s->sk->sk_user_data = new_cep;
 
-   if (siw_tcp_nagle == false) {
-   int val = 1;
-
-   rv = kernel_setsockopt(new_s, SOL_TCP, TCP_NODELAY,
-  (char *)&val, sizeof(val));
-   if (rv) {
-   siw_dbg_cep(cep, "setsockopt NODELAY error: %d\n", rv);
-   goto error;
-   }
-   }
+   if (siw_tcp_nagle == false)
+   tcp_sock_set_nodelay(new_s->sk);
new_cep->state = SIW_EPSTATE_AWAIT_MPAREQ;
 
rv = siw_cm_queue_work(new_cep, SIW_CM_WORK_MPATIMEOUT);
@@ -1386,16 +1378,8 @@ int siw_connect(struct iw_cm_id *id, struct iw_cm_conn_param *params)
siw_dbg_qp(qp, "kernel_bindconnect: error %d\n", rv);
goto error;
}
-   if (siw_tcp_nagle == false) {
-   int val = 1;
-
-   rv = kernel_setsockopt(s, SOL_TCP, TCP_NODELAY, (char *)&val,
-  sizeof(val));
-   if (rv) {
-   siw_dbg_qp(qp, "setsockopt NODELAY error: %d\n", rv);
-   goto error;
-   }
-   }
+   if (siw_tcp_nagle == false)
+   tcp_sock_set_nodelay(s->sk);
cep = siw_cep_alloc(sdev);
if (!cep) {
rv = -ENOMEM;
diff --gi

[Cluster-devel] [PATCH 17/28] tcp: add tcp_sock_set_keepcnt

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the TCP_KEEPCNT sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
---
 include/linux/tcp.h   |  1 +
 net/ipv4/tcp.c| 12 
 net/rds/tcp.h |  2 +-
 net/rds/tcp_listen.c  | 17 +++--
 net/sunrpc/xprtsock.c |  3 +--
 5 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 1f9bada00faab..9aac824c523cf 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -498,6 +498,7 @@ int tcp_skb_shift(struct sk_buff *to, struct sk_buff *from, int pcount,
  int shiftlen);
 
 void tcp_sock_set_cork(struct sock *sk, bool on);
+int tcp_sock_set_keepcnt(struct sock *sk, int val);
 int tcp_sock_set_keepidle(struct sock *sk, int val);
 int tcp_sock_set_keepintvl(struct sock *sk, int val);
 void tcp_sock_set_nodelay(struct sock *sk);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7eb083e09786a..15d47d5e79510 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2946,6 +2946,18 @@ int tcp_sock_set_keepintvl(struct sock *sk, int val)
 }
 EXPORT_SYMBOL(tcp_sock_set_keepintvl);
 
+int tcp_sock_set_keepcnt(struct sock *sk, int val)
+{
+   if (val < 1 || val > MAX_TCP_KEEPCNT)
+   return -EINVAL;
+
+   lock_sock(sk);
+   tcp_sk(sk)->keepalive_probes = val;
+   release_sock(sk);
+   return 0;
+}
+EXPORT_SYMBOL(tcp_sock_set_keepcnt);
+
 /*
  * Socket option code for TCP.
  */
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index f6d75d8cb167a..bad9cf49d5657 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -70,7 +70,7 @@ struct socket *rds_tcp_listen_init(struct net *net, bool isv6);
 void rds_tcp_listen_stop(struct socket *sock, struct work_struct *acceptor);
 void rds_tcp_listen_data_ready(struct sock *sk);
 int rds_tcp_accept_one(struct socket *sock);
-int rds_tcp_keepalive(struct socket *sock);
+void rds_tcp_keepalive(struct socket *sock);
 void *rds_tcp_listen_sock_def_readable(struct net *net);
 
 /* tcp_recv.c */
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 9ad555c48d15d..101cf14215a0b 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -38,27 +38,19 @@
 #include "rds.h"
 #include "tcp.h"
 
-int rds_tcp_keepalive(struct socket *sock)
+void rds_tcp_keepalive(struct socket *sock)
 {
/* values below based on xs_udp_default_timeout */
int keepidle = 5; /* send a probe 'keepidle' secs after last data */
int keepcnt = 5; /* number of unack'ed probes before declaring dead */
-   int ret = 0;
 
sock_set_keepalive(sock->sk);
-
-   ret = kernel_setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT,
-   (char *)&keepcnt, sizeof(keepcnt));
-   if (ret < 0)
-   goto bail;
-
+   tcp_sock_set_keepcnt(sock->sk, keepcnt);
tcp_sock_set_keepidle(sock->sk, keepidle);
/* KEEPINTVL is the interval between successive probes. We follow
 * the model in xs_tcp_finish_connecting() and re-use keepidle.
 */
tcp_sock_set_keepintvl(sock->sk, keepidle);
-bail:
-   return ret;
 }
 
 /* rds_tcp_accept_one_path(): if accepting on cp_index > 0, make sure the
@@ -140,10 +132,7 @@ int rds_tcp_accept_one(struct socket *sock)
new_sock->ops = sock->ops;
__module_get(new_sock->ops->owner);
 
-   ret = rds_tcp_keepalive(new_sock);
-   if (ret < 0)
-   goto out;
-
+   rds_tcp_keepalive(new_sock);
rds_tcp_tune(new_sock);
 
inet = inet_sk(new_sock->sk);
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 5ca64e12af0c5..0d3ec055bc12f 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2109,8 +2109,7 @@ static void xs_tcp_set_socket_timeouts(struct rpc_xprt *xprt,
sock_set_keepalive(sock->sk);
tcp_sock_set_keepidle(sock->sk, keepidle);
tcp_sock_set_keepintvl(sock->sk, keepidle);
-   kernel_setsockopt(sock, SOL_TCP, TCP_KEEPCNT,
-   (char *)&keepcnt, sizeof(keepcnt));
+   tcp_sock_set_keepcnt(sock->sk, keepcnt);
 
/* TCP user timeout (see RFC5482) */
tcp_sock_set_user_timeout(sock->sk, timeo);
-- 
2.26.2



[Cluster-devel] [PATCH 10/28] tcp: add tcp_sock_set_cork

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the TCP_CORK sockopt from kernel space
without going through a fake uaccess.  Clean up the callers to avoid
pointless wrappers now that this is a simple function call.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/drbd/drbd_int.h  | 14 
 drivers/block/drbd/drbd_receiver.c |  4 +--
 drivers/block/drbd/drbd_worker.c   |  6 ++--
 fs/cifs/transport.c|  8 ++---
 include/linux/tcp.h|  2 ++
 net/ipv4/tcp.c | 51 +++---
 net/rds/tcp_send.c |  9 ++
 7 files changed, 43 insertions(+), 51 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index aae99a2d7bd40..3550adc93c68b 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1570,20 +1570,6 @@ extern void drbd_set_recv_tcq(struct drbd_device *device, int tcq_enabled);
 extern void _drbd_clear_done_ee(struct drbd_device *device, struct list_head *to_be_freed);
 extern int drbd_connected(struct drbd_peer_device *);
 
-static inline void drbd_tcp_cork(struct socket *sock)
-{
-   int val = 1;
-   (void) kernel_setsockopt(sock, SOL_TCP, TCP_CORK,
-   (char*)&val, sizeof(val));
-}
-
-static inline void drbd_tcp_uncork(struct socket *sock)
-{
-   int val = 0;
-   (void) kernel_setsockopt(sock, SOL_TCP, TCP_CORK,
-   (char*)&val, sizeof(val));
-}
-
 static inline void drbd_tcp_nodelay(struct socket *sock)
 {
int val = 1;
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index c15e7083b13a6..55ea907ad33cb 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -6162,7 +6162,7 @@ void drbd_send_acks_wf(struct work_struct *ws)
rcu_read_unlock();
 
if (tcp_cork)
-   drbd_tcp_cork(connection->meta.socket);
+   tcp_sock_set_cork(connection->meta.socket->sk, true);
 
err = drbd_finish_peer_reqs(device);
kref_put(&device->kref, drbd_destroy_device);
@@ -6175,7 +6175,7 @@ void drbd_send_acks_wf(struct work_struct *ws)
}
 
if (tcp_cork)
-   drbd_tcp_uncork(connection->meta.socket);
+   tcp_sock_set_cork(connection->meta.socket->sk, false);
 
return;
 }
diff --git a/drivers/block/drbd/drbd_worker.c b/drivers/block/drbd/drbd_worker.c
index 0dc019da1f8d0..2b89c9f2ca707 100644
--- a/drivers/block/drbd/drbd_worker.c
+++ b/drivers/block/drbd/drbd_worker.c
@@ -2098,7 +2098,7 @@ static void wait_for_work(struct drbd_connection *connection, struct list_head *
if (uncork) {
mutex_lock(&connection->data.mutex);
if (connection->data.socket)
-   drbd_tcp_uncork(connection->data.socket);
+   tcp_sock_set_cork(connection->data.socket->sk, false);
mutex_unlock(&connection->data.mutex);
}
 
@@ -2153,9 +2153,9 @@ static void wait_for_work(struct drbd_connection *connection, struct list_head *
mutex_lock(&connection->data.mutex);
if (connection->data.socket) {
if (cork)
-   drbd_tcp_cork(connection->data.socket);
+   tcp_sock_set_cork(connection->data.socket->sk, true);
else if (!uncork)
-   drbd_tcp_uncork(connection->data.socket);
+   tcp_sock_set_cork(connection->data.socket->sk, false);
}
mutex_unlock(&connection->data.mutex);
 }
diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
index c97570eb2c180..99760063e0006 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -325,7 +325,6 @@ __smb_send_rqst(struct TCP_Server_Info *server, int num_rqst,
size_t total_len = 0, sent, size;
struct socket *ssocket = server->ssocket;
struct msghdr smb_msg;
-   int val = 1;
__be32 rfc1002_marker;
 
if (cifs_rdma_enabled(server)) {
@@ -345,8 +344,7 @@ __smb_send_rqst(struct TCP_Server_Info *server, int num_rqst,
}
 
/* cork the socket */
-   kernel_setsockopt(ssocket, SOL_TCP, TCP_CORK,
-   (char *)&val, sizeof(val));
+   tcp_sock_set_cork(ssocket->sk, true);
 
for (j = 0; j < num_rqst; j++)
send_length += smb_rqst_len(server, &rqst[j]);
@@ -435,9 +433,7 @@ __smb_send_rqst(struct TCP_Server_Info *server, int num_rqst,
}
 
/* uncork it */
-   val = 0;
-   kernel_setsockopt(ssocket, SOL_TCP, TCP_CORK,
-   (char *)&val, sizeof(val));
+   tcp_sock_set_cork(ssocket->sk, false);
 
if ((total_len > 0) && (total_len != send_length)) {
cifs_dbg(FYI, "partial send (wanted=%u sent=%zu): termina

[Cluster-devel] [PATCH 14/28] tcp: add tcp_sock_set_user_timeout

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the TCP_USER_TIMEOUT sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
---
 fs/ocfs2/cluster/tcp.c | 22 ++
 include/linux/tcp.h|  1 +
 net/ipv4/tcp.c |  8 
 net/sunrpc/xprtsock.c  |  3 +--
 4 files changed, 12 insertions(+), 22 deletions(-)

diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 4c70fe9d19ab2..79a2317194600 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -1441,14 +1441,6 @@ static void o2net_rx_until_empty(struct work_struct *work)
sc_put(sc);
 }
 
-static int o2net_set_usertimeout(struct socket *sock)
-{
-   int user_timeout = O2NET_TCP_USER_TIMEOUT;
-
-   return kernel_setsockopt(sock, SOL_TCP, TCP_USER_TIMEOUT,
-   (void *)&user_timeout, sizeof(user_timeout));
-}
-
 static void o2net_initialize_handshake(void)
 {
o2net_hand->o2hb_heartbeat_timeout_ms = cpu_to_be32(
@@ -1629,12 +1621,7 @@ static void o2net_start_connect(struct work_struct *work)
}
 
tcp_sock_set_nodelay(sc->sc_sock->sk);
-
-   ret = o2net_set_usertimeout(sock);
-   if (ret) {
-   mlog(ML_ERROR, "set TCP_USER_TIMEOUT failed with %d\n", ret);
-   goto out;
-   }
+   tcp_sock_set_user_timeout(sock->sk, O2NET_TCP_USER_TIMEOUT);
 
o2net_register_callbacks(sc->sc_sock->sk, sc);
 
@@ -1821,12 +1808,7 @@ static int o2net_accept_one(struct socket *sock, int 
*more)
new_sock->sk->sk_allocation = GFP_ATOMIC;
 
tcp_sock_set_nodelay(new_sock->sk);
-
-   ret = o2net_set_usertimeout(new_sock);
-   if (ret) {
-   mlog(ML_ERROR, "set TCP_USER_TIMEOUT failed with %d\n", ret);
-   goto out;
-   }
+   tcp_sock_set_user_timeout(new_sock->sk, O2NET_TCP_USER_TIMEOUT);
 
ret = new_sock->ops->getname(new_sock, (struct sockaddr *) &sin, 1);
if (ret < 0)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 6aa4ae5ebf3d5..de682143efe4d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -501,5 +501,6 @@ void tcp_sock_set_cork(struct sock *sk, bool on);
 void tcp_sock_set_nodelay(struct sock *sk);
 void tcp_sock_set_quickack(struct sock *sk, int val);
 int tcp_sock_set_syncnt(struct sock *sk, int val);
+void tcp_sock_set_user_timeout(struct sock *sk, u32 val);
 
 #endif /* _LINUX_TCP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d2c67ae1da07a..0004bd9ae7b0a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2893,6 +2893,14 @@ int tcp_sock_set_syncnt(struct sock *sk, int val)
 }
 EXPORT_SYMBOL(tcp_sock_set_syncnt);
 
+void tcp_sock_set_user_timeout(struct sock *sk, u32 val)
+{
+   lock_sock(sk);
+   inet_csk(sk)->icsk_user_timeout = val;
+   release_sock(sk);
+}
+EXPORT_SYMBOL(tcp_sock_set_user_timeout);
+
 /*
  * Socket option code for TCP.
  */
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 399848c2bcb29..231fd6162f68d 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2115,8 +2115,7 @@ static void xs_tcp_set_socket_timeouts(struct rpc_xprt 
*xprt,
(char *)&keepcnt, sizeof(keepcnt));
 
/* TCP user timeout (see RFC5482) */
-   kernel_setsockopt(sock, SOL_TCP, TCP_USER_TIMEOUT,
-   (char *)&timeo, sizeof(timeo));
+   tcp_sock_set_user_timeout(sock->sk, timeo);
 }
 
 static void xs_tcp_set_connect_timeout(struct rpc_xprt *xprt,
-- 
2.26.2



[Cluster-devel] [PATCH 12/28] tcp: add tcp_sock_set_quickack

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the TCP_QUICKACK sockopt from kernel space
without going through a fake uaccess.  Clean up the callers to avoid
pointless wrappers now that this is a simple function call.

Signed-off-by: Christoph Hellwig 
---
 drivers/block/drbd/drbd_int.h  |  7 --
 drivers/block/drbd/drbd_receiver.c |  5 ++--
 include/linux/tcp.h|  1 +
 net/ipv4/tcp.c | 39 --
 4 files changed, 29 insertions(+), 23 deletions(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index e24bba87c8e02..14345a87c7cc5 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1570,13 +1570,6 @@ extern void drbd_set_recv_tcq(struct drbd_device 
*device, int tcq_enabled);
 extern void _drbd_clear_done_ee(struct drbd_device *device, struct list_head 
*to_be_freed);
 extern int drbd_connected(struct drbd_peer_device *);
 
-static inline void drbd_tcp_quickack(struct socket *sock)
-{
-   int val = 2;
-   (void) kernel_setsockopt(sock, SOL_TCP, TCP_QUICKACK,
-   (char*)&val, sizeof(val));
-}
-
 /* sets the number of 512 byte sectors of our virtual device */
 void drbd_set_my_capacity(struct drbd_device *device, sector_t size);
 
diff --git a/drivers/block/drbd/drbd_receiver.c 
b/drivers/block/drbd/drbd_receiver.c
index 20a5e94494acd..3a3f2b6a821f3 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -1223,7 +1223,7 @@ static int drbd_recv_header_maybe_unplug(struct 
drbd_connection *connection, str
 * quickly as possible, and let remote TCP know what we have
 * received so far. */
if (err == -EAGAIN) {
-   drbd_tcp_quickack(connection->data.socket);
+   tcp_sock_set_quickack(connection->data.socket->sk, 2);
drbd_unplug_all_devices(connection);
}
if (err > 0) {
@@ -4959,8 +4959,7 @@ static int receive_UnplugRemote(struct drbd_connection 
*connection, struct packe
 {
/* Make sure we've acked all the TCP data associated
 * with the data requests being unplugged */
-   drbd_tcp_quickack(connection->data.socket);
-
+   tcp_sock_set_quickack(connection->data.socket->sk, 2);
return 0;
 }
 
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 9e42c7fe50a8b..2eaf8320b9db0 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -499,5 +499,6 @@ int tcp_skb_shift(struct sk_buff *to, struct sk_buff *from, 
int pcount,
 
 void tcp_sock_set_cork(struct sock *sk, bool on);
 void tcp_sock_set_nodelay(struct sock *sk);
+void tcp_sock_set_quickack(struct sock *sk, int val);
 
 #endif /* _LINUX_TCP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a65f293a19fac..27b5e7a4e2ef9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2856,6 +2856,31 @@ void tcp_sock_set_nodelay(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_sock_set_nodelay);
 
+static void __tcp_sock_set_quickack(struct sock *sk, int val)
+{
+   if (!val) {
+   inet_csk_enter_pingpong_mode(sk);
+   return;
+   }
+
+   inet_csk_exit_pingpong_mode(sk);
+   if ((1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
+   inet_csk_ack_scheduled(sk)) {
+   inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_PUSHED;
+   tcp_cleanup_rbuf(sk, 1);
+   if (!(val & 1))
+   inet_csk_enter_pingpong_mode(sk);
+   }
+}
+
+void tcp_sock_set_quickack(struct sock *sk, int val)
+{
+   lock_sock(sk);
+   __tcp_sock_set_quickack(sk, val);
+   release_sock(sk);
+}
+EXPORT_SYMBOL(tcp_sock_set_quickack);
+
 /*
  * Socket option code for TCP.
  */
@@ -3096,19 +3121,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
break;
 
case TCP_QUICKACK:
-   if (!val) {
-   inet_csk_enter_pingpong_mode(sk);
-   } else {
-   inet_csk_exit_pingpong_mode(sk);
-   if ((1 << sk->sk_state) &
-   (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT) &&
-   inet_csk_ack_scheduled(sk)) {
-   icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
-   tcp_cleanup_rbuf(sk, 1);
-   if (!(val & 1))
-   inet_csk_enter_pingpong_mode(sk);
-   }
-   }
+   __tcp_sock_set_quickack(sk, val);
break;
 
 #ifdef CONFIG_TCP_MD5SIG
-- 
2.26.2



[Cluster-devel] [PATCH 09/28] net: add sock_set_reuseport

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the SO_REUSEPORT sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
---
 include/net/sock.h|  1 +
 net/core/sock.c   |  8 
 net/sunrpc/xprtsock.c | 17 +
 3 files changed, 10 insertions(+), 16 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index c997289aabbf9..d994daa418ec2 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2695,6 +2695,7 @@ void sock_set_keepalive(struct sock *sk);
 void sock_set_priority(struct sock *sk, u32 priority);
 void sock_set_rcvbuf(struct sock *sk, int val);
 void sock_set_reuseaddr(struct sock *sk);
+void sock_set_reuseport(struct sock *sk);
 void sock_set_sndtimeo(struct sock *sk, s64 secs);
 
 #endif /* _SOCK_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index 3c6ebf952e9ad..2ca3425b519c0 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -729,6 +729,14 @@ void sock_set_reuseaddr(struct sock *sk)
 }
 EXPORT_SYMBOL(sock_set_reuseaddr);
 
+void sock_set_reuseport(struct sock *sk)
+{
+   lock_sock(sk);
+   sk->sk_reuseport = true;
+   release_sock(sk);
+}
+EXPORT_SYMBOL(sock_set_reuseport);
+
 void sock_no_linger(struct sock *sk)
 {
lock_sock(sk);
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 30082cd039960..399848c2bcb29 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1594,21 +1594,6 @@ static int xs_get_random_port(void)
return rand + min;
 }
 
-/**
- * xs_set_reuseaddr_port - set the socket's port and address reuse options
- * @sock: socket
- *
- * Note that this function has to be called on all sockets that share the
- * same port, and it must be called before binding.
- */
-static void xs_sock_set_reuseport(struct socket *sock)
-{
-   int opt = 1;
-
-   kernel_setsockopt(sock, SOL_SOCKET, SO_REUSEPORT,
-   (char *)&opt, sizeof(opt));
-}
-
 static unsigned short xs_sock_getport(struct socket *sock)
 {
struct sockaddr_storage buf;
@@ -1801,7 +1786,7 @@ static struct socket *xs_create_sock(struct rpc_xprt 
*xprt,
xs_reclassify_socket(family, sock);
 
if (reuseport)
-   xs_sock_set_reuseport(sock);
+   sock_set_reuseport(sock->sk);
 
err = xs_bind(transport, sock);
if (err) {
-- 
2.26.2



[Cluster-devel] [PATCH 02/28] net: add sock_no_linger

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the SO_LINGER sockopt from kernel space
with onoff set to true and a linger time of 0 without going through a
fake uaccess.

Signed-off-by: Christoph Hellwig 
Acked-by: Sagi Grimberg 
---
 drivers/nvme/host/tcp.c   |  9 +
 drivers/nvme/target/tcp.c |  6 +-
 include/net/sock.h|  1 +
 net/core/sock.c   |  9 +
 net/rds/tcp.h |  1 -
 net/rds/tcp_connect.c |  2 +-
 net/rds/tcp_listen.c  | 13 +
 net/sunrpc/svcsock.c  | 12 ++--
 8 files changed, 16 insertions(+), 37 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index c15a92163c1f7..e72d87482eb78 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1313,7 +1313,6 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 {
struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
struct nvme_tcp_queue *queue = &ctrl->queues[qid];
-   struct linger sol = { .l_onoff = 1, .l_linger = 0 };
int ret, opt, rcv_pdu_size;
 
queue->ctrl = ctrl;
@@ -1361,13 +1360,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 * close. This is done to prevent stale data from being sent should
 * the network connection be restored before TCP times out.
 */
-   ret = kernel_setsockopt(queue->sock, SOL_SOCKET, SO_LINGER,
-   (char *)&sol, sizeof(sol));
-   if (ret) {
-   dev_err(nctrl->device,
-   "failed to set SO_LINGER sock opt %d\n", ret);
-   goto err_sock;
-   }
+   sock_no_linger(queue->sock->sk);
 
if (so_priority > 0) {
ret = kernel_setsockopt(queue->sock, SOL_SOCKET, SO_PRIORITY,
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index 40757a63f4553..e0801494b097f 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1429,7 +1429,6 @@ static int nvmet_tcp_set_queue_sock(struct 
nvmet_tcp_queue *queue)
 {
struct socket *sock = queue->sock;
struct inet_sock *inet = inet_sk(sock->sk);
-   struct linger sol = { .l_onoff = 1, .l_linger = 0 };
int ret;
 
ret = kernel_getsockname(sock,
@@ -1447,10 +1446,7 @@ static int nvmet_tcp_set_queue_sock(struct 
nvmet_tcp_queue *queue)
 * close. This is done to prevent stale data from being sent should
 * the network connection be restored before TCP times out.
 */
-   ret = kernel_setsockopt(sock, SOL_SOCKET, SO_LINGER,
-   (char *)&sol, sizeof(sol));
-   if (ret)
-   return ret;
+   sock_no_linger(sock->sk);
 
if (so_priority > 0) {
ret = kernel_setsockopt(sock, SOL_SOCKET, SO_PRIORITY,
diff --git a/include/net/sock.h b/include/net/sock.h
index 2ec085044790c..6ed00bf009bbe 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2688,6 +2688,7 @@ static inline bool sk_dev_equal_l3scope(struct sock *sk, 
int dif)
 
 void sock_def_readable(struct sock *sk);
 
+void sock_no_linger(struct sock *sk);
 void sock_set_reuseaddr(struct sock *sk);
 
 #endif /* _SOCK_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index 18eb84fdf5fbe..f0f09524911c8 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -720,6 +720,15 @@ void sock_set_reuseaddr(struct sock *sk)
 }
 EXPORT_SYMBOL(sock_set_reuseaddr);
 
+void sock_no_linger(struct sock *sk)
+{
+   lock_sock(sk);
+   sk->sk_lingertime = 0;
+   sock_set_flag(sk, SOCK_LINGER);
+   release_sock(sk);
+}
+EXPORT_SYMBOL(sock_no_linger);
+
 /*
  * This is meant for all protocols to use and covers goings on
  * at the socket level. Everything here is generic.
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index 3c69361d21c73..d640e210b97b6 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -73,7 +73,6 @@ void rds_tcp_listen_data_ready(struct sock *sk);
 int rds_tcp_accept_one(struct socket *sock);
 int rds_tcp_keepalive(struct socket *sock);
 void *rds_tcp_listen_sock_def_readable(struct net *net);
-void rds_tcp_set_linger(struct socket *sock);
 
 /* tcp_recv.c */
 int rds_tcp_recv_init(void);
diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c
index 008f50fb25dd2..4e64598176b05 100644
--- a/net/rds/tcp_connect.c
+++ b/net/rds/tcp_connect.c
@@ -207,7 +207,7 @@ void rds_tcp_conn_path_shutdown(struct rds_conn_path *cp)
 
if (sock) {
if (rds_destroy_pending(cp->cp_conn))
-   rds_tcp_set_linger(sock);
+   sock_no_linger(sock->sk);
sock->ops->shutdown(sock, RCV_SHUTDOWN | SEND_SHUTDOWN);
lock_sock(sock->sk);
rds_tcp_restore_callbacks(sock, tc); /* tc->tc_sock = NULL */
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 810a3a49e9474..bbb31b9c0b391 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c

[Cluster-devel] [PATCH 16/28] tcp: add tcp_sock_set_keepintvl

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the TCP_KEEPINTVL sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
---
 include/linux/tcp.h   |  1 +
 net/ipv4/tcp.c| 12 
 net/rds/tcp_listen.c  |  4 +---
 net/sunrpc/xprtsock.c |  3 +--
 4 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 5724dd84a85ed..1f9bada00faab 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -499,6 +499,7 @@ int tcp_skb_shift(struct sk_buff *to, struct sk_buff *from, 
int pcount,
 
 void tcp_sock_set_cork(struct sock *sk, bool on);
 int tcp_sock_set_keepidle(struct sock *sk, int val);
+int tcp_sock_set_keepintvl(struct sock *sk, int val);
 void tcp_sock_set_nodelay(struct sock *sk);
 void tcp_sock_set_quickack(struct sock *sk, int val);
 int tcp_sock_set_syncnt(struct sock *sk, int val);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bdf0ff9333514..7eb083e09786a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2934,6 +2934,18 @@ int tcp_sock_set_keepidle(struct sock *sk, int val)
 }
 EXPORT_SYMBOL(tcp_sock_set_keepidle);
 
+int tcp_sock_set_keepintvl(struct sock *sk, int val)
+{
+   if (val < 1 || val > MAX_TCP_KEEPINTVL)
+   return -EINVAL;
+
+   lock_sock(sk);
+   tcp_sk(sk)->keepalive_intvl = val * HZ;
+   release_sock(sk);
+   return 0;
+}
+EXPORT_SYMBOL(tcp_sock_set_keepintvl);
+
 /*
  * Socket option code for TCP.
  */
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 79f9adc008114..9ad555c48d15d 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -53,12 +53,10 @@ int rds_tcp_keepalive(struct socket *sock)
goto bail;
 
tcp_sock_set_keepidle(sock->sk, keepidle);
-
/* KEEPINTVL is the interval between successive probes. We follow
 * the model in xs_tcp_finish_connecting() and re-use keepidle.
 */
-   ret = kernel_setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL,
-   (char *)&keepidle, sizeof(keepidle));
+   tcp_sock_set_keepintvl(sock->sk, keepidle);
 bail:
return ret;
 }
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 473290f7c5c0a..5ca64e12af0c5 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2108,8 +2108,7 @@ static void xs_tcp_set_socket_timeouts(struct rpc_xprt 
*xprt,
/* TCP Keepalive options */
sock_set_keepalive(sock->sk);
tcp_sock_set_keepidle(sock->sk, keepidle);
-   kernel_setsockopt(sock, SOL_TCP, TCP_KEEPINTVL,
-   (char *)&keepidle, sizeof(keepidle));
+   tcp_sock_set_keepintvl(sock->sk, keepidle);
kernel_setsockopt(sock, SOL_TCP, TCP_KEEPCNT,
(char *)&keepcnt, sizeof(keepcnt));
 
-- 
2.26.2



[Cluster-devel] [PATCH 04/28] net: add sock_set_sndtimeo

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the SO_SNDTIMEO_NEW sockopt from kernel
space without going through a fake uaccess.  The interface is
simplified to only pass the seconds value, as that is the only
thing needed at the moment.

Signed-off-by: Christoph Hellwig 
---
 fs/dlm/lowcomms.c  |  8 ++--
 include/net/sock.h |  1 +
 net/core/sock.c| 11 +++
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index b801e77e3e596..b4d491122814b 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1035,7 +1035,6 @@ static void sctp_connect_to_sock(struct connection *con)
int result;
int addr_len;
struct socket *sock;
-   struct __kernel_sock_timeval tv = { .tv_sec = 5, .tv_usec = 0 };
 
if (con->nodeid == 0) {
log_print("attempt to connect sock 0 foiled");
@@ -1087,13 +1086,10 @@ static void sctp_connect_to_sock(struct connection *con)
 * since O_NONBLOCK argument in connect() function does not work here,
 * then, we should restore the default value of this attribute.
 */
-   kernel_setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO_NEW, (char *)&tv,
- sizeof(tv));
+   sock_set_sndtimeo(sock->sk, 5);
result = sock->ops->connect(sock, (struct sockaddr *)&daddr, addr_len,
   0);
-   memset(&tv, 0, sizeof(tv));
-   kernel_setsockopt(sock, SOL_SOCKET, SO_SNDTIMEO_NEW, (char *)&tv,
- sizeof(tv));
+   sock_set_sndtimeo(sock->sk, 0);
 
if (result == -EINPROGRESS)
result = 0;
diff --git a/include/net/sock.h b/include/net/sock.h
index a3a43141a4be2..9a7b9e98685ac 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2691,5 +2691,6 @@ void sock_def_readable(struct sock *sk);
 void sock_no_linger(struct sock *sk);
 void sock_set_priority(struct sock *sk, u32 priority);
 void sock_set_reuseaddr(struct sock *sk);
+void sock_set_sndtimeo(struct sock *sk, s64 secs);
 
 #endif /* _SOCK_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index ceda1a9248b3e..d3b1d61e4f768 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -737,6 +737,17 @@ void sock_set_priority(struct sock *sk, u32 priority)
 }
 EXPORT_SYMBOL(sock_set_priority);
 
+void sock_set_sndtimeo(struct sock *sk, s64 secs)
+{
+   lock_sock(sk);
+   if (secs && secs < MAX_SCHEDULE_TIMEOUT / HZ - 1)
+   sk->sk_sndtimeo = secs * HZ;
+   else
+   sk->sk_sndtimeo = MAX_SCHEDULE_TIMEOUT;
+   release_sock(sk);
+}
+EXPORT_SYMBOL(sock_set_sndtimeo);
+
 /*
  * This is meant for all protocols to use and covers goings on
  * at the socket level. Everything here is generic.
-- 
2.26.2



[Cluster-devel] remove most callers of kernel_setsockopt v3

2020-05-27 Thread Christoph Hellwig
Hi Dave,

this series removes most callers of the kernel_setsockopt functions, and
instead switches their users to small functions that implement setting a
sockopt directly using a normal kernel function call, with type safety and
all the other benefits of not having to fake a uaccess.

In some cases these functions seem pretty heavy handed as they do
a lock_sock even for just setting a single variable, but this mirrors
the real setsockopt implementation, unlike a few drivers that just
set the fields directly.


Changes since v2:
 - drop the separately merged kernel_getsockopt removal
 - drop the sctp patches, as there is conflicting cleanup going on
 - add an additional ACK for the rxrpc changes

Changes since v1:
 - use ->getname for sctp sockets in dlm
 - add a new ->bind_add struct proto method for dlm/sctp
 - switch the ipv6 and remaining sctp helpers to inline function so that
   the ipv6 and sctp modules are not pulled in by any module that could
   potentially use ipv6 or sctp connections
 - remove arguments to various sock_* helpers that are always used with
   the same constant arguments



[Cluster-devel] [PATCH 03/28] net: add sock_set_priority

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the SO_PRIORITY sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
Acked-by: Sagi Grimberg 
---
 drivers/nvme/host/tcp.c   | 12 ++--
 drivers/nvme/target/tcp.c | 18 --
 include/net/sock.h|  1 +
 net/core/sock.c   |  8 
 4 files changed, 15 insertions(+), 24 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index e72d87482eb78..a307972d33a02 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1362,16 +1362,8 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
 */
sock_no_linger(queue->sock->sk);
 
-   if (so_priority > 0) {
-   ret = kernel_setsockopt(queue->sock, SOL_SOCKET, SO_PRIORITY,
-   (char *)&so_priority, sizeof(so_priority));
-   if (ret) {
-   dev_err(ctrl->ctrl.device,
-   "failed to set SO_PRIORITY sock opt, ret %d\n",
-   ret);
-   goto err_sock;
-   }
-   }
+   if (so_priority > 0)
+   sock_set_priority(queue->sock->sk, so_priority);
 
/* Set socket type of service */
if (nctrl->opts->tos >= 0) {
diff --git a/drivers/nvme/target/tcp.c b/drivers/nvme/target/tcp.c
index e0801494b097f..f3088156d01da 100644
--- a/drivers/nvme/target/tcp.c
+++ b/drivers/nvme/target/tcp.c
@@ -1448,12 +1448,8 @@ static int nvmet_tcp_set_queue_sock(struct 
nvmet_tcp_queue *queue)
 */
sock_no_linger(sock->sk);
 
-   if (so_priority > 0) {
-   ret = kernel_setsockopt(sock, SOL_SOCKET, SO_PRIORITY,
-   (char *)&so_priority, sizeof(so_priority));
-   if (ret)
-   return ret;
-   }
+   if (so_priority > 0)
+   sock_set_priority(sock->sk, so_priority);
 
/* Set socket type of service */
if (inet->rcv_tos > 0) {
@@ -1638,14 +1634,8 @@ static int nvmet_tcp_add_port(struct nvmet_port *nport)
goto err_sock;
}
 
-   if (so_priority > 0) {
-   ret = kernel_setsockopt(port->sock, SOL_SOCKET, SO_PRIORITY,
-   (char *)&so_priority, sizeof(so_priority));
-   if (ret) {
-   pr_err("failed to set SO_PRIORITY sock opt %d\n", ret);
-   goto err_sock;
-   }
-   }
+   if (so_priority > 0)
+   sock_set_priority(port->sock->sk, so_priority);
 
ret = kernel_bind(port->sock, (struct sockaddr *)&port->addr,
sizeof(port->addr));
diff --git a/include/net/sock.h b/include/net/sock.h
index 6ed00bf009bbe..a3a43141a4be2 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2689,6 +2689,7 @@ static inline bool sk_dev_equal_l3scope(struct sock *sk, 
int dif)
 void sock_def_readable(struct sock *sk);
 
 void sock_no_linger(struct sock *sk);
+void sock_set_priority(struct sock *sk, u32 priority);
 void sock_set_reuseaddr(struct sock *sk);
 
 #endif /* _SOCK_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index f0f09524911c8..ceda1a9248b3e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -729,6 +729,14 @@ void sock_no_linger(struct sock *sk)
 }
 EXPORT_SYMBOL(sock_no_linger);
 
+void sock_set_priority(struct sock *sk, u32 priority)
+{
+   lock_sock(sk);
+   sk->sk_priority = priority;
+   release_sock(sk);
+}
+EXPORT_SYMBOL(sock_set_priority);
+
 /*
  * This is meant for all protocols to use and covers goings on
  * at the socket level. Everything here is generic.
-- 
2.26.2



[Cluster-devel] [PATCH 06/28] net: add sock_enable_timestamps

2020-05-27 Thread Christoph Hellwig
Add a helper to directly enable timestamps instead of setting the
SO_TIMESTAMP* sockopts from kernel space and going through a fake
uaccess.

Signed-off-by: Christoph Hellwig 
---
 include/net/sock.h   |  1 +
 net/core/sock.c  | 47 +---
 net/rxrpc/local_object.c |  8 +--
 3 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index cdec7bc055d5b..99ef43508d2b5 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2689,6 +2689,7 @@ static inline bool sk_dev_equal_l3scope(struct sock *sk, 
int dif)
 void sock_def_readable(struct sock *sk);
 
 int sock_bindtoindex(struct sock *sk, int ifindex);
+void sock_enable_timestamps(struct sock *sk);
 void sock_no_linger(struct sock *sk);
 void sock_set_priority(struct sock *sk, u32 priority);
 void sock_set_reuseaddr(struct sock *sk);
diff --git a/net/core/sock.c b/net/core/sock.c
index 23f80880fbb2c..e4a4dd2b3d8b3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -757,6 +757,28 @@ void sock_set_sndtimeo(struct sock *sk, s64 secs)
 }
 EXPORT_SYMBOL(sock_set_sndtimeo);
 
+static void __sock_set_timestamps(struct sock *sk, bool val, bool new, bool ns)
+{
+   if (val)  {
+   sock_valbool_flag(sk, SOCK_TSTAMP_NEW, new);
+   sock_valbool_flag(sk, SOCK_RCVTSTAMPNS, ns);
+   sock_set_flag(sk, SOCK_RCVTSTAMP);
+   sock_enable_timestamp(sk, SOCK_TIMESTAMP);
+   } else {
+   sock_reset_flag(sk, SOCK_RCVTSTAMP);
+   sock_reset_flag(sk, SOCK_RCVTSTAMPNS);
+   sock_reset_flag(sk, SOCK_TSTAMP_NEW);
+   }
+}
+
+void sock_enable_timestamps(struct sock *sk)
+{
+   lock_sock(sk);
+   __sock_set_timestamps(sk, true, false, true);
+   release_sock(sk);
+}
+EXPORT_SYMBOL(sock_enable_timestamps);
+
 /*
  * This is meant for all protocols to use and covers goings on
  * at the socket level. Everything here is generic.
@@ -948,28 +970,17 @@ int sock_setsockopt(struct socket *sock, int level, int 
optname,
break;
 
case SO_TIMESTAMP_OLD:
+   __sock_set_timestamps(sk, valbool, false, false);
+   break;
case SO_TIMESTAMP_NEW:
+   __sock_set_timestamps(sk, valbool, true, false);
+   break;
case SO_TIMESTAMPNS_OLD:
+   __sock_set_timestamps(sk, valbool, false, true);
+   break;
case SO_TIMESTAMPNS_NEW:
-   if (valbool)  {
-   if (optname == SO_TIMESTAMP_NEW || optname == 
SO_TIMESTAMPNS_NEW)
-   sock_set_flag(sk, SOCK_TSTAMP_NEW);
-   else
-   sock_reset_flag(sk, SOCK_TSTAMP_NEW);
-
-   if (optname == SO_TIMESTAMP_OLD || optname == 
SO_TIMESTAMP_NEW)
-   sock_reset_flag(sk, SOCK_RCVTSTAMPNS);
-   else
-   sock_set_flag(sk, SOCK_RCVTSTAMPNS);
-   sock_set_flag(sk, SOCK_RCVTSTAMP);
-   sock_enable_timestamp(sk, SOCK_TIMESTAMP);
-   } else {
-   sock_reset_flag(sk, SOCK_RCVTSTAMP);
-   sock_reset_flag(sk, SOCK_RCVTSTAMPNS);
-   sock_reset_flag(sk, SOCK_TSTAMP_NEW);
-   }
+   __sock_set_timestamps(sk, valbool, true, true);
break;
-
case SO_TIMESTAMPING_NEW:
sock_set_flag(sk, SOCK_TSTAMP_NEW);
/* fall through */
diff --git a/net/rxrpc/local_object.c b/net/rxrpc/local_object.c
index 01135e54d95d2..5ea2bd01fdd59 100644
--- a/net/rxrpc/local_object.c
+++ b/net/rxrpc/local_object.c
@@ -189,13 +189,7 @@ static int rxrpc_open_socket(struct rxrpc_local *local, 
struct net *net)
}
 
/* We want receive timestamps. */
-   opt = 1;
-   ret = kernel_setsockopt(local->socket, SOL_SOCKET, 
SO_TIMESTAMPNS_OLD,
-   (char *)&opt, sizeof(opt));
-   if (ret < 0) {
-   _debug("setsockopt failed");
-   goto error;
-   }
+   sock_enable_timestamps(local->socket->sk);
break;
 
default:
-- 
2.26.2



[Cluster-devel] [PATCH 07/28] net: add sock_set_keepalive

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the SO_KEEPALIVE sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
---
 fs/dlm/lowcomms.c |  6 +-
 include/net/sock.h|  1 +
 net/core/sock.c   | 10 ++
 net/rds/tcp_listen.c  |  6 +-
 net/sunrpc/xprtsock.c |  4 +---
 5 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index b4d491122814b..138009c6a2ee1 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1259,11 +1259,7 @@ static struct socket *tcp_create_listen_sock(struct 
connection *con,
con->sock = NULL;
goto create_out;
}
-   result = kernel_setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE,
-(char *)&one, sizeof(one));
-   if (result < 0) {
-   log_print("Set keepalive failed: %d", result);
-   }
+   sock_set_keepalive(sock->sk);
 
result = sock->ops->listen(sock, 5);
if (result < 0) {
diff --git a/include/net/sock.h b/include/net/sock.h
index 99ef43508d2b5..dc08c176238fd 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2691,6 +2691,7 @@ void sock_def_readable(struct sock *sk);
 int sock_bindtoindex(struct sock *sk, int ifindex);
 void sock_enable_timestamps(struct sock *sk);
 void sock_no_linger(struct sock *sk);
+void sock_set_keepalive(struct sock *sk);
 void sock_set_priority(struct sock *sk, u32 priority);
 void sock_set_reuseaddr(struct sock *sk);
 void sock_set_sndtimeo(struct sock *sk, s64 secs);
diff --git a/net/core/sock.c b/net/core/sock.c
index e4a4dd2b3d8b3..728f5fb156a0c 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -779,6 +779,16 @@ void sock_enable_timestamps(struct sock *sk)
 }
 EXPORT_SYMBOL(sock_enable_timestamps);
 
+void sock_set_keepalive(struct sock *sk)
+{
+   lock_sock(sk);
+   if (sk->sk_prot->keepalive)
+   sk->sk_prot->keepalive(sk, true);
+   sock_valbool_flag(sk, SOCK_KEEPOPEN, true);
+   release_sock(sk);
+}
+EXPORT_SYMBOL(sock_set_keepalive);
+
 /*
  * This is meant for all protocols to use and covers goings on
  * at the socket level. Everything here is generic.
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index bbb31b9c0b391..d8bd132769594 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -43,13 +43,9 @@ int rds_tcp_keepalive(struct socket *sock)
/* values below based on xs_udp_default_timeout */
int keepidle = 5; /* send a probe 'keepidle' secs after last data */
int keepcnt = 5; /* number of unack'ed probes before declaring dead */
-   int keepalive = 1;
int ret = 0;
 
-   ret = kernel_setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE,
-   (char *)&keepalive, sizeof(keepalive));
-   if (ret < 0)
-   goto bail;
+   sock_set_keepalive(sock->sk);
 
ret = kernel_setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT,
(char *)&keepcnt, sizeof(keepcnt));
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 845d0be805ece..30082cd039960 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2110,7 +2110,6 @@ static void xs_tcp_set_socket_timeouts(struct rpc_xprt 
*xprt,
struct sock_xprt *transport = container_of(xprt, struct sock_xprt, 
xprt);
unsigned int keepidle;
unsigned int keepcnt;
-   unsigned int opt_on = 1;
unsigned int timeo;
 
spin_lock(&xprt->transport_lock);
@@ -2122,8 +2121,7 @@ static void xs_tcp_set_socket_timeouts(struct rpc_xprt 
*xprt,
spin_unlock(&xprt->transport_lock);
 
/* TCP Keepalive options */
-   kernel_setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE,
-   (char *)&opt_on, sizeof(opt_on));
+   sock_set_keepalive(sock->sk);
kernel_setsockopt(sock, SOL_TCP, TCP_KEEPIDLE,
(char *)&keepidle, sizeof(keepidle));
kernel_setsockopt(sock, SOL_TCP, TCP_KEEPINTVL,
-- 
2.26.2



[Cluster-devel] [PATCH 13/28] tcp: add tcp_sock_set_syncnt

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the TCP_SYNCNT sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
Acked-by: Sagi Grimberg 
---
 drivers/nvme/host/tcp.c |  9 +
 include/linux/tcp.h |  1 +
 net/ipv4/tcp.c  | 12 
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 4e4a750ecdb97..2872584f52f63 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1336,14 +1336,7 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl,
}
 
/* Single syn retry */
-   opt = 1;
-   ret = kernel_setsockopt(queue->sock, IPPROTO_TCP, TCP_SYNCNT,
-   (char *)&opt, sizeof(opt));
-   if (ret) {
-   dev_err(nctrl->device,
-   "failed to set TCP_SYNCNT sock opt %d\n", ret);
-   goto err_sock;
-   }
+   tcp_sock_set_syncnt(queue->sock->sk, 1);
 
/* Set TCP no delay */
tcp_sock_set_nodelay(queue->sock->sk);
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 2eaf8320b9db0..6aa4ae5ebf3d5 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -500,5 +500,6 @@ int tcp_skb_shift(struct sk_buff *to, struct sk_buff *from, 
int pcount,
 void tcp_sock_set_cork(struct sock *sk, bool on);
 void tcp_sock_set_nodelay(struct sock *sk);
 void tcp_sock_set_quickack(struct sock *sk, int val);
+int tcp_sock_set_syncnt(struct sock *sk, int val);
 
 #endif /* _LINUX_TCP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 27b5e7a4e2ef9..d2c67ae1da07a 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2881,6 +2881,18 @@ void tcp_sock_set_quickack(struct sock *sk, int val)
 }
 EXPORT_SYMBOL(tcp_sock_set_quickack);
 
+int tcp_sock_set_syncnt(struct sock *sk, int val)
+{
+   if (val < 1 || val > MAX_TCP_SYNCNT)
+   return -EINVAL;
+
+   lock_sock(sk);
+   inet_csk(sk)->icsk_syn_retries = val;
+   release_sock(sk);
+   return 0;
+}
+EXPORT_SYMBOL(tcp_sock_set_syncnt);
+
 /*
  * Socket option code for TCP.
  */
-- 
2.26.2



[Cluster-devel] [PATCH 08/28] net: add sock_set_rcvbuf

2020-05-27 Thread Christoph Hellwig
Add a helper to directly set the SO_RCVBUFFORCE sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
---
 fs/dlm/lowcomms.c  |  7 +-
 include/net/sock.h |  1 +
 net/core/sock.c| 59 +-
 3 files changed, 34 insertions(+), 33 deletions(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 138009c6a2ee1..45c37f572c9d2 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -1297,7 +1297,6 @@ static int sctp_listen_for_all(void)
struct socket *sock = NULL;
int result = -EINVAL;
struct connection *con = nodeid2con(0, GFP_NOFS);
-   int bufsize = NEEDED_RMEM;
int one = 1;
 
if (!con)
@@ -1312,11 +1311,7 @@ static int sctp_listen_for_all(void)
goto out;
}
 
-   result = kernel_setsockopt(sock, SOL_SOCKET, SO_RCVBUFFORCE,
-(char *)&bufsize, sizeof(bufsize));
-   if (result)
-   log_print("Error increasing buffer space on socket %d", result);
-
+   sock_set_rcvbuf(sock->sk, NEEDED_RMEM);
result = kernel_setsockopt(sock, SOL_SCTP, SCTP_NODELAY, (char *)&one,
   sizeof(one));
if (result < 0)
diff --git a/include/net/sock.h b/include/net/sock.h
index dc08c176238fd..c997289aabbf9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2693,6 +2693,7 @@ void sock_enable_timestamps(struct sock *sk);
 void sock_no_linger(struct sock *sk);
 void sock_set_keepalive(struct sock *sk);
 void sock_set_priority(struct sock *sk, u32 priority);
+void sock_set_rcvbuf(struct sock *sk, int val);
 void sock_set_reuseaddr(struct sock *sk);
 void sock_set_sndtimeo(struct sock *sk, s64 secs);
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 728f5fb156a0c..3c6ebf952e9ad 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -789,6 +789,35 @@ void sock_set_keepalive(struct sock *sk)
 }
 EXPORT_SYMBOL(sock_set_keepalive);
 
+static void __sock_set_rcvbuf(struct sock *sk, int val)
+{
+   /* Ensure val * 2 fits into an int, to prevent max_t() from treating it
+* as a negative value.
+*/
+   val = min_t(int, val, INT_MAX / 2);
+   sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+
+   /* We double it on the way in to account for "struct sk_buff" etc.
+* overhead.   Applications assume that the SO_RCVBUF setting they make
+* will allow that much actual data to be received on that socket.
+*
+* Applications are unaware that "struct sk_buff" and other overheads
+* allocate from the receive buffer during socket buffer allocation.
+*
+* And after considering the possible alternatives, returning the value
+* we actually used in getsockopt is the most desirable behavior.
+*/
+   WRITE_ONCE(sk->sk_rcvbuf, max_t(int, val * 2, SOCK_MIN_RCVBUF));
+}
+
+void sock_set_rcvbuf(struct sock *sk, int val)
+{
+   lock_sock(sk);
+   __sock_set_rcvbuf(sk, val);
+   release_sock(sk);
+}
+EXPORT_SYMBOL(sock_set_rcvbuf);
+
 /*
  * This is meant for all protocols to use and covers goings on
  * at the socket level. Everything here is generic.
@@ -885,30 +914,7 @@ int sock_setsockopt(struct socket *sock, int level, int 
optname,
 * play 'guess the biggest size' games. RCVBUF/SNDBUF
 * are treated in BSD as hints
 */
-   val = min_t(u32, val, sysctl_rmem_max);
-set_rcvbuf:
-   /* Ensure val * 2 fits into an int, to prevent max_t()
-* from treating it as a negative value.
-*/
-   val = min_t(int, val, INT_MAX / 2);
-   sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
-   /*
-* We double it on the way in to account for
-* "struct sk_buff" etc. overhead.   Applications
-* assume that the SO_RCVBUF setting they make will
-* allow that much actual data to be received on that
-* socket.
-*
-* Applications are unaware that "struct sk_buff" and
-* other overheads allocate from the receive buffer
-* during socket buffer allocation.
-*
-* And after considering the possible alternatives,
-* returning the value we actually used in getsockopt
-* is the most desirable behavior.
-*/
-   WRITE_ONCE(sk->sk_rcvbuf,
-  max_t(int, val * 2, SOCK_MIN_RCVBUF));
+   __sock_set_rcvbuf(sk, min_t(u32, val, sysctl_rmem_max));
break;
 
case SO_RCVBUFFORCE:
@@ -920,9 +926,8 @@ int sock_setsockopt(struct socket *sock, int level, int 
optname,
/* No ne

Re: [Cluster-devel] [PATCH 4/4] net: remove kernel_setsockopt

2020-05-29 Thread 'Christoph Hellwig'
On Fri, May 29, 2020 at 12:27:12PM +, David Laight wrote:
> From: Christoph Hellwig
> > Sent: 29 May 2020 13:10
> > 
> > No users left.
> 
> There is no point even proposing this until all the changes to remove
> its use have made it at least as far into 'net-next' and probably 'net'.

If you look at net-next, the only two users left are the two removed in
this series.



[Cluster-devel] [PATCH 1/4] sctp: add sctp_sock_set_nodelay

2020-05-29 Thread Christoph Hellwig
Add a helper to directly set the SCTP_NODELAY sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig 
---
 fs/dlm/lowcomms.c   | 10 ++
 include/net/sctp/sctp.h |  7 +++
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 69333728d871b..9f1c3cdc9d653 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -914,7 +914,6 @@ static int sctp_bind_addrs(struct connection *con, uint16_t 
port)
 static void sctp_connect_to_sock(struct connection *con)
 {
struct sockaddr_storage daddr;
-   int one = 1;
int result;
int addr_len;
struct socket *sock;
@@ -961,8 +960,7 @@ static void sctp_connect_to_sock(struct connection *con)
log_print("connecting to %d", con->nodeid);
 
/* Turn off Nagle's algorithm */
-   kernel_setsockopt(sock, SOL_SCTP, SCTP_NODELAY, (char *)&one,
- sizeof(one));
+   sctp_sock_set_nodelay(sock->sk);
 
/*
 * Make sock->ops->connect() function return in specified time,
@@ -1176,7 +1174,6 @@ static int sctp_listen_for_all(void)
struct socket *sock = NULL;
int result = -EINVAL;
struct connection *con = nodeid2con(0, GFP_NOFS);
-   int one = 1;
 
if (!con)
return -ENOMEM;
@@ -1191,10 +1188,7 @@ static int sctp_listen_for_all(void)
}
 
sock_set_rcvbuf(sock->sk, NEEDED_RMEM);
-   result = kernel_setsockopt(sock, SOL_SCTP, SCTP_NODELAY, (char *)&one,
-  sizeof(one));
-   if (result < 0)
-   log_print("Could not set SCTP NODELAY error %d\n", result);
+   sctp_sock_set_nodelay(sock->sk);
 
write_lock_bh(&sock->sk->sk_callback_lock);
/* Init con struct */
diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
index 3ab5c6bbb90bd..f8bcb75bb0448 100644
--- a/include/net/sctp/sctp.h
+++ b/include/net/sctp/sctp.h
@@ -615,4 +615,11 @@ static inline bool sctp_newsk_ready(const struct sock *sk)
return sock_flag(sk, SOCK_DEAD) || sk->sk_socket;
 }
 
+static inline void sctp_sock_set_nodelay(struct sock *sk)
+{
+   lock_sock(sk);
+   sctp_sk(sk)->nodelay = true;
+   release_sock(sk);
+}
+
 #endif /* __net_sctp_h__ */
-- 
2.26.2



[Cluster-devel] [PATCH 2/4] sctp: refactor sctp_setsockopt_bindx

2020-05-29 Thread Christoph Hellwig
Split out a sctp_setsockopt_bindx_kernel that takes a kernel pointer
to the sockaddr and make sctp_setsockopt_bindx a small wrapper around
it.  This prepares for adding a new bind_add proto op.

Signed-off-by: Christoph Hellwig 
---
 net/sctp/socket.c | 61 ++-
 1 file changed, 28 insertions(+), 33 deletions(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 827a9903ee288..6e745ac3c4a59 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -972,23 +972,22 @@ int sctp_asconf_mgmt(struct sctp_sock *sp, struct 
sctp_sockaddr_entry *addrw)
  * it.
  *
  * skThe sk of the socket
- * addrs The pointer to the addresses in user land
+ * addrs The pointer to the addresses
  * addrssize Size of the addrs buffer
  * opOperation to perform (add or remove, see the flags of
  *   sctp_bindx)
  *
  * Returns 0 if ok, <0 errno code on error.
  */
-static int sctp_setsockopt_bindx(struct sock *sk,
-struct sockaddr __user *addrs,
-int addrs_size, int op)
+static int sctp_setsockopt_bindx_kernel(struct sock *sk,
+   struct sockaddr *addrs, int addrs_size,
+   int op)
 {
-   struct sockaddr *kaddrs;
int err;
int addrcnt = 0;
int walk_size = 0;
struct sockaddr *sa_addr;
-   void *addr_buf;
+   void *addr_buf = addrs;
struct sctp_af *af;
 
pr_debug("%s: sk:%p addrs:%p addrs_size:%d opt:%d\n",
@@ -997,17 +996,10 @@ static int sctp_setsockopt_bindx(struct sock *sk,
if (unlikely(addrs_size <= 0))
return -EINVAL;
 
-   kaddrs = memdup_user(addrs, addrs_size);
-   if (IS_ERR(kaddrs))
-   return PTR_ERR(kaddrs);
-
/* Walk through the addrs buffer and count the number of addresses. */
-   addr_buf = kaddrs;
while (walk_size < addrs_size) {
-   if (walk_size + sizeof(sa_family_t) > addrs_size) {
-   kfree(kaddrs);
+   if (walk_size + sizeof(sa_family_t) > addrs_size)
return -EINVAL;
-   }
 
sa_addr = addr_buf;
af = sctp_get_af_specific(sa_addr->sa_family);
@@ -1015,10 +1007,8 @@ static int sctp_setsockopt_bindx(struct sock *sk,
/* If the address family is not supported or if this address
 * causes the address buffer to overflow return EINVAL.
 */
-   if (!af || (walk_size + af->sockaddr_len) > addrs_size) {
-   kfree(kaddrs);
+   if (!af || (walk_size + af->sockaddr_len) > addrs_size)
return -EINVAL;
-   }
addrcnt++;
addr_buf += af->sockaddr_len;
walk_size += af->sockaddr_len;
@@ -1029,31 +1019,36 @@ static int sctp_setsockopt_bindx(struct sock *sk,
case SCTP_BINDX_ADD_ADDR:
/* Allow security module to validate bindx addresses. */
err = security_sctp_bind_connect(sk, SCTP_SOCKOPT_BINDX_ADD,
-(struct sockaddr *)kaddrs,
-addrs_size);
+addrs, addrs_size);
if (err)
-   goto out;
-   err = sctp_bindx_add(sk, kaddrs, addrcnt);
+   return err;
+   err = sctp_bindx_add(sk, addrs, addrcnt);
if (err)
-   goto out;
-   err = sctp_send_asconf_add_ip(sk, kaddrs, addrcnt);
-   break;
-
+   return err;
+   return sctp_send_asconf_add_ip(sk, addrs, addrcnt);
case SCTP_BINDX_REM_ADDR:
-   err = sctp_bindx_rem(sk, kaddrs, addrcnt);
+   err = sctp_bindx_rem(sk, addrs, addrcnt);
if (err)
-   goto out;
-   err = sctp_send_asconf_del_ip(sk, kaddrs, addrcnt);
-   break;
+   return err;
+   return sctp_send_asconf_del_ip(sk, addrs, addrcnt);
 
default:
-   err = -EINVAL;
-   break;
+   return -EINVAL;
}
+}
 
-out:
-   kfree(kaddrs);
+static int sctp_setsockopt_bindx(struct sock *sk,
+struct sockaddr __user *addrs,
+int addrs_size, int op)
+{
+   struct sockaddr *kaddrs;
+   int err;
 
+   kaddrs = memdup_user(addrs, addrs_size);
+   if (IS_ERR(kaddrs))
+   return PTR_ERR(kaddrs);
+   err = sctp_setsockopt_bindx_kernel(sk, kaddrs, addrs_size, op);
+   kfree(kaddrs);
return err;
 }
 
-- 
2.26.2



[Cluster-devel] [PATCH 4/4] net: remove kernel_setsockopt

2020-05-29 Thread Christoph Hellwig
No users left.

Signed-off-by: Christoph Hellwig 
---
 include/linux/net.h |  2 --
 net/socket.c| 31 ---
 2 files changed, 33 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 74ef5d7315f70..e10f378194a59 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -303,8 +303,6 @@ int kernel_connect(struct socket *sock, struct sockaddr 
*addr, int addrlen,
   int flags);
 int kernel_getsockname(struct socket *sock, struct sockaddr *addr);
 int kernel_getpeername(struct socket *sock, struct sockaddr *addr);
-int kernel_setsockopt(struct socket *sock, int level, int optname, char 
*optval,
- unsigned int optlen);
 int kernel_sendpage(struct socket *sock, struct page *page, int offset,
size_t size, int flags);
 int kernel_sendpage_locked(struct sock *sk, struct page *page, int offset,
diff --git a/net/socket.c b/net/socket.c
index 81a98b6cbd087..976426d03f099 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -3624,37 +3624,6 @@ int kernel_getpeername(struct socket *sock, struct 
sockaddr *addr)
 }
 EXPORT_SYMBOL(kernel_getpeername);
 
-/**
- * kernel_setsockopt - set a socket option (kernel space)
- * @sock: socket
- * @level: API level (SOL_SOCKET, ...)
- * @optname: option tag
- * @optval: option value
- * @optlen: option length
- *
- * Returns 0 or an error.
- */
-
-int kernel_setsockopt(struct socket *sock, int level, int optname,
-   char *optval, unsigned int optlen)
-{
-   mm_segment_t oldfs = get_fs();
-   char __user *uoptval;
-   int err;
-
-   uoptval = (char __user __force *) optval;
-
-   set_fs(KERNEL_DS);
-   if (level == SOL_SOCKET)
-   err = sock_setsockopt(sock, level, optname, uoptval, optlen);
-   else
-   err = sock->ops->setsockopt(sock, level, optname, uoptval,
-   optlen);
-   set_fs(oldfs);
-   return err;
-}
-EXPORT_SYMBOL(kernel_setsockopt);
-
 /**
  * kernel_sendpage - send a &page through a socket (kernel space)
  * @sock: socket
-- 
2.26.2



[Cluster-devel] [PATCH 3/4] net: add a new bind_add method

2020-05-29 Thread Christoph Hellwig
The SCTP protocol allows binding multiple addresses to a socket.  That
feature is currently only exposed as a socket option.  Add a bind_add
method to struct proto that allows binding additional addresses, and
switch the dlm code to use the new method instead of going through the
socket option from kernel space.

Signed-off-by: Christoph Hellwig 
---
 fs/dlm/lowcomms.c  |  9 +++--
 include/net/sock.h |  6 +-
 net/core/sock.c|  8 
 net/sctp/socket.c  | 14 ++
 4 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 9f1c3cdc9d653..3543a8fec9075 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -882,6 +882,7 @@ static void writequeue_entry_complete(struct 
writequeue_entry *e, int completed)
 static int sctp_bind_addrs(struct connection *con, uint16_t port)
 {
struct sockaddr_storage localaddr;
+   struct sockaddr *addr = (struct sockaddr *)&localaddr;
int i, addr_len, result = 0;
 
for (i = 0; i < dlm_local_count; i++) {
@@ -889,13 +890,9 @@ static int sctp_bind_addrs(struct connection *con, 
uint16_t port)
make_sockaddr(&localaddr, port, &addr_len);
 
if (!i)
-   result = kernel_bind(con->sock,
-(struct sockaddr *)&localaddr,
-addr_len);
+   result = kernel_bind(con->sock, addr, addr_len);
else
-   result = kernel_setsockopt(con->sock, SOL_SCTP,
-  SCTP_SOCKOPT_BINDX_ADD,
-  (char *)&localaddr, 
addr_len);
+   result = sock_bind_add(con->sock->sk, addr, addr_len);
 
if (result < 0) {
log_print("Can't bind to %d addr number %d, %d.\n",
diff --git a/include/net/sock.h b/include/net/sock.h
index d994daa418ec2..6e9f713a78607 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1156,7 +1156,9 @@ struct proto {
int (*sendpage)(struct sock *sk, struct page *page,
int offset, size_t size, int flags);
int (*bind)(struct sock *sk,
-   struct sockaddr *uaddr, int addr_len);
+   struct sockaddr *addr, int addr_len);
+   int (*bind_add)(struct sock *sk,
+   struct sockaddr *addr, int addr_len);
 
int (*backlog_rcv) (struct sock *sk,
struct sk_buff *skb);
@@ -2698,4 +2700,6 @@ void sock_set_reuseaddr(struct sock *sk);
 void sock_set_reuseport(struct sock *sk);
 void sock_set_sndtimeo(struct sock *sk, s64 secs);
 
+int sock_bind_add(struct sock *sk, struct sockaddr *addr, int addr_len);
+
 #endif /* _SOCK_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index 2ca3425b519c0..61ec573221a60 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3712,3 +3712,11 @@ bool sk_busy_loop_end(void *p, unsigned long start_time)
 }
 EXPORT_SYMBOL(sk_busy_loop_end);
 #endif /* CONFIG_NET_RX_BUSY_POLL */
+
+int sock_bind_add(struct sock *sk, struct sockaddr *addr, int addr_len)
+{
+   if (!sk->sk_prot->bind_add)
+   return -EOPNOTSUPP;
+   return sk->sk_prot->bind_add(sk, addr, addr_len);
+}
+EXPORT_SYMBOL(sock_bind_add);
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 6e745ac3c4a59..d57e1a002ffc8 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1052,6 +1052,18 @@ static int sctp_setsockopt_bindx(struct sock *sk,
return err;
 }
 
+static int sctp_bind_add(struct sock *sk, struct sockaddr *addrs,
+   int addrlen)
+{
+   int err;
+
+   lock_sock(sk);
+   err = sctp_setsockopt_bindx_kernel(sk, addrs, addrlen,
+  SCTP_BINDX_ADD_ADDR);
+   release_sock(sk);
+   return err;
+}
+
 static int sctp_connect_new_asoc(struct sctp_endpoint *ep,
 const union sctp_addr *daddr,
 const struct sctp_initmsg *init,
@@ -9620,6 +9632,7 @@ struct proto sctp_prot = {
.sendmsg =  sctp_sendmsg,
.recvmsg =  sctp_recvmsg,
.bind=  sctp_bind,
+   .bind_add=  sctp_bind_add,
.backlog_rcv =  sctp_backlog_rcv,
.hash=  sctp_hash,
.unhash  =  sctp_unhash,
@@ -9662,6 +9675,7 @@ struct proto sctpv6_prot = {
.sendmsg= sctp_sendmsg,
.recvmsg= sctp_recvmsg,
.bind   = sctp_bind,
+   .bind_add   = sctp_bind_add,
.backlog_rcv= sctp_backlog_rcv,
.hash   = sctp_hash,
.unhash = sctp_unhash,
-- 
2.26.2



[Cluster-devel] remove kernel_setsockopt v4

2020-05-29 Thread Christoph Hellwig
Hi Dave and Marcelo,

now that only the dlm calls into sctp are left for kernel_setsockopt,
and we haven't really made much progress with the sctp setsockopt
refactoring, how about this small series?  It splits out a
sctp_setsockopt_bindx_kernel that takes a kernel-space address array,
sharing more code as requested by Marcelo.  This should fit in with
whatever variant of the sctp setsockopt refactor we go with, and
solves the immediate problem for now.

Changes since v3:
 - dropped all the merged patches, just sctp setsockopt left now
 - factor out a new sctp_setsockopt_bindx_kernel helper instead of
   duplicating a small amount of logic

Changes since v2:
 - drop the separately merged kernel_getsockopt removal
 - drop the sctp patches, as there is conflicting cleanup going on
 - add an additional ACK for the rxrpc changes

Changes since v1:
 - use ->getname for sctp sockets in dlm
 - add a new ->bind_add struct proto method for dlm/sctp
 - switch the ipv6 and remaining sctp helpers to inline functions so that
   the ipv6 and sctp modules are not pulled in by any module that could
   potentially use ipv6 or sctp connections
 - remove arguments to various sock_* helpers that are always used with
   the same constant arguments



[Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-01 Thread Christoph Hellwig
On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to indicate
> that if the page invalidation fails, return back control to the
> filesystem so it may fallback to buffered mode.
> 
> Reviewed-by: Darrick J. Wong 
> Signed-off-by: Goldwyn Rodrigues 

I'd like to start a discussion of whether this shouldn't really be the
default behavior.  If we have page cache that can't be invalidated it
actually makes a whole lot of sense to not do direct I/O, avoid the
warnings, etc.

Adding all the relevant lists.

> ---
>  fs/iomap/direct-io.c  |  8 +++-
>  include/linux/iomap.h | 14 ++
>  2 files changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index fd22bff61569..2459c76e41ab 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -484,8 +484,14 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>*/
>   ret = invalidate_inode_pages2_range(mapping,
>   pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
> - if (ret)
> + if (ret) {
> + if (dio_flags & IOMAP_DIO_RWF_NO_STALE_PAGECACHE) {
> + if (ret == -EBUSY)
> + ret = 0;
> + goto out_free_dio;
> + }
>   dio_warn_stale_pagecache(iocb->ki_filp);
> + }
>   ret = 0;
>  
>   if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 8a4ba1635202..2ebb8a298cd8 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -262,7 +262,21 @@ struct iomap_dio_ops {
>  /*
>   * Wait for completion of DIO
>   */
> +
>  #define IOMAP_DIO_RWF_SYNCIO (1 << 0)
> +/*
> + * Direct IO will attempt to keep the page cache coherent by
> + * invalidating the inode's page cache over the range of the DIO.
> + * That can fail if something else is actively using the page cache.
> + * If this happens and the DIO continues, the data in the page
> + * cache will become stale.
> + *
> + * Set this flag if you want the DIO to abort without issuing any IO
> + * or error if it fails to invalidate the page cache successfully.
> + * This allows the IO submitter to fallback to buffered IO to resubmit
> + * IO
> + */
> +#define IOMAP_DIO_RWF_NO_STALE_PAGECACHE (1 << 1)
>  
>  ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
> -- 
> 2.26.2
---end quoted text---



Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-07 Thread Christoph Hellwig
On Tue, Jul 07, 2020 at 01:57:05PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 07, 2020 at 07:43:46AM -0500, Goldwyn Rodrigues wrote:
> > On  9:53 01/07, Christoph Hellwig wrote:
> > > On Mon, Jun 29, 2020 at 02:23:49PM -0500, Goldwyn Rodrigues wrote:
> > > > From: Goldwyn Rodrigues 
> > > > 
> > > > For direct I/O, add the flag IOMAP_DIO_RWF_NO_STALE_PAGECACHE to 
> > > > indicate
> > > > that if the page invalidation fails, return back control to the
> > > > filesystem so it may fallback to buffered mode.
> > > > 
> > > > Reviewed-by: Darrick J. Wong 
> > > > Signed-off-by: Goldwyn Rodrigues 
> > > 
> > > I'd like to start a discussion of whether this shouldn't really be the
> > > default behavior.  If we have page cache that can't be invalidated it
> > > actually makes a whole lot of sense to not do direct I/O, avoid the
> > > warnings, etc.
> > > 
> > > Adding all the relevant lists.
> > 
> > Since no one responded so far, let me see if I can stir the cauldron :)
> > 
> > What error should be returned in case of such an error? I think the
> 
> Christoph's message is ambiguous.  I don't know if he means "fail the
> I/O with an error" or "satisfy the I/O through the page cache".  I'm
> strongly in favour of the latter.

Same here.  Sorry if my previous mail was unclear.

> Indeed, I'm in favour of not invalidating
> the page cache at all for direct I/O.  For reads, I think the page cache
> should be used to satisfy any portion of the read which is currently
> cached.  For writes, I think we should write into the page cache pages
> which currently exist, and then force those pages to be written back,
> but left in cache.

Something like that, yes.



Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-08 Thread Christoph Hellwig
On Wed, Jul 08, 2020 at 02:54:37PM +0100, Matthew Wilcox wrote:
> Direct I/O isn't deterministic though.  If the file isn't shared, then
> it works great, but as soon as you get mixed buffered and direct I/O,
> everything is already terrible.  Direct I/Os perform pagecache lookups
> already, but instead of using the data that we found in the cache, we
> (if it's dirty) write it back, wait for the write to complete, remove
> the page from the pagecache and then perform another I/O to get the data
> that we just wrote out!  And then the app that's using buffered I/O has
> to read it back in again.

Mostly agreed.  That being said I suspect invalidating clean cache
might still be a good idea.  The original idea was mostly on how
to deal with invalidation failures of any kind, but falling back for
any kind of dirty cache also makes at least some sense.

> I have had an objection raised off-list.  In a scenario with a block
> device shared between two systems and an application which does direct
> I/O, everything is normally fine.  If one of the systems uses tar to
> back up the contents of the block device then the application on that
> system will no longer see the writes from the other system because
> there's nothing to invalidate the pagecache on the first system.

Err, WTF?  If someone accesses shared block devices with random
applications all bets are off anyway.

> 
> Unfortunately, this is in direct conflict with the performance
> problem caused by some little arsewipe deciding to do:
> 
> $ while true; do dd if=/lib/x86_64-linux-gnu/libc-2.30.so iflag=direct 
> of=/dev/null; done
> 
> ... doesn't hurt me because my root filesystem is on ext4 which doesn't
> purge the cache.  But anything using iomap gets all the pages for libc
> kicked out of the cache, and that's a lot of fun.

ext4 uses iomap..



Re: [Cluster-devel] always fall back to buffered I/O after invalidation failures, was: Re: [PATCH 2/6] iomap: IOMAP_DIO_RWF_NO_STALE_PAGECACHE return if page invalidation fails

2020-07-10 Thread Christoph Hellwig
This looks sane - slightly updated version below to not bother with
the ret and a few tidyups.

That being said and to get back to the discussion in this thread:
I think it would be saner to give up on direct I/O in case of the
invalidation failure.  I've cooked up a patch on top of this one
(for which I had a few trivial cleanups).  It is still under testing,
but I'll send the two out in a new thread.



[Cluster-devel] [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures

2020-07-13 Thread Christoph Hellwig
Failing to invalidate the page cache means data is incoherent, which is
a very bad state for the system.  Always fall back to buffered I/O
through the page cache if we can't invalidate mappings.

Signed-off-by: Christoph Hellwig 
---
 fs/ext4/file.c   |  2 ++
 fs/gfs2/file.c   |  3 ++-
 fs/iomap/direct-io.c | 13 -
 fs/iomap/trace.h |  1 +
 fs/xfs/xfs_file.c|  4 ++--
 fs/zonefs/super.c|  7 +--
 6 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2a01e31a032c4c..0da6c2a2c32c1e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -544,6 +544,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, 
struct iov_iter *from)
iomap_ops = &ext4_iomap_overwrite_ops;
ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
   is_sync_kiocb(iocb) || unaligned_io || extend);
+   if (ret == -EREMCHG)
+   ret = 0;
 
if (extend)
ret = ext4_handle_inode_extension(inode, offset, ret, count);
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index fe305e4bfd3734..c7907d40c61d17 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -814,7 +814,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, 
struct iov_iter *from)
 
ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
   is_sync_kiocb(iocb));
-
+   if (ret == -EREMCHG)
+   ret = 0;
 out:
gfs2_glock_dq(&gh);
 out_uninit:
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 190967e87b69e4..62626235cdbe8d 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include "trace.h"
 
 #include "../internal.h"
 
@@ -478,13 +479,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
if (iov_iter_rw(iter) == WRITE) {
/*
 * Try to invalidate cache pages for the range we are writing.
-* If this invalidation fails, tough, the write will still work,
-* but racing two incompatible write paths is a pretty crazy
-* thing to do, so we don't support it 100%.
+* If this invalidation fails, let the caller fall back to
+* buffered I/O.
 */
if (invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
-   end >> PAGE_SHIFT))
-   dio_warn_stale_pagecache(iocb->ki_filp);
+   end >> PAGE_SHIFT)) {
+   trace_iomap_dio_invalidate_fail(inode, pos, count);
+   ret = -EREMCHG;
+   goto out_free_dio;
+   }
 
if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
ret = sb_init_dio_done_wq(inode->i_sb);
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index 5693a39d52fb63..fdc7ae388476f5 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -74,6 +74,7 @@ DEFINE_EVENT(iomap_range_class, name, \
 DEFINE_RANGE_EVENT(iomap_writepage);
 DEFINE_RANGE_EVENT(iomap_releasepage);
 DEFINE_RANGE_EVENT(iomap_invalidatepage);
+DEFINE_RANGE_EVENT(iomap_dio_invalidate_fail);
 
 #define IOMAP_TYPE_STRINGS \
{ IOMAP_HOLE,   "HOLE" }, \
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 00db81eac80d6c..551cca39fa3ba6 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -553,8 +553,8 @@ xfs_file_dio_aio_write(
xfs_iunlock(ip, iolock);
 
/*
-* No fallback to buffered IO on errors for XFS, direct IO will either
-* complete fully or fail.
+* No partial fallback to buffered IO on errors for XFS, direct IO will
+* either complete fully or fail.
 */
ASSERT(ret < 0 || ret == count);
return ret;
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 07bc42d62673ce..793850454b752f 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -786,8 +786,11 @@ static ssize_t zonefs_file_write_iter(struct kiocb *iocb, 
struct iov_iter *from)
if (iocb->ki_pos >= ZONEFS_I(inode)->i_max_size)
return -EFBIG;
 
-   if (iocb->ki_flags & IOCB_DIRECT)
-   return zonefs_file_dio_write(iocb, from);
+   if (iocb->ki_flags & IOCB_DIRECT) {
+   ret = zonefs_file_dio_write(iocb, from);
+   if (ret != -EREMCHG)
+   return ret;
+   }
 
return zonefs_file_buffered_write(iocb, from);
 }
-- 
2.26.2



[Cluster-devel] [PATCH 1/2] iomap: Only invalidate page cache pages on direct IO writes

2020-07-13 Thread Christoph Hellwig
From: Dave Chinner 

The historic requirement for XFS to invalidate cached pages on
direct IO reads has been lost in the twisty pages of history - it was
inherited from Irix, which implemented page cache invalidation on
read as a method of working around problems synchronising page
cache state with uncached IO.

XFS has carried this ever since. In the initial linux ports it was
necessary to get mmap and DIO to play "ok" together and not
immediately corrupt data. This was the state of play until the linux
kernel had infrastructure to track unwritten extents and synchronise
page faults with allocations and unwritten extent conversions
(->page_mkwrite infrastructure). IOWs, the page cache invalidation
on DIO read was necessary to prevent trivial data corruptions. This
didn't solve all the problems, though.

There were performance problems if we didn't invalidate the entire
page cache over the file on read - we couldn't easily determine if
the cached pages were over the range of the IO, and invalidation
required taking a serialising lock (i_mutex) on the inode. This
serialising lock was an issue for XFS, as it was the only exclusive
lock in the direct IO read path.

Hence if there were any cached pages, we'd just invalidate the
entire file in one go so that subsequent IOs didn't need to take the
serialising lock. This was a problem that prevented ranged
invalidation from being particularly useful for avoiding the
remaining coherency issues. This was solved with the conversion of
i_mutex to i_rwsem and the conversion of the XFS inode IO lock to
use i_rwsem. Hence we could now just do ranged invalidation and the
performance problem went away.

However, page cache invalidation was still needed to serialise
sub-page/sub-block zeroing via direct IO against buffered IO because
bufferhead state attached to the cached page could get out of whack
when direct IOs were issued.  We've removed bufferheads from the
XFS code, and we don't carry any extent state on the cached pages
anymore, and so this problem has gone away, too.

IOWs, it would appear that we don't have any good reason to be
invalidating the page cache on DIO reads anymore. Hence remove the
invalidation on read because it is unnecessary overhead,
not needed to maintain coherency between mmap/buffered access and
direct IO anymore, and prevents anyone from using direct IO reads
from intentionally invalidating the page cache of a file.

Signed-off-by: Dave Chinner 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Matthew Wilcox (Oracle) 
Signed-off-by: Christoph Hellwig 
---
 fs/iomap/direct-io.c | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index ec7b78e6fecaf9..190967e87b69e4 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -475,23 +475,22 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
if (ret)
goto out_free_dio;
 
-   /*
-* Try to invalidate cache pages for the range we're direct
-* writing.  If this invalidation fails, tough, the write will
-* still work, but racing two incompatible write paths is a
-* pretty crazy thing to do, so we don't support it 100%.
-*/
-   ret = invalidate_inode_pages2_range(mapping,
-   pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
-   if (ret)
-   dio_warn_stale_pagecache(iocb->ki_filp);
-   ret = 0;
+   if (iov_iter_rw(iter) == WRITE) {
+   /*
+* Try to invalidate cache pages for the range we are writing.
+* If this invalidation fails, tough, the write will still work,
+* but racing two incompatible write paths is a pretty crazy
+* thing to do, so we don't support it 100%.
+*/
+   if (invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
+   end >> PAGE_SHIFT))
+   dio_warn_stale_pagecache(iocb->ki_filp);
 
-   if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
-   !inode->i_sb->s_dio_done_wq) {
-   ret = sb_init_dio_done_wq(inode->i_sb);
-   if (ret < 0)
-   goto out_free_dio;
+   if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
+   ret = sb_init_dio_done_wq(inode->i_sb);
+   if (ret < 0)
+   goto out_free_dio;
+   }
}
 
inode_dio_begin(inode);
-- 
2.26.2



[Cluster-devel] RFC: iomap write invalidation

2020-07-13 Thread Christoph Hellwig
Hi all,

this series has two parts:  the first one picks up Dave's patch to avoid
invalidation entirely for reads, picked up deep down from the btrfs iomap
thread.  The second one falls back to buffered writes if invalidation fails
instead of leaving a stale cache around.  Let me know what you think about
this approach.



Re: [Cluster-devel] [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures

2020-07-14 Thread Christoph Hellwig
On Mon, Jul 13, 2020 at 12:55:09PM +0100, Matthew Wilcox wrote:
> On Mon, Jul 13, 2020 at 09:46:33AM +0200, Christoph Hellwig wrote:
> > Failing to invalidate the page cache means data is incoherent, which is
> > a very bad state for the system.  Always fall back to buffered I/O
> > through the page cache if we can't invalidate mappings.
> 
> Is that the right approach though?  I don't have a full picture in my head,
> but wouldn't we be better off marking these pages as !Uptodate and doing
> the direct I/O?

Isn't that a problem if e.g. pages are mapped into userspace and mlocked?



Re: [Cluster-devel] [PATCH 2/2] iomap: fall back to buffered writes for invalidation failures

2020-07-14 Thread Christoph Hellwig
On Mon, Jul 13, 2020 at 08:39:20AM -0700, Darrick J. Wong wrote:
> -ENOTBLK is already being used as a "magic" return code that means
> "retry this direct write as a buffered write".  Shouldn't we use that
> instead?
> 
> -EREMCHG was a private hack we put in XFS for the one case where a
> direct write had to be done through the page cache (non block-aligned
> COW), but maybe it's time we put that to rest since the rest of the
> world apparently thinks the magic fallback code is -ENOTBLK.

Sure, I can switch the error code.



Re: [Cluster-devel] RFC: iomap write invalidation

2020-07-21 Thread Christoph Hellwig
On Mon, Jul 20, 2020 at 04:51:25PM -0500, Goldwyn Rodrigues wrote:
> Hi Christoph,
> 
> On  9:46 13/07, Christoph Hellwig wrote:
> > Hi all,
> > 
> > this series has two parts:  the first one picks up Dave's patch to avoid
> invalidation entirely for reads, picked up deep down from the btrfs iomap
> > thread.  The second one falls back to buffered writes if invalidation fails
> > instead of leaving a stale cache around.  Let me know what you think about
> this approach.
> 
> Which kernel version are these changes expected?
> btrfs dio switch to iomap depends on this.

No idea.  Darrick, are you ok with picking this up soon so that
Goldwyn can pull it in?



Re: [Cluster-devel] RFC: iomap write invalidation

2020-07-21 Thread Christoph Hellwig
On Tue, Jul 21, 2020 at 04:04:32PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 21, 2020 at 04:53:13PM +0200, Christoph Hellwig wrote:
> > On Mon, Jul 20, 2020 at 04:51:25PM -0500, Goldwyn Rodrigues wrote:
> > > Hi Christoph,
> > > 
> > > On  9:46 13/07, Christoph Hellwig wrote:
> > > > Hi all,
> > > > 
> > > > this series has two parts:  the first one picks up Dave's patch to avoid
> > > > invalidation entirely for reads, picked up deep down from the btrfs 
> > > > iomap
> > > > thread.  The second one falls back to buffered writes if invalidation 
> > > > fails
> > > > instead of leaving a stale cache around.  Let me know what you think 
> > > > about
> > > > this approach.
> > > 
> > > Which kernel version are these changes expected?
> > > btrfs dio switch to iomap depends on this.
> > 
> > No idea.  Darrick, are you ok with picking this up soon so that
> > Goldwyn can pull it in?
> 
> I thought you were going to respin this with EREMCHG changed to ENOTBLK?

Oh, true.  I'll do that ASAP.



Re: [Cluster-devel] RFC: iomap write invalidation

2020-07-21 Thread Christoph Hellwig
On Tue, Jul 21, 2020 at 04:14:37PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 21, 2020 at 05:06:15PM +0200, Christoph Hellwig wrote:
> > On Tue, Jul 21, 2020 at 04:04:32PM +0100, Matthew Wilcox wrote:
> > > I thought you were going to respin this with EREMCHG changed to ENOTBLK?
> > 
> > Oh, true.  I'll do that ASAP.
> 
> Michael, could we add this to manpages?

Umm, no.  -ENOTBLK is internal - the file systems will retry using
buffered I/O and the error shall never escape to userspace (or even the
VFS for that matter).
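
[Editorial note: the convention described above can be sketched as follows.
This is a minimal userspace model of the caller-side pattern, not actual
XFS/iomap code; dio_write(), buffered_write() and file_write() are
hypothetical stand-ins for the real write paths.]

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>	/* ssize_t */

/* Stand-in for the direct I/O path: it either completes fully or
 * returns -ENOTBLK when it cannot proceed (e.g. page cache
 * invalidation failed). */
static ssize_t dio_write(int invalidation_fails)
{
	if (invalidation_fails)
		return -ENOTBLK;
	return 4096;		/* full direct write completed */
}

/* Stand-in for the buffered path through the page cache. */
static ssize_t buffered_write(void)
{
	return 4096;
}

/* Caller-side convention: -ENOTBLK never escapes to userspace, it
 * only redirects the request to the buffered path. */
static ssize_t file_write(int invalidation_fails)
{
	ssize_t ret = dio_write(invalidation_fails);

	if (ret != -ENOTBLK)
		return ret;
	return buffered_write();
}
```

Either way the caller observes a successful write; the magic error code
is purely an internal signal between the two paths.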



Re: [Cluster-devel] RFC: iomap write invalidation

2020-07-21 Thread Christoph Hellwig
On Tue, Jul 21, 2020 at 08:27:54AM -0700, Darrick J. Wong wrote:
> On Tue, Jul 21, 2020 at 05:16:16PM +0200, Christoph Hellwig wrote:
> > On Tue, Jul 21, 2020 at 04:14:37PM +0100, Matthew Wilcox wrote:
> > > On Tue, Jul 21, 2020 at 05:06:15PM +0200, Christoph Hellwig wrote:
> > > > On Tue, Jul 21, 2020 at 04:04:32PM +0100, Matthew Wilcox wrote:
> > > > > I thought you were going to respin this with EREMCHG changed to 
> > > > > ENOTBLK?
> > > > 
> > > > Oh, true.  I'll do that ASAP.
> > > 
> > > Michael, could we add this to manpages?
> > 
> > Umm, no.  -ENOTBLK is internal - the file systems will retry using
> > buffered I/O and the error shall never escape to userspace (or even the
> > VFS for that matter).
> 
> It's worth dropping a comment somewhere that ENOTBLK is the desired
> "fall back to buffered" errcode, seeing as Dave and I missed that in
> XFS...

Sounds like a good idea, but what would a good place be?



Re: [Cluster-devel] RFC: iomap write invalidation

2020-07-21 Thread Christoph Hellwig
On Tue, Jul 21, 2020 at 04:31:36PM +0100, Matthew Wilcox wrote:
> > Umm, no.  -ENOTBLK is internal - the file systems will retry using
> > buffered I/O and the error shall never escape to userspace (or even the
> > VFS for that matter).
> 
> Ah, I made the mistake of believing the comments that I could see in
> your patch instead of reading the code.
> 
> Can I suggest deleting this comment:
> 
> /*
>  * No fallback to buffered IO on errors for XFS, direct IO will either
>  * complete fully or fail.
>  */
> 
> and rewording this one:
> 
> /*
>  * Allow a directio write to fall back to a buffered
>  * write *only* in the case that we're doing a reflink
>  * CoW.  In all other directio scenarios we do not
>  * allow an operation to fall back to buffered mode.
>  */
> 
> as part of your revised patchset?

That isn't actually true.  In current mainline we only fallback on
reflink RMW cases, but with this series we also fall back for
invalidation failures.



Re: [Cluster-devel] RFC: iomap write invalidation

2020-07-21 Thread Christoph Hellwig
On Tue, Jul 21, 2020 at 08:59:25AM -0700, Darrick J. Wong wrote:
> In the comment that precedes iomap_dio_rw() for the iomap version,

maybe let's just do that..

> ``direct_IO``
>   called by the generic read/write routines to perform direct_IO -
>   that is IO requests which bypass the page cache and transfer
>   data directly between the storage and the application's address
>   space.  This function can return -ENOTBLK to signal that it is
>   necessary to fallback to buffered IO.  Note that
>   blockdev_direct_IO and variants can also return -ENOTBLK.

->direct_IO is not used for iomap and various other implementations.
In fact it is a horrible hack that I've been trying to get rid of
for a while.



[Cluster-devel] [PATCH 2/3] iomap: Only invalidate page cache pages on direct IO writes

2020-07-21 Thread Christoph Hellwig
From: Dave Chinner 

The historic requirement for XFS to invalidate cached pages on
direct IO reads has been lost in the twisty pages of history - it was
inherited from Irix, which implemented page cache invalidation on
read as a method of working around problems synchronising page
cache state with uncached IO.

XFS has carried this ever since. In the initial linux ports it was
necessary to get mmap and DIO to play "ok" together and not
immediately corrupt data. This was the state of play until the linux
kernel had infrastructure to track unwritten extents and synchronise
page faults with allocations and unwritten extent conversions
(->page_mkwrite infrastructure). IOWs, the page cache invalidation
on DIO read was necessary to prevent trivial data corruptions. This
didn't solve all the problems, though.

There were performance problems if we didn't invalidate the entire
page cache over the file on read - we couldn't easily determine if
the cached pages were over the range of the IO, and invalidation
required taking a serialising lock (i_mutex) on the inode. This
serialising lock was an issue for XFS, as it was the only exclusive
lock in the direct IO read path.

Hence if there were any cached pages, we'd just invalidate the
entire file in one go so that subsequent IOs didn't need to take the
serialising lock. This was a problem that prevented ranged
invalidation from being particularly useful for avoiding the
remaining coherency issues. This was solved with the conversion of
i_mutex to i_rwsem and the conversion of the XFS inode IO lock to
use i_rwsem. Hence we could now just do ranged invalidation and the
performance problem went away.

However, page cache invalidation was still needed to serialise
sub-page/sub-block zeroing via direct IO against buffered IO because
bufferhead state attached to the cached page could get out of whack
when direct IOs were issued.  We've removed bufferheads from the
XFS code, and we don't carry any extent state on the cached pages
anymore, and so this problem has gone away, too.

IOWs, it would appear that we don't have any good reason to be
invalidating the page cache on DIO reads anymore. Hence remove the
invalidation on read because it is unnecessary overhead,
not needed to maintain coherency between mmap/buffered access and
direct IO anymore, and prevents anyone from using direct IO reads
from intentionally invalidating the page cache of a file.

Signed-off-by: Dave Chinner 
Reviewed-by: Darrick J. Wong 
Reviewed-by: Matthew Wilcox (Oracle) 
Signed-off-by: Christoph Hellwig 
---
 fs/iomap/direct-io.c | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index ec7b78e6fecaf9..190967e87b69e4 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -475,23 +475,22 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
if (ret)
goto out_free_dio;
 
-   /*
-* Try to invalidate cache pages for the range we're direct
-* writing.  If this invalidation fails, tough, the write will
-* still work, but racing two incompatible write paths is a
-* pretty crazy thing to do, so we don't support it 100%.
-*/
-   ret = invalidate_inode_pages2_range(mapping,
-   pos >> PAGE_SHIFT, end >> PAGE_SHIFT);
-   if (ret)
-   dio_warn_stale_pagecache(iocb->ki_filp);
-   ret = 0;
+   if (iov_iter_rw(iter) == WRITE) {
+   /*
+* Try to invalidate cache pages for the range we are writing.
+* If this invalidation fails, tough, the write will still work,
+* but racing two incompatible write paths is a pretty crazy
+* thing to do, so we don't support it 100%.
+*/
+   if (invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
+   end >> PAGE_SHIFT))
+   dio_warn_stale_pagecache(iocb->ki_filp);
 
-   if (iov_iter_rw(iter) == WRITE && !wait_for_completion &&
-   !inode->i_sb->s_dio_done_wq) {
-   ret = sb_init_dio_done_wq(inode->i_sb);
-   if (ret < 0)
-   goto out_free_dio;
+   if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
+   ret = sb_init_dio_done_wq(inode->i_sb);
+   if (ret < 0)
+   goto out_free_dio;
+   }
}
 
inode_dio_begin(inode);
-- 
2.27.0



[Cluster-devel] iomap write invalidation v2

2020-07-21 Thread Christoph Hellwig



Changes since v1:
 - use -ENOTBLK for the direct to buffered I/O fallback everywhere
 - document the choice of -ENOTBLK better
 - update a comment in XFS



[Cluster-devel] [PATCH 3/3] iomap: fall back to buffered writes for invalidation failures

2020-07-21 Thread Christoph Hellwig
Failing to invalidate the page cache means data is incoherent, which is
a very bad state for the system.  Always fall back to buffered I/O
through the page cache if we can't invalidate mappings.

Signed-off-by: Christoph Hellwig 
Acked-by: Dave Chinner 
Reviewed-by: Goldwyn Rodrigues 
---
 fs/ext4/file.c   |  2 ++
 fs/gfs2/file.c   |  3 ++-
 fs/iomap/direct-io.c | 16 +++-
 fs/iomap/trace.h |  1 +
 fs/xfs/xfs_file.c|  4 ++--
 fs/zonefs/super.c|  7 +--
 6 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2a01e31a032c4c..129cc1dd6b7952 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -544,6 +544,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, 
struct iov_iter *from)
iomap_ops = &ext4_iomap_overwrite_ops;
ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
   is_sync_kiocb(iocb) || unaligned_io || extend);
+   if (ret == -ENOTBLK)
+   ret = 0;
 
if (extend)
ret = ext4_handle_inode_extension(inode, offset, ret, count);
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index bebde537ac8cf2..b085a3bea4f0fd 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -835,7 +835,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, 
struct iov_iter *from)
 
ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
   is_sync_kiocb(iocb));
-
+   if (ret == -ENOTBLK)
+   ret = 0;
 out:
gfs2_glock_dq(&gh);
 out_uninit:
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 190967e87b69e4..c1aafb2ab99072 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include "trace.h"
 
 #include "../internal.h"
 
@@ -401,6 +402,9 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t 
length,
  * can be mapped into multiple disjoint IOs and only a subset of the IOs issued
  * may be pure data writes. In that case, we still need to do a full data sync
  * completion.
+ *
+ * Returns -ENOTBLK in case of a page invalidation failure for
+ * writes.  The caller needs to fall back to buffered I/O in this case.
  */
 ssize_t
 iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
@@ -478,13 +482,15 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
if (iov_iter_rw(iter) == WRITE) {
/*
 * Try to invalidate cache pages for the range we are writing.
-* If this invalidation fails, tough, the write will still work,
-* but racing two incompatible write paths is a pretty crazy
-* thing to do, so we don't support it 100%.
+* If this invalidation fails, let the caller fall back to
+* buffered I/O.
 */
if (invalidate_inode_pages2_range(mapping, pos >> PAGE_SHIFT,
-   end >> PAGE_SHIFT))
-   dio_warn_stale_pagecache(iocb->ki_filp);
+   end >> PAGE_SHIFT)) {
+   trace_iomap_dio_invalidate_fail(inode, pos, count);
+   ret = -ENOTBLK;
+   goto out_free_dio;
+   }
 
if (!wait_for_completion && !inode->i_sb->s_dio_done_wq) {
ret = sb_init_dio_done_wq(inode->i_sb);
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index 5693a39d52fb63..fdc7ae388476f5 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -74,6 +74,7 @@ DEFINE_EVENT(iomap_range_class, name, \
 DEFINE_RANGE_EVENT(iomap_writepage);
 DEFINE_RANGE_EVENT(iomap_releasepage);
 DEFINE_RANGE_EVENT(iomap_invalidatepage);
+DEFINE_RANGE_EVENT(iomap_dio_invalidate_fail);
 
 #define IOMAP_TYPE_STRINGS \
{ IOMAP_HOLE,   "HOLE" }, \
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a6ef90457abf97..1b4517fc55f1b9 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -553,8 +553,8 @@ xfs_file_dio_aio_write(
xfs_iunlock(ip, iolock);
 
/*
-* No fallback to buffered IO on errors for XFS, direct IO will either
-* complete fully or fail.
+* No fallback to buffered IO after short writes for XFS, direct I/O
+* will either complete fully or return an error.
 */
ASSERT(ret < 0 || ret == count);
return ret;
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 07bc42d62673ce..d0a04528a7e18e 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -786,8 +786,11 @@ static ssize_t zonefs_file_write_iter(struct kiocb *iocb, 
struct iov_iter *from)
if (iocb->ki_pos >= ZONEFS_I(inode)->i_max_size)
return -EFBIG;
 
-   if (iocb->ki_flags & IOCB_DIRECT)
-   return zonefs_file_dio_write(iocb, 

[Cluster-devel] [PATCH 1/3] xfs: use ENOTBLK for direct I/O to buffered I/O fallback

2020-07-21 Thread Christoph Hellwig
This is what the classic fs/direct-io.c implementation, and thus other
file systems, use.

Signed-off-by: Christoph Hellwig 
---
 fs/xfs/xfs_file.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 00db81eac80d6c..a6ef90457abf97 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -505,7 +505,7 @@ xfs_file_dio_aio_write(
 */
if (xfs_is_cow_inode(ip)) {
trace_xfs_reflink_bounce_dio_write(ip, iocb->ki_pos, 
count);
-   return -EREMCHG;
+   return -ENOTBLK;
}
iolock = XFS_IOLOCK_EXCL;
} else {
@@ -714,7 +714,7 @@ xfs_file_write_iter(
 * allow an operation to fall back to buffered mode.
 */
ret = xfs_file_dio_aio_write(iocb, from);
-   if (ret != -EREMCHG)
+   if (ret != -ENOTBLK)
return ret;
}
 
-- 
2.27.0



Re: [Cluster-devel] [PATCH 3/3] iomap: fall back to buffered writes for invalidation failures

2020-07-21 Thread Christoph Hellwig
On Tue, Jul 21, 2020 at 01:37:49PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 21, 2020 at 08:31:57PM +0200, Christoph Hellwig wrote:
> > Failing to invalidate the page cache means data is incoherent, which is
> > a very bad state for the system.  Always fall back to buffered I/O
> > through the page cache if we can't invalidate mappings.
> > 
> > Signed-off-by: Christoph Hellwig 
> > Acked-by: Dave Chinner 
> > Reviewed-by: Goldwyn Rodrigues 
> 
> For the iomap and xfs parts,
> Reviewed-by: Darrick J. Wong 
> 
> But I'd still like acks from Ted, Andreas, and Damien for ext4, gfs2,
> and zonefs, respectively.
> 
> (Particularly if anyone was harboring ideas about trying to get this in
> before 5.10, though I've not yet heard anyone say that explicitly...)

Why would we want to wait another whole merge window?



[Cluster-devel] [PATCH] gfs2: fix file_system_type leak on gfs2meta mount

2008-02-25 Thread Christoph Hellwig

get_gfs2_sb does a get_fs_type without doing a put_filesystem and
thus leaking a file_system_type reference everytime it's called.

Just use gfs2_fs_type directly instead of doing the lookup and thus
fix the problem.


Signed-off-by: Christoph Hellwig <[EMAIL PROTECTED]>

Index: linux-2.6/fs/gfs2/ops_fstype.c
===
--- linux-2.6.orig/fs/gfs2/ops_fstype.c 2008-02-25 18:53:11.0 +0100
+++ linux-2.6/fs/gfs2/ops_fstype.c  2008-02-25 18:53:36.0 +0100
@@ -874,7 +874,6 @@ static struct super_block* get_gfs2_sb(c
 {
struct kstat stat;
struct nameidata nd;
-   struct file_system_type *fstype;
struct super_block *sb = NULL, *s;
int error;
 
@@ -886,8 +885,7 @@ static struct super_block* get_gfs2_sb(c
}
error = vfs_getattr(nd.path.mnt, nd.path.dentry, &stat);
 
-   fstype = get_fs_type("gfs2");
-   list_for_each_entry(s, &fstype->fs_supers, s_instances) {
+   list_for_each_entry(s, &gfs2_fs_type.fs_supers, s_instances) {
if ((S_ISBLK(stat.mode) && s->s_dev == stat.rdev) ||
(S_ISDIR(stat.mode) &&
 s == nd.path.dentry->d_inode->i_sb)) {



[Cluster-devel] Re: [PATCH 36/48] [GFS2] fix file_system_type leak on gfs2meta mount

2008-04-17 Thread Christoph Hellwig
On Thu, Apr 17, 2008 at 09:39:12AM +0100, [EMAIL PROTECTED] wrote:
> From: Christoph Hellwig <[EMAIL PROTECTED]>
> 
> get_gfs2_sb does a get_fs_type without doing a put_filesystem and
> thus leaking a file_system_type reference everytime it's called.
> 
> Just use gfs2_fs_type directly instead of doing the lookup and thus
> fix the problem.

Btw, after this patch we should remove the get_fs_type export.  It's
not used by modules and because put_filesystem is not exported every
modular user would fundamentally have the same leak as the one fixed in
this patch.



[Cluster-devel] Re: [Jfs-discussion] Unneeded kernel threads (xfs, jfs, gfs2)

2008-05-13 Thread Christoph Hellwig
On Tue, May 13, 2008 at 07:03:11PM +1000, David Chinner wrote:
> Sure - XFS will start another three kernel threads per filesystem
> that gets mounted. And for good measure, it cleans them up again
> on unmount. :)
> 
> The other threads are per-cpu workqueue threads that are shared
> across all XFS filesystems in the system and hence are started
> when XFS is initialised rather than when a mount occurs.

Well, we could refcount the number of active xfs instances and
start/stop the global threads based on that.  Not really worth my
time IMHO, but if someone comes up with a clean enough patch it should
go in.



[Cluster-devel] Re: [Jfs-discussion] Unneeded kernel threads (xfs, jfs, gfs2)

2008-05-13 Thread Christoph Hellwig
On Tue, May 13, 2008 at 05:21:56AM -0400, Christoph Hellwig wrote:
> Well, we could refcount the number of active xfs instances and
> start/stop the global threads based on that.  Not really worth my
> time IHMO, but if someone comes up with a clean enough patch it should
> go in.

Actually doing it in the VFS might be even better.  Add ->init and
->exit methods to struct file_system_type and then the filesystems can
move most of module_init/exit into the new methods.
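
[Editorial note: a minimal userspace model of the idea discussed above,
i.e. per-filesystem-type init/exit hooks driven by a mount refcount so
that global resources such as shared worker threads only exist while at
least one instance is mounted. All names here (struct fs_type, fs_mount,
fs_unmount, etc.) are hypothetical, not the actual VFS API.]

```c
#include <assert.h>
#include <stddef.h>

struct fs_type {
	int active;		/* number of mounted instances */
	int threads_running;	/* stand-in for global worker threads */
	void (*init)(struct fs_type *);
	void (*exit)(struct fs_type *);
};

static void fs_start_threads(struct fs_type *fs) { fs->threads_running = 1; }
static void fs_stop_threads(struct fs_type *fs)  { fs->threads_running = 0; }

static void fs_mount(struct fs_type *fs)
{
	/* first mount: bring up the type-global state */
	if (fs->active++ == 0 && fs->init)
		fs->init(fs);
}

static void fs_unmount(struct fs_type *fs)
{
	/* last unmount: tear it down again */
	if (--fs->active == 0 && fs->exit)
		fs->exit(fs);
}
```

With this, module_init/module_exit only register the type, and the heavy
lifting moves into the first-mount/last-unmount transitions.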



[Cluster-devel] Re: GFS2: Add blktrace support to glocks

2009-02-19 Thread Christoph Hellwig
On Thu, Feb 19, 2009 at 04:55:39PM +, Steven Whitehouse wrote:
> Hi,
> 
> Since I last posted this pair of patches, I've done some extensive
> updating of the kernel patch, so it should now be happy to compile
> under all possible Kconfigs (fingers crossed) and also its a fair
> bit cleaner too.
> 
> I'm adding the linux-btrace list, since I didn't know about that
> list when I made the initial posting.
> 
> Since there is probably more GFS2 changes than blktrace changes, I
> could push this through the GFS2 tree. Let me know if you'd prefer
> it to go via the blktrace tree. I'd like to be able to push this
> in at the next merge window if possible. This patch is against the
> head of the GFS2 -nmw git tree (obviously that makes no difference
> to the blktrace side of the patch).
> 
> An updated blktrace userland patch follows in the next email, although
> the changes from the last version are fairly minor,

Umm, why would you put fs internal stuff into blktrace?  Just use
the generic ringbuffer directly and trace into that one.



[Cluster-devel] [PATCH] gfs2: cleanup file_operations mess

2009-04-07 Thread Christoph Hellwig
Remove the weird pointer-to-file_operations mess and replace it with
straightforward defines mapping the locking instance names to the
_nolock variants.


Signed-off-by: Christoph Hellwig 

Index: linux-2.6/fs/gfs2/inode.c
===
--- linux-2.6.orig/fs/gfs2/inode.c  2009-04-06 14:17:05.667570170 +0200
+++ linux-2.6/fs/gfs2/inode.c   2009-04-07 19:21:25.250572236 +0200
@@ -137,15 +137,15 @@ void gfs2_set_iop(struct inode *inode)
if (S_ISREG(mode)) {
inode->i_op = &gfs2_file_iops;
if (gfs2_localflocks(sdp))
-   inode->i_fop = gfs2_file_fops_nolock;
+   inode->i_fop = &gfs2_file_fops_nolock;
else
-   inode->i_fop = gfs2_file_fops;
+   inode->i_fop = &gfs2_file_fops;
} else if (S_ISDIR(mode)) {
inode->i_op = &gfs2_dir_iops;
if (gfs2_localflocks(sdp))
-   inode->i_fop = gfs2_dir_fops_nolock;
+   inode->i_fop = &gfs2_dir_fops_nolock;
else
-   inode->i_fop = gfs2_dir_fops;
+   inode->i_fop = &gfs2_dir_fops;
} else if (S_ISLNK(mode)) {
inode->i_op = &gfs2_symlink_iops;
} else {
Index: linux-2.6/fs/gfs2/inode.h
===
--- linux-2.6.orig/fs/gfs2/inode.h  2009-04-06 14:17:05.671570051 +0200
+++ linux-2.6/fs/gfs2/inode.h   2009-04-07 19:21:08.357447364 +0200
@@ -101,21 +101,23 @@ void gfs2_dinode_print(const struct gfs2
 extern const struct inode_operations gfs2_file_iops;
 extern const struct inode_operations gfs2_dir_iops;
 extern const struct inode_operations gfs2_symlink_iops;
-extern const struct file_operations *gfs2_file_fops_nolock;
-extern const struct file_operations *gfs2_dir_fops_nolock;
+extern const struct file_operations gfs2_file_fops_nolock;
+extern const struct file_operations gfs2_dir_fops_nolock;
 
 extern void gfs2_set_inode_flags(struct inode *inode);
  
 #ifdef CONFIG_GFS2_FS_LOCKING_DLM
-extern const struct file_operations *gfs2_file_fops;
-extern const struct file_operations *gfs2_dir_fops;
+extern const struct file_operations gfs2_file_fops;
+extern const struct file_operations gfs2_dir_fops;
+
 static inline int gfs2_localflocks(const struct gfs2_sbd *sdp)
 {
return sdp->sd_args.ar_localflocks;
 }
 #else /* Single node only */
-#define gfs2_file_fops NULL
-#define gfs2_dir_fops NULL
+#define gfs2_file_fops gfs2_file_fops_nolock
+#define gfs2_dir_fops gfs2_dir_fops_nolock
+
 static inline int gfs2_localflocks(const struct gfs2_sbd *sdp)
 {
return 1;
Index: linux-2.6/fs/gfs2/ops_file.c
===
--- linux-2.6.orig/fs/gfs2/ops_file.c   2009-04-06 14:17:05.716570192 +0200
+++ linux-2.6/fs/gfs2/ops_file.c2009-04-07 17:56:32.55433 +0200
@@ -705,7 +705,7 @@ static int gfs2_flock(struct file *file,
}
 }
 
-const struct file_operations *gfs2_file_fops = &(const struct file_operations){
+const struct file_operations gfs2_file_fops = {
.llseek = gfs2_llseek,
.read   = do_sync_read,
.aio_read   = generic_file_aio_read,
@@ -723,7 +723,7 @@ const struct file_operations *gfs2_file_
.setlease   = gfs2_setlease,
 };
 
-const struct file_operations *gfs2_dir_fops = &(const struct file_operations){
+const struct file_operations gfs2_dir_fops = {
.readdir= gfs2_readdir,
.unlocked_ioctl = gfs2_ioctl,
.open   = gfs2_open,
@@ -735,7 +735,7 @@ const struct file_operations *gfs2_dir_f
 
 #endif /* CONFIG_GFS2_FS_LOCKING_DLM */
 
-const struct file_operations *gfs2_file_fops_nolock = &(const struct 
file_operations){
+const struct file_operations gfs2_file_fops_nolock = {
.llseek = gfs2_llseek,
.read   = do_sync_read,
.aio_read   = generic_file_aio_read,
@@ -751,7 +751,7 @@ const struct file_operations *gfs2_file_
.setlease   = generic_setlease,
 };
 
-const struct file_operations *gfs2_dir_fops_nolock = &(const struct 
file_operations){
+const struct file_operations gfs2_dir_fops_nolock = {
.readdir= gfs2_readdir,
.unlocked_ioctl = gfs2_ioctl,
.open   = gfs2_open,



[Cluster-devel] Re: GFS2: Add tracepoints

2009-05-29 Thread Christoph Hellwig
On Fri, May 29, 2009 at 08:11:00PM +0100, Steven Whitehouse wrote:
> +static inline u8 glock_trace_state(unsigned int state)
> +{
> + switch(state) {
> + case LM_ST_SHARED:
> + return DLM_LOCK_PR;
> + case LM_ST_DEFERRED:
> + return DLM_LOCK_CW;
> + case LM_ST_EXCLUSIVE:
> + return DLM_LOCK_EX;
> + }
> + return DLM_LOCK_NL;
> +}

I think this would be better done using __print_symbolic.
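
[Editorial note: a sketch of what that suggestion would look like inside
the TRACE_EVENT() definition. This is a kernel-only macro fragment, not
standalone code; the DLM_LOCK_* to name mapping mirrors the switch
statement quoted above.]

```c
/* Map the raw state value to a name at trace-output time instead of
 * via per-state helper functions: */
TP_printk("state %s",
	  __print_symbolic(__entry->cur_state,
			   { DLM_LOCK_PR, "PR" },
			   { DLM_LOCK_CW, "CW" },
			   { DLM_LOCK_EX, "EX" },
			   { DLM_LOCK_NL, "NL" }))
```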

> +static inline const char *glock_trace_name(u8 state)
> +{
> + switch(state) {
> + case DLM_LOCK_PR:
> + return "PR";
> + case DLM_LOCK_CW:
> + return "CW";
> + case DLM_LOCK_EX:
> + return "EX";
> + case DLM_LOCK_NL:
> + return "NL";
> + default:
> + return "--";
> + }
> +}

Same here.

> +static inline const char *gfs2_blkst_name(u8 state)
> +{
> + switch(state) {
> + case GFS2_BLKST_FREE:
> + return "free";
> + case GFS2_BLKST_USED:
> + return "used";
> + case GFS2_BLKST_DINODE:
> + return "dinode";
> + case GFS2_BLKST_UNLINKED:
> + return "unlinked";
> + }
> + return "???";
> +}

Same here.

> + TP_STRUCT__entry(
> + __array(char,   devname, BDEVNAME_SIZE  )

This is extremely inefficient.  We'd be much better off just storing
the dev_t and introducing a __trace_bdevname to expand a bdevname
into the tracer buffer.  It's been on my todo list for a while and
I'll look into it next week unless you beat me to it.

> + __entry->gltype = gl->gl_name.ln_type;
> + __entry->cur_state  = glock_trace_state(gl->gl_state);
> + __entry->new_state  = glock_trace_state(new_state);
> + __entry->tgt_state  = glock_trace_state(gl->gl_target);
> + __entry->dmt_state  = DLM_LOCK_IV;
> + if (test_bit(GLF_DEMOTE, &gl->gl_flags) ||
> + test_bit(GLF_PENDING_DEMOTE, &gl->gl_flags))
> + __entry->dmt_state  = 
> glock_trace_state(gl->gl_demote_state);

Wouldn't it be better to just trace gl_flags and gl_demote_state?

> +);
> +
> +#endif /* _TRACE_GFS2_H */
> +
> +/* This part must be outside protection */
> +#undef TRACE_INCLUDE_PATH
> +#define TRACE_INCLUDE_PATH ../../fs/gfs2/

Shouldn't an

#define TRACE_INCLUDE_PATH .

do it?  (although that would need -I$(src), not sure how good that is.)



[Cluster-devel] Re: GFS2: Add tracepoints

2009-05-30 Thread Christoph Hellwig
On Fri, May 29, 2009 at 08:50:10PM +0100, Steven Whitehouse wrote:
> > This is extremely inefficient.  We'd be much better off just storing
> > the dev_t and introducing a __trace_bdevname to expand a bdevname
> > into the tracer buffer.  It's been on my todo list for a while and
> > I'll look into it next week unless you beat me to it.
> > 
> Ok. I wasn't sure how efficient bdevname() was vs. copying the name
> around, but that sounds like a good plan if its not too expensive an
> operation.

Ok, I finally started looking into and it's not that easy.  bdevname
actually requires a struct block_device * which we can't safely store
and __bdevname is just an obfuscated way to print major and minor.

I'd still hate copying the whole device name as it's rather inefficient,
on the other hand it's quite a bit nicer than the raw major/minor
output.

At this point I think I would prefer the raw major/minor output as
done in the block trace event patches.  But what I think is most
important is that we all (block trace events, all filesystem trace
events (current xfs + gfs2 + ext4, and possibly vfs/vm trace points)
agree on one single format so that we can do global filtering over
all of them.



[Cluster-devel] Re: GFS2: Allow all meta/normal mount combinations

2009-06-05 Thread Christoph Hellwig
On Fri, Jun 05, 2009 at 02:52:16PM +0100, Steven Whitehouse wrote:
> +static int gfs2_get_sb(struct file_system_type *fs_type, int flags,
> +const char *dev_name, void *data, struct vfsmount *mnt)
> +{
> + struct super_block *s;
> + struct gfs2_sbd *sdp;
> + int ret;
> +
> + /* First we assume its a block device */
> + ret = get_sb_bdev(fs_type, flags, dev_name, data, fill_super, mnt);
> + if (ret != -ENOTBLK)
> + return ret;
> +
> + /* If that fails, we assume its a GFS2 inode on an existing sb */
> + s = get_gfs2_sb(dev_name);
> + if (IS_ERR(s)) {
> + printk(KERN_WARNING "GFS2: gfs2 mount does not exist\n");
> + return PTR_ERR(s);
> + }

This is pretty ugly.  Even if this is how the old gfs2meta filesystem
worked I would prefer to only allow it if mounted as type gfs2meta, not
for normal gfs2 mount and gradually phase it out.

> + sdp = s->s_fs_info;
> + mnt->mnt_sb = s;
> + mnt->mnt_root = gfs2_is_meta_fs(data) ? dget(sdp->sd_master_dir) :
> + dget(sdp->sd_root_dir);

if (gfs2_is_meta_fs(data))
mnt->mnt_root = dget(sdp->sd_master_dir);
else
mnt->mnt_root = dget(sdp->sd_root_dir);

would be a lot more readable..



[Cluster-devel] Re: GFS2: Allow all meta/normal mount combinations

2009-06-08 Thread Christoph Hellwig
On Fri, Jun 05, 2009 at 03:17:43PM +0100, Steven Whitehouse wrote:
> > This is pretty ugly.  Even if this is how the old gfs2meta filesystem
> > worked I would prefer to only allow it if mounted as type gfs2meta, not
> > for normal gfs2 mount and gradually phase it out.
> > 
> Which bit is ugly? We need to be able to do this to be sure that we can
> get a metafs which exactly matches the "normal" fs without races I
> think. That's a requirement of the userland tools, unfortunately.

Well, the block device from /proc/self/mounts really is a unique
key you can get.

If you really insist on using a path to a file make sure that normal
mounts only take the block device, and meta mounts only take the file.

But even with that it's not a very clear interface.



[Cluster-devel] Re: [PATCH 10/24] GFS2: Don't warn when delete inode fails on ro filesystem

2009-06-10 Thread Christoph Hellwig
On Wed, Jun 10, 2009 at 09:30:51AM +0100, Steven Whitehouse wrote:
> If the filesystem is read-only, then we expect that delete inode
> will fail, so there is no need to warn about it.

Umm, ->delete_inode should never be called on a read-only filesystem
by the VFS.



[Cluster-devel] Re: [PATCH 23/24] GFS2: Fix locking issue mounting gfs2meta fs

2009-06-10 Thread Christoph Hellwig
On Wed, Jun 10, 2009 at 09:31:04AM +0100, Steven Whitehouse wrote:
>  static struct super_block *get_gfs2_sb(const char *dev_name)
>  {
> - struct super_block *sb;

Any reason you don't merge get_gfs2_sb into gfs2_get_sb_meta?  That's
how most other get_sb callbacks are structured, and would also get rid
of the rather confusing get_gfs2_sb name.



[Cluster-devel] Re: move gfs2 tracepoints to inclue/trace/events dir

2009-10-09 Thread Christoph Hellwig
On Fri, Oct 09, 2009 at 12:01:16PM -0400, Jason Baron wrote:
> hi,
> 
> I'd like to move the gfs2 tracepoints to the the common
> include/trace/events directory along with all of the other trace events.
> It makes understanding what tracepoints are available easier, and I see
> no reason why gfs2 should be different. For example, 'ext4.h' is already
> in the include/trace/events directory.

Folks, no.  Drivers and filesystems should be as self-contained as
possible.  include/trace/ is an extremely bad idea for everything that's
not actually global kernel functionality.  There's a reason all other
fs headers have moved out of include/linux, too.



[Cluster-devel] Re: move gfs2 tracepoints to inclue/trace/events dir

2009-10-25 Thread Christoph Hellwig
On Mon, Oct 12, 2009 at 12:00:37PM +0200, Ingo Molnar wrote:
> yeah. I have no objection to adding it to include/trace/. Tracepoints 
> are a fundamentally global business.
> 
> Subsystems can opt to hide their tracepoints locally, but it's better to 
> have a global view about what's out there, so that it can be extended 
> coherently, etc.

We're lacking quite a bit of coherence even with it.  The original reason
why they were global was that the infrastructure couldn't cope with
having them either in modules or elsewhere in the source tree at all.

We have managed to avoid global directories for drivers/filesystems for
as much as we can lately.  Having everything in a directory makes sure
it's self-contained and people don't use it accidentally from other
modules, which also applies to trace events - we don't want people
accidentally using gfs2 tracepoints from a driver (and if you think
that's far fetched look at the recent example of a driver using
debugging macros from the networking code that got pulled in
accidentally somewhere).




[Cluster-devel] [PATCH] gfs2: add barrier/nobarrier mount options

2009-10-30 Thread Christoph Hellwig

Currently gfs2 issues barriers unconditionally.  There are various reasons
to disable them, be that just for testing or for stupid devices flushing
large battery-backed caches.  Add a nobarrier option that matches xfs and
btrfs for this.  Also add a symmetric barrier option to turn it back on
at remount time.

Signed-off-by: Christoph Hellwig 

Index: linux-2.6/fs/gfs2/incore.h
===
--- linux-2.6.orig/fs/gfs2/incore.h 2009-10-30 07:43:42.246023792 +0100
+++ linux-2.6/fs/gfs2/incore.h  2009-10-30 07:44:11.173255988 +0100
@@ -429,6 +429,7 @@ struct gfs2_args {
unsigned int ar_meta:1; /* mount metafs */
unsigned int ar_discard:1;  /* discard requests */
unsigned int ar_errors:2;   /* errors=withdraw | panic */
+   unsigned int ar_nobarrier:1;/* do not send barriers */
int ar_commit;  /* Commit interval */
 };
 
Index: linux-2.6/fs/gfs2/super.c
===
--- linux-2.6.orig/fs/gfs2/super.c  2009-10-30 07:44:29.832024397 +0100
+++ linux-2.6/fs/gfs2/super.c   2009-10-30 07:53:24.117033618 +0100
@@ -70,6 +70,8 @@ enum {
Opt_commit,
Opt_err_withdraw,
Opt_err_panic,
+   Opt_barrier,
+   Opt_nobarrier,
Opt_error,
 };
 
@@ -98,6 +100,8 @@ static const match_table_t tokens = {
{Opt_meta, "meta"},
{Opt_discard, "discard"},
{Opt_nodiscard, "nodiscard"},
+   {Opt_barrier, "barrier"},
+   {Opt_nobarrier, "nobarrier"},
{Opt_commit, "commit=%d"},
{Opt_err_withdraw, "errors=withdraw"},
{Opt_err_panic, "errors=panic"},
@@ -207,6 +211,12 @@ int gfs2_mount_args(struct gfs2_sbd *sdp
case Opt_nodiscard:
args->ar_discard = 0;
break;
+   case Opt_barrier:
+   args->ar_nobarrier = 0;
+   break;
+   case Opt_nobarrier:
+   args->ar_nobarrier = 1;
+   break;
case Opt_commit:
rv = match_int(&tmp[0], &args->ar_commit);
if (rv || args->ar_commit <= 0) {
@@ -1097,6 +1107,10 @@ static int gfs2_remount_fs(struct super_
sb->s_flags |= MS_POSIXACL;
else
sb->s_flags &= ~MS_POSIXACL;
+   if (sdp->sd_args.ar_nobarrier)
+   set_bit(SDF_NOBARRIERS, &sdp->sd_flags);
+   else
+   clear_bit(SDF_NOBARRIERS, &sdp->sd_flags);
spin_lock(&gt->gt_spin);
gt->gt_log_flush_secs = args.ar_commit;
spin_unlock(&gt->gt_spin);
Index: linux-2.6/fs/gfs2/ops_fstype.c
===
--- linux-2.6.orig/fs/gfs2/ops_fstype.c 2009-10-30 07:52:11.050003877 +0100
+++ linux-2.6/fs/gfs2/ops_fstype.c  2009-10-30 07:52:53.053005337 +0100
@@ -1143,6 +1143,8 @@ static int fill_super(struct super_block
}
if (sdp->sd_args.ar_posix_acl)
sb->s_flags |= MS_POSIXACL;
+   if (sdp->sd_args.ar_nobarrier)
+   set_bit(SDF_NOBARRIERS, &sdp->sd_flags);
 
sb->s_magic = GFS2_MAGIC;
sb->s_op = &gfs2_super_ops;



Re: [Cluster-devel] [PATCHv2 00/12]posix_acl: Add the check items

2009-12-21 Thread Christoph Hellwig
I like taking these checks into posix_acl_valid, but I think the patch
submission needs a bit more work.

All the patches are extremely whitespace mangled.  And I don't think
splitting them up makes a whole lot of sense; when we do API changes
like this we usually fix up all callers.  So please try to fix your
mailer, merge them all into one, and maybe also choose a more
descriptive subject line, e.g.

Subject: take checks for NULL and error pointers into posix_acl_valid()


And btw, at least the XFS change seems incorrect - previously we
returned NULL acl pointer and this patch changes it to -EINVAL.



Re: [Cluster-devel] [PATCH 2/2] gfs2: Use CALLER_ADDR0 macro.

2009-12-21 Thread Christoph Hellwig
On Mon, Dec 21, 2009 at 09:22:41AM +, Steven Whitehouse wrote:
> Hi,
> 
> Looks fairly harmless, even though I don't entirely see the point. Do
> you want me to add this to me tree?

We also have a _RET_IP_ macro that does the same and has a much more
useful name.  And given that we need the address as a void pointer to
print it using our printk extension, both seem a bit suboptimal.

Given that a lot of tracing code needs this, a bit of standardization
would be useful - more important for the type used to store the address
than just the macro name.



Re: [Cluster-devel] [PATCH 1/4] gfs2: add IO submission trace points

2010-02-05 Thread Christoph Hellwig
On Fri, Feb 05, 2010 at 09:49:54AM +, Steven Whitehouse wrote:
> Hi,
> 
> On Fri, 2010-02-05 at 16:45 +1100, Dave Chinner wrote:
> > Useful for tracking down where specific IOs are being issued
> > from.
> > 
> > Signed-off-by: Dave Chinner 
> > ---
> >  fs/gfs2/log.c|6 ++
> >  fs/gfs2/lops.c   |6 ++
> >  fs/gfs2/trace_gfs2.h |   41 +
> >  3 files changed, 53 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
> > index 4511b08..bd26dff 100644
> > --- a/fs/gfs2/log.c
> > +++ b/fs/gfs2/log.c
> > @@ -121,6 +121,7 @@ __acquires(&sdp->sd_log_lock)
> > lock_buffer(bh);
> > if (test_clear_buffer_dirty(bh)) {
> > bh->b_end_io = end_buffer_write_sync;
> > +   trace_gfs2_submit_bh(bh, WRITE_SYNC_PLUG, __func__);
> > submit_bh(WRITE_SYNC_PLUG, bh);
> This looks like it could be a generically useful function, I wonder if
> it would be possible to do this directly in submit_bh, since we should
> be able to use __builtin_return_address(0) to find out the origin of the
> call?

Yes, we could.  The tracing code normally uses the _RET_IP_ wrapper
for __builtin_return_address(0) for some reason, though.



[Cluster-devel] [PATCH] gfs2: do not select QUOTA

2010-03-03 Thread Christoph Hellwig
gfs2 only needs the quotactl code, not the generic quota implementation.

Signed-off-by: Christoph Hellwig 

Index: linux-2.6/fs/gfs2/Kconfig
===
--- linux-2.6.orig/fs/gfs2/Kconfig  2010-03-03 14:48:00.292026869 +0100
+++ linux-2.6/fs/gfs2/Kconfig   2010-03-03 14:48:03.546284090 +0100
@@ -8,7 +8,6 @@ config GFS2_FS
select FS_POSIX_ACL
select CRC32
select SLOW_WORK
-   select QUOTA
select QUOTACTL
help
  A cluster filesystem.



[Cluster-devel] [PATCH] gfs2: fix quota state reporting

2010-05-04 Thread Christoph Hellwig
We need to report both the accounting and enforcing flags if we are
in enforcing mode.

Signed-off-by: Christoph Hellwig 

Index: xfs/fs/gfs2/quota.c
===
--- xfs.orig/fs/gfs2/quota.c2010-05-04 23:16:59.718256886 +0200
+++ xfs/fs/gfs2/quota.c 2010-05-04 23:54:25.64328 +0200
@@ -1418,10 +1418,18 @@ static int gfs2_quota_get_xstate(struct
 
memset(fqs, 0, sizeof(struct fs_quota_stat));
fqs->qs_version = FS_QSTAT_VERSION;
-   if (sdp->sd_args.ar_quota == GFS2_QUOTA_ON)
-   fqs->qs_flags = (XFS_QUOTA_UDQ_ENFD | XFS_QUOTA_GDQ_ENFD);
-   else if (sdp->sd_args.ar_quota == GFS2_QUOTA_ACCOUNT)
-   fqs->qs_flags = (XFS_QUOTA_UDQ_ACCT | XFS_QUOTA_GDQ_ACCT);
+
+   switch (sdp->sd_args.ar_quota) {
+   case GFS2_QUOTA_ON:
+   fqs->qs_flags |= (XFS_QUOTA_UDQ_ENFD | XFS_QUOTA_GDQ_ENFD);
+   /*FALLTHRU*/
+   case GFS2_QUOTA_ACCOUNT:
+   fqs->qs_flags |= (XFS_QUOTA_UDQ_ACCT | XFS_QUOTA_GDQ_ACCT);
+   break;
+   case GFS2_QUOTA_OFF:
+   break;
+   }
+
if (sdp->sd_quota_inode) {
fqs->qs_uquota.qfs_ino = GFS2_I(sdp->sd_quota_inode)->i_no_addr;
fqs->qs_uquota.qfs_nblks = sdp->sd_quota_inode->i_blocks;



Re: [Cluster-devel] [PATCH 1/3] GFS2: Rework reclaiming unlinked dinodes

2010-05-25 Thread Christoph Hellwig
On Tue, May 25, 2010 at 09:21:27AM +0100, Steven Whitehouse wrote:
> From: Bob Peterson 
> 
> The previous patch I wrote for reclaiming unlinked dinodes
> had some shortcomings and did not prevent all hangs.
> This version is much cleaner and more logical, and has
> passed very difficult testing.  Sorry for the churn.

It would be much more helpful to have a commit message that describes
what is changed and why.  The above one won't tell anything to someone
reading through the commit log later on.



[Cluster-devel] [PATCH] gfs2: stop using mpage_writepage

2010-06-07 Thread Christoph Hellwig
GFS2 always creates buffer_heads during ->write_begin or ->page_mkwrite,
which means mpage_writepage always falls back to block_write_full_page.

So stop calling mpage_writepage and always call block_write_full_page
directly.

Signed-off-by: Christoph Hellwig 

Index: linux-2.6/fs/gfs2/aops.c
===
--- linux-2.6.orig/fs/gfs2/aops.c   2010-06-07 11:22:42.239273013 +0200
+++ linux-2.6/fs/gfs2/aops.c2010-06-07 11:23:23.563255580 +0200
@@ -136,10 +136,7 @@ static int gfs2_writeback_writepage(stru
if (ret <= 0)
return ret;
 
-   ret = mpage_writepage(page, gfs2_get_block_noalloc, wbc);
-   if (ret == -EAGAIN)
-   ret = block_write_full_page(page, gfs2_get_block_noalloc, wbc);
-   return ret;
+   return block_write_full_page(page, gfs2_get_block_noalloc, wbc);
 }
 
 /**



Re: [Cluster-devel] GFS2: Wait for journal id on mount if not specified on mount command line

2010-06-08 Thread Christoph Hellwig
On Mon, Jun 07, 2010 at 12:34:14PM -0500, David Teigland wrote:
> IIRC, nfs recently moved to using a mount helper after *not* using one for
> many years.  It would be interesting to ask them about their motivations.

That's not correct at all.  NFS has required special mount code since
day 1 and continues to do so.  The only change is that this code used
to be built into the mount binary and now has moved to a helper in the
nfs-tools package.



Re: [Cluster-devel] GFS2: Make . and .. qstrs constant

2010-09-08 Thread Christoph Hellwig
On Wed, Sep 08, 2010 at 04:35:20PM +0100, Steven Whitehouse wrote:
> From bc9fb728211162a24afd341c9ffb85e7b459fb8d Mon Sep 17 00:00:00 2001
> From: Steven Whitehouse 
> Date: Wed, 8 Sep 2010 16:02:13 +0100
> Subject: GFS2: Make . and .. qstrs constant
> 
> Rather than calculating the qstrs for . and .. each time
> we need them, its better to keep a constant version of
> these and just refer to them when required.
> 
> Signed-off-by: Steven Whitehouse 
> 
> +const struct qstr gfs2_qdot = { .name = ".", .len = 1, .hash = 0xed4e242};
> +const struct qstr gfs2_qdotdot = { .name = "..", .len = 2, .hash = 0x9608161c};

I don't think hardcoding the values for the current dcache hash is a
good idea.  We have already changed that hash in not too distant history
and chances are we'll do so again at some point.

Just calculate it once at startup time using the normal hash function.
Extra points for doing it in core code and finding places in other
filesystems that have the same issue.



Re: [Cluster-devel] [GFS2][PATCH] - Userland expects quota limit/warn/usage in 512b blocks

2010-11-19 Thread Christoph Hellwig
On Thu, Nov 18, 2010 at 11:24:24AM -0500, Abhijith Das wrote:
> Userland programs using the quotactl() syscall assume limit/warn/usage block 
> counts in 512b basic blocks which were instead being read/written in fs 
> blocksize in gfs2. With this patch, gfs2 correctly interacts with the syscall 
> using 512b blocks.

Oops, thanks for fixing this up.


Reviewed-by: Christoph Hellwig 



Re: [Cluster-devel] [PATCH 03/10] locks: generic_delete_lease doesn't need a file_lock at all

2014-08-23 Thread Christoph Hellwig
On Sat, Aug 23, 2014 at 10:41:11AM -0400, Jeff Layton wrote:
> +static int generic_delete_lease(struct file *filp)
>  {
>   struct file_lock *fl, **before;
>   struct dentry *dentry = filp->f_path.dentry;
>   struct inode *inode = dentry->d_inode;
>  
> - trace_generic_delete_lease(inode, *flp);
> -

Can you keep the tracepoint and modify it to not need the file_lock
pointer?  It really helped me with some debugging lately.

Otherwise looks fine,

Reviewed-by: Christoph Hellwig 



Re: [Cluster-devel] [PATCH 06/10] locks: plumb an "aux" pointer into the setlease routines

2014-08-23 Thread Christoph Hellwig
On Sat, Aug 23, 2014 at 10:41:14AM -0400, Jeff Layton wrote:
> In later patches, we're going to add a new lock_manager_operation to
> finish setting up the lease while still holding the i_lock.  To do
> this, we'll need to pass a little bit of info in the fcntl setlease
> case (primarily an fasync structure). Plumb the extra pointer into
> there in advance of that.
> 
> We declare this pointer as a void ** to make it clear that this is
> auxiliary info, and that the caller isn't required to set this unless
> the lm_setup specifically requires it.

Can you just return -EEXIST if reusing an existing one and make it a
normal private pointer as we use elsewhere?

Btw, two things I came up with to make the lease filesystem API nicer:

 - bypass vfs_setlease/->setlease for deletes and just call directly
   into the lease code.  no one does anything special for it (except
   cifs, which forgets about it), and then rename the method to
   ->add_lease.
 - provide a no_add_lease for nfs/gfs to centralize the no-op version
   in a single place (similar to no_lseek).

Otherwise this looks really nice.



Re: [Cluster-devel] [PATCH 01/10] locks: close potential race in lease_get_mtime

2014-08-24 Thread Christoph Hellwig
On Sat, Aug 23, 2014 at 10:41:09AM -0400, Jeff Layton wrote:
> lease_get_mtime is called without the i_lock held, so there's no
> guarantee about the stability of the list. Between the time when we
> assign "flock" and then dereference it to check whether it's a lease
> and for write, the lease could be freed.
> 
> Ensure that that doesn't occur by taking the i_lock before trying
> to check the lease.

Looks good.  Also looks way cleaner than before by being just a tad more
verbose..

Reviewed-by: Christoph Hellwig 



Re: [Cluster-devel] [PATCH 02/10] nfsd: fix potential lease memory leak in nfs4_setlease

2014-08-24 Thread Christoph Hellwig
On Sat, Aug 23, 2014 at 10:41:10AM -0400, Jeff Layton wrote:
> It's unlikely to ever occur, but if there were already a lease set on
> the file then we could end up getting back a different pointer on a
> successful setlease attempt than the one we allocated. If that happens,
> the one we allocated could leak.
> 
> In practice, I don't think this will happen due to the fact that we only
> try to set up the lease once per nfs4_file, but this error handling is a
> bit more correct given the current lease API.
> 
> Cc: J. Bruce Fields 
> Signed-off-by: Jeff Layton 

Looks good,

Reviewed-by: Christoph Hellwig 



Re: [Cluster-devel] [PATCH 05/10] nfsd: don't keep a pointer to the lease in nfs4_file

2014-08-24 Thread Christoph Hellwig
On Sat, Aug 23, 2014 at 10:41:13AM -0400, Jeff Layton wrote:
> Now that we don't need to pass in an actual lease pointer to
> vfs_setlease on unlock, we can stop tracking a pointer to the lease in
> the nfs4_file.
> 
> Switch all of the places that check the fi_lease to check fi_deleg_file
> instead. We always set that at the same time so it will have the same
> semantics.
> 
> Signed-off-by: Jeff Layton 

Looks good,

Reviewed-by: Christoph Hellwig 



Re: [Cluster-devel] [PATCH 04/10] locks: clean up vfs_setlease kerneldoc comments

2014-08-24 Thread Christoph Hellwig
On Sat, Aug 23, 2014 at 10:41:12AM -0400, Jeff Layton wrote:
> Some of the latter paragraphs seem ambiguous and just plain wrong.
> In particular the break_lease comment makes no sense. We call
> break_lease (and break_deleg) from all sorts of vfs-layer functions,
> so there is clearly such a method.
> 
> Also, we are close to being able to allow for "real" filesystem
> setlease methods so remove the final comment about it not being a
> full implementation yet.

I'd remove even more:

> + *
> + * This will call the filesystem's setlease file method, if defined. Note 
> that
> + * there is no getlease method; instead, the filesystem setlease method 
> should
> + * call back to generic_setlease() to add a lease to the inode's lease list,
> + * where fcntl_getlease() can find it.  Since fcntl_getlease() only reports
> + * whether the current task holds a lease, a cluster filesystem need only do
> + * this for leases held by processes on this node.
>   */

If we'd ever want a full implementation I think we'd absolutely need
the getlease method.  But instead of hypothetizing about future
implementation I'd rather leave it to those actually implementing such
support, if that ever happens.



Re: [Cluster-devel] [PATCH 06/10] locks: plumb an "aux" pointer into the setlease routines

2014-08-24 Thread Christoph Hellwig
On Sun, Aug 24, 2014 at 06:08:01AM -0400, Jeff Layton wrote:
> > Can you just return -EEXIST if reusing an existing one and make it a
> > normal private pointer as we use elsewhere?
> > 
> 
> That sounds a little confusing...
> 
> We have two pointers we pass down to generic_setlease: the file_lock
> itself and with this patch, the "aux" pointer. We can end up using
> either, neither or both during a call to generic_setlease.
> 
> A simple error code can't properly indicate which of the two pointers
> got used. It might be clearer to turn the file_lock into a normal
> pointer and return -EEXIST if we reused it, but leave aux as a double
> pointer.

There is no way we could use a new file_lock but an existing
fasync_struct, as there won't be one on the newly allocated file_lock
structure, but otherwise you're right.

Just rename it to priv then and make me a little less grumpy ;-)



Re: [Cluster-devel] [PATCH 07/10] locks: define a lm_setup handler for leases

2014-08-24 Thread Christoph Hellwig
I like this change a lot!  But one caveat:

> + /*
> +  * Despite the fact that it's an int return function, __f_setown never
> +  * returns an error. Just ignore any error return here, but spew a
> +  * WARN_ON_ONCE in case this ever changes.
> +  */
> + WARN_ON_ONCE(__f_setown(filp, task_pid(current), PIDTYPE_PID, 0));

I don't think this is a good idea.  The errors in __f_setown come from
the security modules, and they could change easily.  If you can convince
the LSM people to change their file_set_fowner routine to return void
we could change __f_setown to return void as well and be done with it,
but without that this looks too dangerous.



Re: [Cluster-devel] [PATCH 08/10] locks: move i_lock acquisition into generic_*_lease handlers

2014-08-24 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 

Some comments on further work I'd like to see in this area, though:

> + spin_lock(&inode->i_lock);
> + time_out_leases(inode);
>   for (before = &inode->i_flock;
>   ((fl = *before) != NULL) && IS_LEASE(fl);
>   before = &fl->fl_next) {
>   if (fl->fl_file != filp)
>   continue;
> - return fl->fl_lmops->lm_change(before, F_UNLCK);
> + error = fl->fl_lmops->lm_change(before, F_UNLCK);
>   }

We really should split a lm_release from lm_change, the way it is
used is highly confusing.  In addition I think a lot of code
currently in lease_modify should move here instead, e.g. something like:


if (fl->fl_file != filp)
continue;

fl = *before;
fl->fl_type = F_UNLCK;
lease_clear_pending(fl, F_UNLCK);
locks_wake_up_blocks(fl);
if (fl->fl_lmops->lm_delete)
fl->fl_lmops->lm_delete(fl);
locks_delete_lock(before, NULL);

with lm_delete for user space leases as:

static void lease_delete(struct file_lock *fl)
{
struct file *filp = fl->fl_file;

f_delown(filp);
filp->f_owner.signum = 0;
fasync_helper(0, fl->fl_file, 0, &fl->fl_fasync);
if (fl->fl_fasync != NULL) {
printk(KERN_ERR "locks_delete_lock: fasync == %p\n",
fl->fl_fasync);
fl->fl_fasync = NULL;
}
}

and a NULL implementation for delegations.



Re: [Cluster-devel] [PATCH 09/10] locks: move freeing of leases outside of i_lock

2014-08-24 Thread Christoph Hellwig
On Sat, Aug 23, 2014 at 10:41:17AM -0400, Jeff Layton wrote:
> There was only one place where we still could free a file_lock while
> holding the i_lock -- lease_modify. Add a new list_head argument to the
> lm_change operation, pass in a private list when calling it, and fix
> those callers to dispose of the list once the lock has been dropped.

Why do we care about locks held when simply freeing a piece of memory?



Re: [Cluster-devel] [PATCH 00/10] locks/nfsd: internal lease API overhaul

2014-08-24 Thread Christoph Hellwig
One more wishlist item in addition to the one mentioned in the patches:

 - add a return value to lm_break so that the lock manager can tell the
   core code "you can delete this lease right now".  That gets rid of
   the games with the timeout which require all kinds of race avoidance
   code in the users.

And a big one that doesn't seem easily addressable, but I'll drop it in
anyway:

 - calling ->lm_break without i_lock would make everyones life a heck
   lot easier..



Re: [Cluster-devel] [PATCH 08/10] locks: move i_lock acquisition into generic_*_lease handlers

2014-08-24 Thread Christoph Hellwig
On Sun, Aug 24, 2014 at 09:06:34AM -0700, Christoph Hellwig wrote:
> We really should split a lm_release from lm_change, the way it is
> used is highly confusing.  In addition I think a lot of code
> currently in lease_modify should move here instead, e.g. something like:

At this point the old lm_change actually becomes superfluous if we
simply disallow changes for delegations.



Re: [Cluster-devel] [PATCH 07/10] locks: define a lm_setup handler for leases

2014-08-26 Thread Christoph Hellwig
On Sun, Aug 24, 2014 at 09:19:53PM -0400, Jeff Layton wrote:
> > I don't think this is a good idea.  The errors in __f_setown come from
> > the security modules, and they could change easily.  If you can convince
> > the LSM people to change their file_set_fowner routine to return void
> > we could change __f_setown to return void as well and be done with it,
> > but without that this looks too dangerous.
> > 
> 
> Understood. I figured that this might not be acceptable. I can make
> this propagate the error back up, but cleaning up the mess may not be
> easy. I'll see what I can do.

I'd say talk to the LSM people.  Right now they are only using it to
set up private data pointers, so they might agree on turning it into
a void return.  Or just be bold and do that work yourself, include it in
this series and just Cc them.



Re: [Cluster-devel] [PATCH 00/10] locks/nfsd: internal lease API overhaul

2014-08-26 Thread Christoph Hellwig
On Sun, Aug 24, 2014 at 09:43:01PM -0400, Jeff Layton wrote:
> >  - add a return value to lm_break so that the lock manager can tell the
> >core code "you can delete this lease right now".  That gets rid of
> >the games with the timeout which require all kinds of race avoidance
> >code in the users.
> > 
> 
> I'm not sure I understand what you're suggesting here. Isn't it just as
> simple to have lm_break call lease_modify to just remove the lease?

Unfortunately it's not.  lm_break gets a pointer to the file_lock,
and lease_modify needs a pointer to the list position of the file_lock.


