Re: [PATCH,RFC] nfsd: Make INET6 transport creation failure an informational message

2010-04-02 Thread Chuck Lever

Hi Tom-

On 04/01/2010 06:48 PM, Tom Tucker wrote:

Hi Bruce/Chuck,

RDMA transports are currently broken in 2.6.34 because they don't have a
V6ONLY setsockopt equivalent. So what happens is that when write_ports attempts to
create the PF_INET6 transport it fails because the port is already in
use. There is discussion on linux-rdma about how to fix this, but in the
interim and perhaps indefinitely, I propose the following:

Tom

nfsd: Make INET6 transport creation failure an informational message

The write_ports code will fail both the INET4 and INET6 transport
creation if the transport returns an error when PF_INET6 is
specified. Some transports that do not support INET6 return an error
other than EAFNOSUPPORT.


That's the real bug.  Any reason the RDMA RPC transport can't return 
EAFNOSUPPORT in this case?



We should allow communication on INET4 even if INET6 is not yet
supported or fails for some reason.


Yes, that's why EAFNOSUPPORT is ignored in __write_ports().  People 
complain when they see messages like this, even if the result is a 
working configuration.



Signed-off-by: Tom Tucker 
---

 fs/nfsd/nfsctl.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index 0f0e77f..934b624 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1008,8 +1008,10 @@ static ssize_t __write_ports_addxprt(char *buf)

err = svc_create_xprt(nfsd_serv, transport,
PF_INET6, port, SVC_SOCK_ANONYMOUS);
- if (err < 0 && err != -EAFNOSUPPORT)
- goto out_close;
+ if (err < 0)
+ printk(KERN_INFO "nfsd: Error creating PF_INET6 listener "
+ "for transport '%s'\n", transport);
+
return 0;
out_close:
xprt = svc_find_xprt(nfsd_serv, transport, PF_INET, port);




--
chuck[dot]lever[at]oracle[dot]com


Re: [PATCH,RFC] nfsd: Make INET6 transport creation failure an informational message

2010-04-02 Thread Chuck Lever

Hi Roland-

On 04/02/2010 01:22 PM, Roland Dreier wrote:

 > > The write_ports code will fail both the INET4 and INET6 transport
 > > creation if the transport returns an error when PF_INET6 is
 > > specified. Some transports that do not support INET6 return an
 > > error other than EAFNOSUPPORT.
 >
 > That's the real bug.  Any reason the RDMA RPC transport can't return
 > EAFNOSUPPORT in this case?

I think Tom's changelog is misleading.  The problem is that the RDMA
transport actually does support IPv6, but it doesn't support the
IPV6ONLY option yet.  So if NFS/RDMA binds to a port for IPv4, then the
IPv6 bind fails because of the port collision.
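
To illustrate the collision: a dual-stack service can bind both an
AF_INET and an AF_INET6 listener to the same port only if the IPv6
socket is made v6-only before bind(). A minimal user-space sketch;
the in-kernel socket transports do the equivalent during endpoint
setup:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Without IPV6_V6ONLY, an AF_INET6 wildcard socket also covers the
 * IPv4 side of the port, so it cannot coexist with a separate
 * AF_INET listener on that port: whichever bind() comes second
 * fails with EADDRINUSE. That is the collision described above. */
static int make_v6only_listener(unsigned short port)
{
	struct sockaddr_in6 sin6 = {
		.sin6_family	= AF_INET6,
		.sin6_addr	= IN6ADDR_ANY_INIT,
		.sin6_port	= htons(port),
	};
	int fd, on = 1;

	fd = socket(AF_INET6, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	/* Restrict this socket to IPv6 traffic so a separate AF_INET
	 * listener can bind the same port. */
	if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &on, sizeof(on)) < 0 ||
	    bind(fd, (struct sockaddr *)&sin6, sizeof(sin6)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}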


IPV6ONLY is a requirement for RPC over IPv6.  If the underlying 
transport does not support IPV6ONLY, then it cannot properly support RPC 
over IPv6.  It's easy enough to catch listener creation calls for IPv6 
on such transports, and simply return EAFNOSUPPORT until support for 
IPV6ONLY can be provided.


The __write_ports() interface is specifically designed to silently fall 
back to IPv4-only when IPv6 transport creation fails with EAFNOSUPPORT. 
I don't see a good reason to change the generic logic in 
__write_ports() if there is a problem with implementing RPC over IPv6 in 
a specific transport capability.  __write_ports() will do the right 
thing if the transport returns the correct error code.
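
For reference, the fallback the generic code already implements looks
like this, condensed from the unpatched __write_ports_addxprt() in the
diff quoted above:

	err = svc_create_xprt(nfsd_serv, transport,
				PF_INET6, port, SVC_SOCK_ANONYMOUS);
	/* -EAFNOSUPPORT means "this transport has no IPv6 support":
	 * fall back silently to the IPv4-only listener created just
	 * before this.  Any other error tears down the IPv4 listener
	 * as well. */
	if (err < 0 && err != -EAFNOSUPPORT)
		goto out_close;
	return 0;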



Implementing the IPV6ONLY option for RDMA binding is probably not
feasible for 2.6.34, so the best band-aid for now seems to be Tom's
patch.


My recent experience with similar changes suggests the specific solution 
Tom proposed will trigger extra bug reports and e-mails, as the change 
appears to affect non-RDMA transports as well.  This printk might fire, 
for example, for INET transports on systems that are built without IPv6 
support, or where ipv6.ko is blacklisted in user space.


In other words, I agree that there's a bug that should be addressed in 
2.6.34, and I don't have any problem with setting up only an IPv4 
listener in this case.  But I think the addition of a printk that fires 
for all transports in this case is problematic.


It would be better to address this in the RPC/RDMA transport capability, 
and not in generic upper level logic.  We already have correct behavior 
in __write_ports, and the RPC/RDMA transport capability should be 
changed to use it.


--
chuck[dot]lever[at]oracle[dot]com


Re: [PATCH] svcrdma: RDMA support not yet compatible with RPC6

2010-04-05 Thread Chuck Lever

On 04/03/2010 09:27 AM, Tom Tucker wrote:

RPC6 requires that it be possible to create endpoints that listen
exclusively for IPv4 or IPv6 connection requests. This is not currently
supported by the RDMA API.

Signed-off-by: Tom Tucker
Tested-by: Steve Wise


Reviewed-by: Chuck Lever 


---

 net/sunrpc/xprtrdma/svc_rdma_transport.c |    5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 3fa5751..4e6bbf9 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -678,7 +678,10 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
int ret;

dprintk("svcrdma: Creating RDMA socket\n");
-
+ if (sa->sa_family != AF_INET) {
+ dprintk("svcrdma: Address family %d is not supported.\n", sa->sa_family);
+ return ERR_PTR(-EAFNOSUPPORT);
+ }
cma_xprt = rdma_create_xprt(serv, 1);
if (!cma_xprt)
return ERR_PTR(-ENOMEM);




--
chuck[dot]lever[at]oracle[dot]com


Re: [PATCH] svcrdma: RDMA support not yet compatible with RPC6

2010-04-05 Thread Chuck Lever

On 04/05/2010 11:48 AM, J. Bruce Fields wrote:

On Mon, Apr 05, 2010 at 10:55:12AM -0400, Chuck Lever wrote:

On 04/03/2010 09:27 AM, Tom Tucker wrote:

RPC6 requires that it be possible to create endpoints that listen
exclusively for IPv4 or IPv6 connection requests. This is not currently
supported by the RDMA API.

Signed-off-by: Tom Tucker
Tested-by: Steve Wise


Reviewed-by: Chuck Lever


Thanks to all.  I take it the problem began with 37498292a "NFSD: Create
PF_INET6 listener in write_ports"?


I don't know exactly, but that would make sense.

--
chuck[dot]lever[at]oracle[dot]com


Re: NFS over RDMA in SLinux

2015-03-07 Thread Chuck Lever

On Mar 7, 2015, at 4:03 AM, Francisco Manuel Cardoso wrote:

> Hello Sagi,
> 
> This is about NFSoRDMA; NFS over IPoIB has no issues.
> 
> The main issue is that the simulation on the HPC cluster starts running
> "fine" and after a while I get loads of errors that the NFS server is not
> responding.
> 
> Server side is getting messages such as:
> 
> svcrdma: Error -107 posting RDMA_READ
> [ cut here ]
> WARNING: at net/sunrpc/xprtrdma/svc_rdma_transport.c:1158
> __svc_rdma_free+0x20a/0x230 [svcrdma]() (Tainted: PW
> ---   )
> Hardware name: ProLiant SL4540 Gen8 
> Modules linked in: xprtrdma svcrdma nfsd lockd nfs_acl auth_rpcgss sunrpc
> autofs4 8021q garp stp llc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad
> rdma_cm ib_cm iw_cm xfs exportfs iTCO_wdt iTCO_vendor_support ipmi_devintf
> power_meter acpi_ipmi ipmi_si ipmi_msghandler hpwdt hpilo igb i2c_algo_bit
> i2c_core ptp pps_core serio_raw sg lpc_ich mfd_core ioatdma dca shpchp ext4
> jbd2 mbcache sd_mod crc_t10dif hpvsa(P)(U) hpsa mlx4_ib ib_sa ib_mad ib_core
> ib_addr ipv6 mlx4_core dm_mirror dm_region_hash dm_log dm_mod [last
> unloaded: scsi_wait_scan]
> Pid: 51, comm: events/0 Tainted: PW  ---
> 2.6.32-504.8.1.el6.x86_64 #1
> Call Trace:
> [] ? warn_slowpath_common+0x87/0xc0
> [] ? warn_slowpath_null+0x1a/0x20
> [] ? __svc_rdma_free+0x20a/0x230 [svcrdma]
> [] ? __svc_rdma_free+0x0/0x230 [svcrdma]
> [] ? worker_thread+0x170/0x2a0
> [] ? autoremove_wake_function+0x0/0x40
> [] ? worker_thread+0x0/0x2a0
> [] ? kthread+0x9e/0xc0
> [] ? child_rip+0xa/0x20
> [] ? kthread+0x0/0xc0
> [] ? child_rip+0x0/0x20
> ---[ end trace 3ee821ba0f96711f ]---
> 
> And:
> 
> svcrdma: Error fast registering memory for xprt 880c6ae13800
> svcrdma: Error fast registering memory for xprt 8802e87a3000
> svcrdma: Error fast registering memory for xprt 880bfa496c00
> svcrdma: Error fast registering memory for xprt 8808ec717000
> svcrdma: Error fast registering memory for xprt 880b82577c00
> svcrdma: Error fast registering memory for xprt 880bfa496c00
> 
> I've searched high and low for solutions and went to the Red Hat KB, where
> I discovered all the articles regarding high workloads and the workarounds
> for, for example, the "svcrdma: Error fast registering memory for xprt
> 8802e87a3000" messages that should be fixed by a RH kernel erratum on
> RHEL 6.1, and the "sunrpc.rdma_memreg_strategy = 6" value change.
> 
> If anyone can provide some help or insight would be really great.

I was volunteered, but I don’t have much insight.

For issues with NFS, linux-...@vger.kernel.org is the place to ask for
advice and help.

For issues with RHEL and its derivatives (I’m assuming SL6 is Scientific
Linux and not SuSE SLES), the best course of action is to work with the
distributors, since their kernels do not match any mainline tree.

In this case the RHEL 6 kernel code base is very old by today’s
standards, and it pre-dates my direct involvement with NFS/RDMA.

I’ve never touched the RHEL 6 NFS/RDMA server implementation. My guess,
based on my experience with the current mainline server, is that it is
not production-ready. You should check the release notes to be sure it
is fully supported.

If the RH KBs do not help, please contact RH and use their support to
address the issue. Red Hat is the authority on that code.

My advice is if you are sticking with stock RHEL 6 kernels, you should
use NFS on IPoIB.

> Because I've seen from looking around that RDMA under high CPU loads is
> usually "troublesome".
> 
> Regards,
> 
> Francisco
> 
> -Original Message-
> From: Sagi Grimberg [mailto:sa...@dev.mellanox.co.il] 
> Sent: 07 March 2015 02:20
> To: francisco.card...@gmail.com; Chuck Lever
> Cc: linux-rdma@vger.kernel.org
> Subject: Re: NFS over RDMA in SLinux
> 
> On 3/5/2015 9:54 PM, Francisco Manuel Cardoso wrote:
>> Hello,
>> 
>> 
>> 
>> Sorry, newcomer to the group at the moment; a brief question I hope
>> someone can at least point me on.
>> 
>> Are there any considerations regarding NFS over RDMA on Linux SL6 ?
>> 
>> Question: I've been setting up/using an HPC cluster, and NFS over IPoIB
>> is fine, but as soon as I start dishing things out over RDMA, things go
>> crazy.
>> 
>> The typical setup is that each machine is able to handle at most 40
>> processes, all of them used for MPI. I seem to be having some
>> performance issues; if I scale down to 39 I get much better
>> performance, but it still crashes.
>> 
>> Anyone got any pointers ?
> 
> I'm not sure if you're asking about NFS over IPoIB or NFSoRDMA?
> 
> CC'ing Chuck, who is probably the best help you can get...
> 

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





[PATCH v1 00/16] NFS/RDMA patches proposed for 4.1

2015-03-13 Thread Chuck Lever
This is a series of client-side patches for NFS/RDMA. In preparation
for increasing the transport credit limit and maximum rsize/wsize,
I've re-factored the memory registration logic into separate files,
invoked via a method API.

Two optimizations appear in this series:

The old code pre-allocated 64 MRs for every RPC, and attached 64 MRs
to each RPC before posting it. The new code attaches just enough MRs
to handle each RPC. When no data payload chunk is needed, no MRs are
attached to the RPC. For modern HCAs, only one MR is needed for NFS
read or write data payloads.

The final patch in the series splits the rb_lock in two in order to
reduce lock contention.

The series is also available in the nfs-rdma-for-4.1 topic branch at

 git://linux-nfs.org/projects/cel/cel-2.6.git

---

Chuck Lever (16):
  xprtrdma: Display IPv6 addresses and port numbers correctly
  xprtrdma: Perform a full marshal on retransmit
  xprtrdma: Add vector of ops for each memory registration strategy
  xprtrdma: Add a "max_payload" op for each memreg mode
  xprtrdma: Add a "register_external" op for each memreg mode
  xprtrdma: Add a "deregister_external" op for each memreg mode
  xprtrdma: Add "init MRs" memreg op
  xprtrdma: Add "reset MRs" memreg op
  xprtrdma: Add "destroy MRs" memreg op
  xprtrdma: Add "open" memreg op
  xprtrdma: Handle non-SEND completions via a callout
  xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()
  xprtrdma: Acquire MRs in rpcrdma_register_external()
  xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy
  xprtrdma: Make rpcrdma_{un}map_one() into inline functions
  xprtrdma: Split rb_lock


 include/linux/sunrpc/xprtrdma.h|3 
 net/sunrpc/xprtrdma/Makefile   |3 
 net/sunrpc/xprtrdma/fmr_ops.c  |  270 +++
 net/sunrpc/xprtrdma/frwr_ops.c |  485 
 net/sunrpc/xprtrdma/physical_ops.c |  110 +
 net/sunrpc/xprtrdma/rpc_rdma.c |   82 ++-
 net/sunrpc/xprtrdma/transport.c|   65 ++-
 net/sunrpc/xprtrdma/verbs.c|  856 +++-
 net/sunrpc/xprtrdma/xprt_rdma.h|  108 -
 9 files changed, 1096 insertions(+), 886 deletions(-)
 create mode 100644 net/sunrpc/xprtrdma/fmr_ops.c
 create mode 100644 net/sunrpc/xprtrdma/frwr_ops.c
 create mode 100644 net/sunrpc/xprtrdma/physical_ops.c

--
Chuck Lever


[PATCH v1 02/16] xprtrdma: Perform a full marshal on retransmit

2015-03-13 Thread Chuck Lever
Commit 6ab59945f292 ("xprtrdma: Update rkeys after transport
reconnect") added logic in the ->send_request path to update the
chunk list when an RPC/RDMA request is retransmitted.

Note that rpc_xdr_encode() resets and re-encodes the entire RPC
send buffer for each retransmit of an RPC. The RPC send buffer
is not preserved from the previous transmission of an RPC.

Revert 6ab59945f292, and instead, just force each request to be
fully marshaled every time through ->send_request. This should
preserve the fix from 6ab59945f292, while also performing pullup
during retransmits.
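
The resulting ->send_request path is simpler. A condensed sketch,
assuming the surrounding transport.c code matches mainline of this
period (the hunk itself appears only in the diffstat below):

	static int
	xprt_rdma_send_request(struct rpc_task *task)
	{
		struct rpc_rqst *rqst = task->tk_rqstp;
		int rc;

		/* No retransmit special case any more: marshal the
		 * whole request every time through here. */
		rc = rpcrdma_marshal_req(rqst);
		if (rc < 0)
			return rc;

		/* ... DMA map and post the RDMA SEND as before ... */
		return 0;
	}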

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |   71 ++-
 net/sunrpc/xprtrdma/transport.c |5 +--
 net/sunrpc/xprtrdma/xprt_rdma.h |   10 -
 3 files changed, 34 insertions(+), 52 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 91ffde8..41456d9 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -53,6 +53,14 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+enum rpcrdma_chunktype {
+   rpcrdma_noch = 0,
+   rpcrdma_readch,
+   rpcrdma_areadch,
+   rpcrdma_writech,
+   rpcrdma_replych
+};
+
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
 static const char transfertypes[][12] = {
"pure inline",  /* no chunks */
@@ -284,28 +292,6 @@ out:
 }
 
 /*
- * Marshal chunks. This routine returns the header length
- * consumed by marshaling.
- *
- * Returns positive RPC/RDMA header size, or negative errno.
- */
-
-ssize_t
-rpcrdma_marshal_chunks(struct rpc_rqst *rqst, ssize_t result)
-{
-   struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
-   struct rpcrdma_msg *headerp = rdmab_to_msg(req->rl_rdmabuf);
-
-   if (req->rl_rtype != rpcrdma_noch)
-   result = rpcrdma_create_chunks(rqst, &rqst->rq_snd_buf,
-  headerp, req->rl_rtype);
-   else if (req->rl_wtype != rpcrdma_noch)
-   result = rpcrdma_create_chunks(rqst, &rqst->rq_rcv_buf,
-  headerp, req->rl_wtype);
-   return result;
-}
-
-/*
  * Copy write data inline.
  * This function is used for "small" requests. Data which is passed
  * to RPC via iovecs (or page list) is copied directly into the
@@ -397,6 +383,7 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
char *base;
size_t rpclen, padlen;
ssize_t hdrlen;
+   enum rpcrdma_chunktype rtype, wtype;
struct rpcrdma_msg *headerp;
 
/*
@@ -433,13 +420,13 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 * into pages; otherwise use reply chunks.
 */
if (rqst->rq_rcv_buf.buflen <= RPCRDMA_INLINE_READ_THRESHOLD(rqst))
-   req->rl_wtype = rpcrdma_noch;
+   wtype = rpcrdma_noch;
else if (rqst->rq_rcv_buf.page_len == 0)
-   req->rl_wtype = rpcrdma_replych;
+   wtype = rpcrdma_replych;
else if (rqst->rq_rcv_buf.flags & XDRBUF_READ)
-   req->rl_wtype = rpcrdma_writech;
+   wtype = rpcrdma_writech;
else
-   req->rl_wtype = rpcrdma_replych;
+   wtype = rpcrdma_replych;
 
/*
 * Chunks needed for arguments?
@@ -456,16 +443,16 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 * TBD check NFSv4 setacl
 */
if (rqst->rq_snd_buf.len <= RPCRDMA_INLINE_WRITE_THRESHOLD(rqst))
-   req->rl_rtype = rpcrdma_noch;
+   rtype = rpcrdma_noch;
else if (rqst->rq_snd_buf.page_len == 0)
-   req->rl_rtype = rpcrdma_areadch;
+   rtype = rpcrdma_areadch;
else
-   req->rl_rtype = rpcrdma_readch;
+   rtype = rpcrdma_readch;
 
/* The following simplification is not true forever */
-   if (req->rl_rtype != rpcrdma_noch && req->rl_wtype == rpcrdma_replych)
-   req->rl_wtype = rpcrdma_noch;
-   if (req->rl_rtype != rpcrdma_noch && req->rl_wtype != rpcrdma_noch) {
+   if (rtype != rpcrdma_noch && wtype == rpcrdma_replych)
+   wtype = rpcrdma_noch;
+   if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC:   %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
@@ -479,7 +466,7 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 * When padding is in use and applies to the transfer, insert
 * it and change the message type.
 */
-   if (req->rl_rtype == rpcrdma_noch) {
+   if (rtype == rpcrdma_noch) {
 
padlen = rpcrdma_inline_pullup(rqst,
RPCRDMA_INLINE_PAD_VALUE(rqst));
@@ -494,7 +481,7 @@ rpcrdma_ma

[PATCH v1 01/16] xprtrdma: Display IPv6 addresses and port numbers correctly

2015-03-13 Thread Chuck Lever
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/transport.c |   47 ---
 net/sunrpc/xprtrdma/verbs.c |   21 +++--
 2 files changed, 47 insertions(+), 21 deletions(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 2e192ba..26a62e7 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -157,12 +157,47 @@ static struct ctl_table sunrpc_table[] = {
 static struct rpc_xprt_ops xprt_rdma_procs;/* forward reference */
 
 static void
+xprt_rdma_format_addresses4(struct rpc_xprt *xprt, struct sockaddr *sap)
+{
+   struct sockaddr_in *sin = (struct sockaddr_in *)sap;
+   char buf[20];
+
+   snprintf(buf, sizeof(buf), "%08x", ntohl(sin->sin_addr.s_addr));
+   xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);
+
+   xprt->address_strings[RPC_DISPLAY_NETID] = "rdma";
+}
+
+static void
+xprt_rdma_format_addresses6(struct rpc_xprt *xprt, struct sockaddr *sap)
+{
+   struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)sap;
+   char buf[40];
+
+   snprintf(buf, sizeof(buf), "%pi6", &sin6->sin6_addr);
+   xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);
+
+   xprt->address_strings[RPC_DISPLAY_NETID] = "rdma6";
+}
+
+static void
 xprt_rdma_format_addresses(struct rpc_xprt *xprt)
 {
struct sockaddr *sap = (struct sockaddr *)
&rpcx_to_rdmad(xprt).addr;
-   struct sockaddr_in *sin = (struct sockaddr_in *)sap;
-   char buf[64];
+   char buf[128];
+
+   switch (sap->sa_family) {
+   case AF_INET:
+   xprt_rdma_format_addresses4(xprt, sap);
+   break;
+   case AF_INET6:
+   xprt_rdma_format_addresses6(xprt, sap);
+   break;
+   default:
+   pr_err("rpcrdma: Unrecognized address family\n");
+   return;
+   }
 
(void)rpc_ntop(sap, buf, sizeof(buf));
xprt->address_strings[RPC_DISPLAY_ADDR] = kstrdup(buf, GFP_KERNEL);
@@ -170,16 +205,10 @@ xprt_rdma_format_addresses(struct rpc_xprt *xprt)
snprintf(buf, sizeof(buf), "%u", rpc_get_port(sap));
xprt->address_strings[RPC_DISPLAY_PORT] = kstrdup(buf, GFP_KERNEL);
 
-   xprt->address_strings[RPC_DISPLAY_PROTO] = "rdma";
-
-   snprintf(buf, sizeof(buf), "%08x", ntohl(sin->sin_addr.s_addr));
-   xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);
-
snprintf(buf, sizeof(buf), "%4hx", rpc_get_port(sap));
xprt->address_strings[RPC_DISPLAY_HEX_PORT] = kstrdup(buf, GFP_KERNEL);
 
-   /* netid */
-   xprt->address_strings[RPC_DISPLAY_NETID] = "rdma";
+   xprt->address_strings[RPC_DISPLAY_PROTO] = "rdma";
 }
 
 static void
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 124676c..1aa55b7 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "xprt_rdma.h"
@@ -424,7 +425,7 @@ rpcrdma_conn_upcall(struct rdma_cm_id *id, struct rdma_cm_event *event)
struct rpcrdma_ia *ia = &xprt->rx_ia;
struct rpcrdma_ep *ep = &xprt->rx_ep;
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
-   struct sockaddr_in *addr = (struct sockaddr_in *) &ep->rep_remote_addr;
+   struct sockaddr *sap = (struct sockaddr *)&ep->rep_remote_addr;
 #endif
struct ib_qp_attr *attr = &ia->ri_qp_attr;
struct ib_qp_init_attr *iattr = &ia->ri_qp_init_attr;
@@ -480,9 +481,8 @@ connected:
wake_up_all(&ep->rep_connect_wait);
/*FALLTHROUGH*/
default:
-   dprintk("RPC:   %s: %pI4:%u (ep 0x%p): %s\n",
-   __func__, &addr->sin_addr.s_addr,
-   ntohs(addr->sin_port), ep,
+   dprintk("RPC:   %s: %pIS:%u (ep 0x%p): %s\n",
+   __func__, sap, rpc_get_port(sap), ep,
CONNECTION_MSG(event->event));
break;
}
@@ -491,19 +491,16 @@ connected:
if (connstate == 1) {
int ird = attr->max_dest_rd_atomic;
int tird = ep->rep_remote_cma.responder_resources;
-   printk(KERN_INFO "rpcrdma: connection to %pI4:%u "
-   "on %s, memreg %d slots %d ird %d%s\n",
-   &addr->sin_addr.s_addr,
-   ntohs(addr->sin_port),
+
+   pr_info("rpcrdma: connection to %pIS:%u on %s, memreg %d slots %d ird %d%s\n",
+   sap, rpc_get_port(sap),

[PATCH v1 05/16] xprtrdma: Add a "register_external" op for each memreg mode

2015-03-13 Thread Chuck Lever
There is very little common processing among the different external
memory registration functions. Have rpcrdma_create_chunks() call
the registration method directly. This removes a stack frame and a
switch statement from the external registration path.
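
The call site then collapses to one indirect call through the memreg
ops vector. A sketch, since the rpc_rdma.c hunk is cut off at the end
of this message:

	/* In rpcrdma_create_chunks(): registration goes straight
	 * through the ops vector instead of a switch in verbs.c. */
	n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs, writing);
	if (n <= 0)
		goto out;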

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   51 +++
 net/sunrpc/xprtrdma/frwr_ops.c |   88 ++
 net/sunrpc/xprtrdma/physical_ops.c |   17 
 net/sunrpc/xprtrdma/rpc_rdma.c |5 +
 net/sunrpc/xprtrdma/verbs.c|  172 +---
 net/sunrpc/xprtrdma/xprt_rdma.h|6 +
 6 files changed, 166 insertions(+), 173 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index eec2660..45fb646 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -29,7 +29,58 @@ fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * RPCRDMA_MAX_FMR_SGES);
 }
 
+/* Use the ib_map_phys_fmr() verb to register a memory region
+ * for remote access via RDMA READ or RDMA WRITE.
+ */
+static int
+fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
+  int nsegs, bool writing)
+{
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_mr_seg *seg1 = seg;
+   struct rpcrdma_mw *mw = seg1->rl_mw;
+   u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
+   int len, pageoff, i, rc;
+
+   pageoff = offset_in_page(seg1->mr_offset);
+   seg1->mr_offset -= pageoff; /* start of page */
+   seg1->mr_len += pageoff;
+   len = -pageoff;
+   if (nsegs > RPCRDMA_MAX_FMR_SGES)
+   nsegs = RPCRDMA_MAX_FMR_SGES;
+   for (i = 0; i < nsegs;) {
+   rpcrdma_map_one(ia, seg, writing);
+   physaddrs[i] = seg->mr_dma;
+   len += seg->mr_len;
+   ++seg;
+   ++i;
+   /* Check for holes */
+   if ((i < nsegs && offset_in_page(seg->mr_offset)) ||
+   offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
+   break;
+   }
+
+   rc = ib_map_phys_fmr(mw->r.fmr, physaddrs, i, seg1->mr_dma);
+   if (rc)
+   goto out_maperr;
+
+   seg1->mr_rkey = mw->r.fmr->rkey;
+   seg1->mr_base = seg1->mr_dma + pageoff;
+   seg1->mr_nsegs = i;
+   seg1->mr_len = len;
+   return i;
+
+out_maperr:
+   dprintk("RPC:   %s: ib_map_phys_fmr %u@0x%llx+%i (%d) status %i\n",
+   __func__, len, (unsigned long long)seg1->mr_dma,
+   pageoff, i, rc);
+   while (i--)
+   rpcrdma_unmap_one(ia, --seg);
+   return rc;
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
+   .ro_map = fmr_op_map,
.ro_maxpages= fmr_op_maxpages,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 73a5ac8..2b5ccb0 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -29,7 +29,95 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
 }
 
+/* Post a FAST_REG Work Request to register a memory region
+ * for remote access via RDMA READ or RDMA WRITE.
+ */
+static int
+frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
+   int nsegs, bool writing)
+{
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_mr_seg *seg1 = seg;
+   struct rpcrdma_mw *mw = seg1->rl_mw;
+   struct rpcrdma_frmr *frmr = &mw->r.frmr;
+   struct ib_mr *mr = frmr->fr_mr;
+   struct ib_send_wr fastreg_wr, *bad_wr;
+   u8 key;
+   int len, pageoff;
+   int i, rc;
+   int seg_len;
+   u64 pa;
+   int page_no;
+
+   pageoff = offset_in_page(seg1->mr_offset);
+   seg1->mr_offset -= pageoff; /* start of page */
+   seg1->mr_len += pageoff;
+   len = -pageoff;
+   if (nsegs > ia->ri_max_frmr_depth)
+   nsegs = ia->ri_max_frmr_depth;
+   for (page_no = i = 0; i < nsegs;) {
+   rpcrdma_map_one(ia, seg, writing);
+   pa = seg->mr_dma;
+   for (seg_len = seg->mr_len; seg_len > 0; seg_len -= PAGE_SIZE) {
+   frmr->fr_pgl->page_list[page_no++] = pa;
+   pa += PAGE_SIZE;
+   }
+   len += seg->mr_len;
+   ++seg;
+   ++i;
+   /* Check for holes */
+   if ((i < nsegs && offset_in_page(seg->mr_offset)) ||
+   offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
+   break;
+   }
+   dpri

[PATCH v1 06/16] xprtrdma: Add a "deregister_external" op for each memreg mode

2015-03-13 Thread Chuck Lever
There is very little common processing among the different external
memory deregistration functions.

In addition, instead of calling the deregistration function for each
segment, have one call release all segments for a request. This makes
the API a little asymmetrical, but a hair faster.
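
On the caller's side, one indirect call now releases everything. A
sketch of the transport.c change, with field names assumed from the
mainline code of the period (the hunk appears only in the diffstat
below):

	/* In xprt_rdma_free(): a single ->ro_unmap call releases all
	 * chunk segments still registered for this request. */
	r_xprt->rx_ia.ri_ops->ro_unmap(r_xprt, req, req->rl_nchunks);
	req->rl_nchunks = 0;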

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   37 
 net/sunrpc/xprtrdma/frwr_ops.c |   46 
 net/sunrpc/xprtrdma/physical_ops.c |   13 ++
 net/sunrpc/xprtrdma/rpc_rdma.c |7 +--
 net/sunrpc/xprtrdma/transport.c|8 +---
 net/sunrpc/xprtrdma/verbs.c|   81 
 net/sunrpc/xprtrdma/xprt_rdma.h|5 +-
 7 files changed, 103 insertions(+), 94 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 45fb646..9b983b4 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -20,6 +20,32 @@
 /* Maximum scatter/gather per FMR */
 #define RPCRDMA_MAX_FMR_SGES   (64)
 
+/* Use the ib_unmap_fmr() verb to prevent further remote
+ * access via RDMA READ or RDMA WRITE.
+ */
+static int
+__fmr_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_mr_seg *seg1 = seg;
+   int rc, nsegs = seg->mr_nsegs;
+   LIST_HEAD(l);
+
+   list_add(&seg1->rl_mw->r.fmr->list, &l);
+   rc = ib_unmap_fmr(&l);
+   read_lock(&ia->ri_qplock);
+   while (seg1->mr_nsegs--)
+   rpcrdma_unmap_one(ia, seg++);
+   read_unlock(&ia->ri_qplock);
+   if (rc)
+   goto out_err;
+   return nsegs;
+
+out_err:
+   dprintk("RPC:   %s: ib_unmap_fmr status %i\n", __func__, rc);
+   return nsegs;
+}
+
 /* FMR mode conveys up to 64 pages of payload per chunk segment.
  */
 static size_t
@@ -79,8 +105,19 @@ out_maperr:
return rc;
 }
 
+static void
+fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
+unsigned int count)
+{
+   unsigned int i;
+
+   for (i = 0; count--;)
+   i += __fmr_unmap(r_xprt, &req->rl_segments[i]);
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
+   .ro_unmap   = fmr_op_unmap,
.ro_maxpages= fmr_op_maxpages,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 2b5ccb0..05b5761 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -17,6 +17,41 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+/* Post a LOCAL_INV Work Request to prevent further remote access
+ * via RDMA READ or RDMA WRITE.
+ */
+static int
+__frwr_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+   struct rpcrdma_mr_seg *seg1 = seg;
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct ib_send_wr invalidate_wr, *bad_wr;
+   int rc, nsegs = seg->mr_nsegs;
+
+   seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
+
+   memset(&invalidate_wr, 0, sizeof(invalidate_wr));
+   invalidate_wr.wr_id = (unsigned long)(void *)seg1->rl_mw;
+   invalidate_wr.opcode = IB_WR_LOCAL_INV;
+   invalidate_wr.ex.invalidate_rkey = seg1->rl_mw->r.frmr.fr_mr->rkey;
+   DECR_CQCOUNT(&r_xprt->rx_ep);
+
+   read_lock(&ia->ri_qplock);
+   while (seg1->mr_nsegs--)
+   rpcrdma_unmap_one(ia, seg++);
+   rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
+   read_unlock(&ia->ri_qplock);
+   if (rc)
+   goto out_err;
+   return nsegs;
+
+out_err:
+   /* Force rpcrdma_buffer_get() to retry */
+   seg1->rl_mw->r.frmr.fr_state = FRMR_IS_STALE;
+   dprintk("RPC:   %s: ib_post_send status %i\n", __func__, rc);
+   return nsegs;
+}
+
 /* FRWR mode conveys a list of pages per chunk segment. The
  * maximum length of that list is the FRWR page list depth.
  */
@@ -116,8 +151,19 @@ out_err:
return rc;
 }
 
+static void
+frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
+ unsigned int count)
+{
+   unsigned int i;
+
+   for (i = 0; count--;)
+   i += __frwr_unmap(r_xprt, &req->rl_segments[i]);
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
+   .ro_unmap   = frwr_op_unmap,
.ro_maxpages= frwr_op_maxpages,
.ro_displayname = "frwr",
 };
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index 5a284ee..f2c15be 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ 

[PATCH v1 03/16] xprtrdma: Add vector of ops for each memory registration strategy

2015-03-13 Thread Chuck Lever
Instead of employing switch() statements, let's use the typical
Linux kernel idiom for handling behavioral variation: virtual
functions.

Define a vector of operations for each supported memory registration
mode, and add a source file for each mode.
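
By the end of this series the vector looks roughly like the sketch
below; this is reconstructed from the methods added in patches 3
through 9, since the xprt_rdma.h hunks are not all quoted here:

	/* Reconstructed sketch of the per-mode ops vector. Each
	 * registration mode (fmr, frwr, physical) supplies its own
	 * instance of these methods. */
	struct rpcrdma_memreg_ops {
		int		(*ro_map)(struct rpcrdma_xprt *,
					  struct rpcrdma_mr_seg *, int, bool);
		void		(*ro_unmap)(struct rpcrdma_xprt *,
					    struct rpcrdma_req *, unsigned int);
		size_t		(*ro_maxpages)(struct rpcrdma_xprt *);
		int		(*ro_init)(struct rpcrdma_xprt *);
		void		(*ro_reset)(struct rpcrdma_xprt *);
		void		(*ro_destroy)(struct rpcrdma_buffer *);
		const char	*ro_displayname;
	};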

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/Makefile   |3 ++-
 net/sunrpc/xprtrdma/fmr_ops.c  |   22 ++
 net/sunrpc/xprtrdma/frwr_ops.c |   22 ++
 net/sunrpc/xprtrdma/physical_ops.c |   24 
 net/sunrpc/xprtrdma/verbs.c|   11 +++
 net/sunrpc/xprtrdma/xprt_rdma.h|   12 
 6 files changed, 89 insertions(+), 5 deletions(-)
 create mode 100644 net/sunrpc/xprtrdma/fmr_ops.c
 create mode 100644 net/sunrpc/xprtrdma/frwr_ops.c
 create mode 100644 net/sunrpc/xprtrdma/physical_ops.c

diff --git a/net/sunrpc/xprtrdma/Makefile b/net/sunrpc/xprtrdma/Makefile
index da5136f..579f72b 100644
--- a/net/sunrpc/xprtrdma/Makefile
+++ b/net/sunrpc/xprtrdma/Makefile
@@ -1,6 +1,7 @@
 obj-$(CONFIG_SUNRPC_XPRT_RDMA_CLIENT) += xprtrdma.o
 
-xprtrdma-y := transport.o rpc_rdma.o verbs.o
+xprtrdma-y := transport.o rpc_rdma.o verbs.o \
+   fmr_ops.o frwr_ops.o physical_ops.o
 
 obj-$(CONFIG_SUNRPC_XPRT_RDMA_SERVER) += svcrdma.o
 
diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
new file mode 100644
index 000..ffb7d93
--- /dev/null
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -0,0 +1,22 @@
+/*
+ * Copyright (c) 2015 Oracle.  All rights reserved.
+ * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
+ */
+
+/* Lightweight memory registration using Fast Memory Regions (FMR).
+ * Referred to sometimes as MTHCAFMR mode.
+ *
+ * FMR uses synchronous memory registration and deregistration.
+ * FMR registration is known to be fast, but FMR deregistration
+ * can take tens of usecs to complete.
+ */
+
+#include "xprt_rdma.h"
+
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+# define RPCDBG_FACILITY   RPCDBG_TRANS
+#endif
+
+const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
+   .ro_displayname = "fmr",
+};
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
new file mode 100644
index 000..79173f9
--- /dev/null
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -0,0 +1,22 @@
+/*
+ * Copyright (c) 2015 Oracle.  All rights reserved.
+ * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
+ */
+
+/* Lightweight memory registration using Fast Registration Work
+ * Requests (FRWR). Also referred to sometimes as FRMR mode.
+ *
+ * FRWR features ordered asynchronous registration and deregistration
+ * of arbitrarily sized memory regions. This is the fastest and safest
+ * but most complex memory registration mode.
+ */
+
+#include "xprt_rdma.h"
+
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+# define RPCDBG_FACILITY   RPCDBG_TRANS
+#endif
+
+const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
+   .ro_displayname = "frwr",
+};
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
new file mode 100644
index 000..b0922ac
--- /dev/null
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -0,0 +1,24 @@
+/*
+ * Copyright (c) 2015 Oracle.  All rights reserved.
+ * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
+ */
+
+/* No-op chunk preparation. All client memory is pre-registered.
+ * Sometimes referred to as ALLPHYSICAL mode.
+ *
+ * Physical registration is simple because all client memory is
+ * pre-registered and never deregistered. This mode is good for
+ * adapter bring up, but is considered not safe: the server is
+ * trusted not to abuse its access to client memory not involved
+ * in RDMA I/O.
+ */
+
+#include "xprt_rdma.h"
+
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+# define RPCDBG_FACILITY   RPCDBG_TRANS
+#endif
+
+const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
+   .ro_displayname = "physical",
+};
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 1aa55b7..e4d9d9c 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -492,10 +492,10 @@ connected:
int ird = attr->max_dest_rd_atomic;
int tird = ep->rep_remote_cma.responder_resources;
 
-   pr_info("rpcrdma: connection to %pIS:%u on %s, memreg %d slots %d ird %d%s\n",
+   pr_info("rpcrdma: connection to %pIS:%u on %s, memreg '%s', %d credits, %d responders%s\n",
sap, rpc_get_port(sap),
ia->ri_id->device->name,
-   ia->ri_memreg_strategy,
+   ia->ri_ops->ro_displayname,
xprt->rx_buf.rb_max_requests,
ird, ird < 4 && ird < tird / 2 ? " (low!)"

[PATCH v1 04/16] xprtrdma: Add a "max_payload" op for each memreg mode

2015-03-13 Thread Chuck Lever
The max_payload computation is generalized to ensure that the
payload maximum is the lesser of RPCRDMA_MAX_DATA_SEGS and the number
of data segments that can be transmitted in an inline buffer.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   13 ++
 net/sunrpc/xprtrdma/frwr_ops.c |   13 ++
 net/sunrpc/xprtrdma/physical_ops.c |   10 +++
 net/sunrpc/xprtrdma/transport.c|5 +++-
 net/sunrpc/xprtrdma/verbs.c|   49 +++-
 net/sunrpc/xprtrdma/xprt_rdma.h|5 +++-
 6 files changed, 59 insertions(+), 36 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index ffb7d93..eec2660 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -17,6 +17,19 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+/* Maximum scatter/gather per FMR */
+#define RPCRDMA_MAX_FMR_SGES   (64)
+
+/* FMR mode conveys up to 64 pages of payload per chunk segment.
+ */
+static size_t
+fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
+{
+   return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
+rpcrdma_max_segments(r_xprt) * RPCRDMA_MAX_FMR_SGES);
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
+   .ro_maxpages= fmr_op_maxpages,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 79173f9..73a5ac8 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -17,6 +17,19 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+/* FRWR mode conveys a list of pages per chunk segment. The
+ * maximum length of that list is the FRWR page list depth.
+ */
+static size_t
+frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+
+   return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
+rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
+   .ro_maxpages= frwr_op_maxpages,
.ro_displayname = "frwr",
 };
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index b0922ac..28ade19 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -19,6 +19,16 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+/* PHYSICAL memory registration conveys one page per chunk segment.
+ */
+static size_t
+physical_op_maxpages(struct rpcrdma_xprt *r_xprt)
+{
+   return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
+rpcrdma_max_segments(r_xprt));
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
+   .ro_maxpages= physical_op_maxpages,
.ro_displayname = "physical",
 };
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 271d306..9a9da40 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -406,7 +406,10 @@ xprt_setup_rdma(struct xprt_create *args)
  xprt_rdma_connect_worker);
 
xprt_rdma_format_addresses(xprt);
-   xprt->max_payload = rpcrdma_max_payload(new_xprt);
+   xprt->max_payload = new_xprt->rx_ia.ri_ops->ro_maxpages(new_xprt);
+   if (xprt->max_payload == 0)
+   goto out4;
+   xprt->max_payload <<= PAGE_SHIFT;
dprintk("RPC:   %s: transport data payload maximum: %zu bytes\n",
__func__, xprt->max_payload);
 
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index e4d9d9c..837d4ea 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -2215,43 +2215,24 @@ rpcrdma_ep_post_recv(struct rpcrdma_ia *ia,
return rc;
 }
 
-/* Physical mapping means one Read/Write list entry per-page.
- * All list entries must fit within an inline buffer
- *
- * NB: The server must return a Write list for NFS READ,
- * which has the same constraint. Factor in the inline
- * rsize as well.
+/* How many chunk list items fit within our inline buffers?
  */
-static size_t
-rpcrdma_physical_max_payload(struct rpcrdma_xprt *r_xprt)
+unsigned int
+rpcrdma_max_segments(struct rpcrdma_xprt *r_xprt)
 {
struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
-   unsigned int inline_size, pages;
-
-   inline_size = min_t(unsigned int,
-   cdata->inline_wsize, cdata->inline_rsize);
-   inline_size -= RPCRDMA_HDRLEN_MIN;
-   pages = inline_size / sizeof(struct rpcrdma_segment);
-   return pages << PAGE_SHIFT;
-}
+   int bytes, segments;
 
-static size_t
-rpcrdma_mr_max_payload(struct rpcrdma_xprt *r_xprt)
-{
-   return RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT;
-}
-
-size_

[PATCH v1 08/16] xprtrdma: Add "reset MRs" memreg op

2015-03-13 Thread Chuck Lever
This method is invoked when a transport instance is about to be
reconnected. Each Memory Region object is reset to its initial
state.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   23 
 net/sunrpc/xprtrdma/frwr_ops.c |   46 
 net/sunrpc/xprtrdma/physical_ops.c |6 ++
 net/sunrpc/xprtrdma/verbs.c|  103 +---
 net/sunrpc/xprtrdma/xprt_rdma.h|1 
 5 files changed, 78 insertions(+), 101 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 1501db0..1ccb3de 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -156,10 +156,33 @@ fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
i += __fmr_unmap(r_xprt, &req->rl_segments[i]);
 }
 
+/* After a disconnect, unmap all FMRs.
+ *
+ * This is invoked only in the transport connect worker in order
+ * to serialize with rpcrdma_register_fmr_external().
+ */
+static void
+fmr_op_reset(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   struct rpcrdma_mw *r;
+   LIST_HEAD(list);
+   int rc;
+
+   list_for_each_entry(r, &buf->rb_all, mw_all)
+   list_add(&r->r.fmr->list, &list);
+
+   rc = ib_unmap_fmr(&list);
+   if (rc)
+   dprintk("RPC:   %s: ib_unmap_fmr failed %i\n",
+   __func__, rc);
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
.ro_unmap   = fmr_op_unmap,
.ro_maxpages= fmr_op_maxpages,
.ro_init= fmr_op_init,
+   .ro_reset   = fmr_op_reset,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 975372c..b4ce0e5 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -81,6 +81,18 @@ out_err:
return nsegs;
 }
 
+static void
+__frwr_release(struct rpcrdma_mw *r)
+{
+   int rc;
+
+   rc = ib_dereg_mr(r->r.frmr.fr_mr);
+   if (rc)
+   dprintk("RPC:   %s: ib_dereg_mr status %i\n",
+   __func__, rc);
+   ib_free_fast_reg_page_list(r->r.frmr.fr_pgl);
+}
+
 /* FRWR mode conveys a list of pages per chunk segment. The
  * maximum length of that list is the FRWR page list depth.
  */
@@ -226,10 +238,44 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
i += __frwr_unmap(r_xprt, &req->rl_segments[i]);
 }
 
+/* After a disconnect, a flushed FAST_REG_MR can leave an FRMR in
+ * an unusable state. Find FRMRs in this state and dereg / reg
+ * each.  FRMRs that are VALID and attached to an rpcrdma_req are
+ * also torn down.
+ *
+ * This gives all in-use FRMRs a fresh rkey and leaves them INVALID.
+ *
+ * This is invoked only in the transport connect worker in order
+ * to serialize with rpcrdma_register_frmr_external().
+ */
+static void
+frwr_op_reset(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   struct ib_device *device = r_xprt->rx_ia.ri_id->device;
+   unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
+   struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
+   struct rpcrdma_mw *r;
+   int rc;
+
+   list_for_each_entry(r, &buf->rb_all, mw_all) {
+   if (r->r.frmr.fr_state == FRMR_IS_INVALID)
+   continue;
+
+   __frwr_release(r);
+   rc = __frwr_init(r, pd, device, depth);
+   if (rc)
+   continue;
+
+   r->r.frmr.fr_state = FRMR_IS_INVALID;
+   }
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
.ro_unmap   = frwr_op_unmap,
.ro_maxpages= frwr_op_maxpages,
.ro_init= frwr_op_init,
+   .ro_reset   = frwr_op_reset,
.ro_displayname = "frwr",
 };
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index ae2b0bc..0afc691 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -62,10 +62,16 @@ physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
rpcrdma_unmap_one(&r_xprt->rx_ia, &req->rl_segments[i]);
 }
 
+static void
+physical_op_reset(struct rpcrdma_xprt *r_xprt)
+{
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
.ro_map = physical_op_map,
.ro_unmap   = physical_op_unmap,
   

[PATCH v1 14/16] xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy

2015-03-13 Thread Chuck Lever
Clean up: This field is no longer used.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/xprtrdma.h |3 ++-
 net/sunrpc/xprtrdma/verbs.c |3 ---
 net/sunrpc/xprtrdma/xprt_rdma.h |1 -
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/xprtrdma.h b/include/linux/sunrpc/xprtrdma.h
index 64a0a0a..10bc3cde 100644
--- a/include/linux/sunrpc/xprtrdma.h
+++ b/include/linux/sunrpc/xprtrdma.h
@@ -61,7 +61,8 @@
 
 #define RPCRDMA_INLINE_PAD_THRESH  (512)/* payload threshold to pad (bytes) */
 
-/* memory registration strategies */
+/* Memory registration strategies, by number.
+ * This is part of a kernel / user space API. Do not remove. */
 enum rpcrdma_memreg {
RPCRDMA_BOUNCEBUFFERS = 0,
RPCRDMA_REGISTER,
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 5b9c104..21b63fe 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -669,9 +669,6 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
dprintk("RPC:   %s: memory registration strategy is '%s'\n",
__func__, ia->ri_ops->ro_displayname);
 
-   /* Else will do memory reg/dereg for each chunk */
-   ia->ri_memreg_strategy = memreg;
-
rwlock_init(&ia->ri_qplock);
return 0;
 
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index d5aa5b4..b167e44 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -69,7 +69,6 @@ struct rpcrdma_ia {
int ri_have_dma_lkey;
struct completion   ri_done;
int ri_async_rc;
-   enum rpcrdma_memreg ri_memreg_strategy;
unsigned intri_max_frmr_depth;
struct ib_device_attr   ri_devattr;
struct ib_qp_attr   ri_qp_attr;



[PATCH v1 07/16] xprtrdma: Add "init MRs" memreg op

2015-03-13 Thread Chuck Lever
This method is used when setting up a new transport instance to
create a pool of Memory Region objects that will be used to register
memory during operation.

Memory Regions are not needed for "physical" registration, since
->prepare and ->release are basically no-ops for that mode.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   42 +++
 net/sunrpc/xprtrdma/frwr_ops.c |   66 +++
 net/sunrpc/xprtrdma/physical_ops.c |7 ++
 net/sunrpc/xprtrdma/verbs.c|  104 +---
 net/sunrpc/xprtrdma/xprt_rdma.h|1 
 5 files changed, 119 insertions(+), 101 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 9b983b4..1501db0 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -55,6 +55,47 @@ fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * RPCRDMA_MAX_FMR_SGES);
 }
 
+static int
+fmr_op_init(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   int mr_access_flags = IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ;
+   struct ib_fmr_attr fmr_attr = {
+   .max_pages  = RPCRDMA_MAX_FMR_SGES,
+   .max_maps   = 1,
+   .page_shift = PAGE_SHIFT
+   };
+   struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
+   struct rpcrdma_mw *r;
+   int i, rc;
+
+   INIT_LIST_HEAD(&buf->rb_mws);
+   INIT_LIST_HEAD(&buf->rb_all);
+
+   i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
+   dprintk("RPC:   %s: initializing %d FMRs\n", __func__, i);
+
+   while (i--) {
+   r = kzalloc(sizeof(*r), GFP_KERNEL);
+   if (!r)
+   return -ENOMEM;
+
+   r->r.fmr = ib_alloc_fmr(pd, mr_access_flags, &fmr_attr);
+   if (IS_ERR(r->r.fmr))
+   goto out_fmr_err;
+
+   list_add(&r->mw_list, &buf->rb_mws);
+   list_add(&r->mw_all, &buf->rb_all);
+   }
+   return 0;
+
+out_fmr_err:
+   rc = PTR_ERR(r->r.fmr);
+   dprintk("RPC:   %s: ib_alloc_fmr status %i\n", __func__, rc);
+   kfree(r);
+   return rc;
+}
+
 /* Use the ib_map_phys_fmr() verb to register a memory region
  * for remote access via RDMA READ or RDMA WRITE.
  */
@@ -119,5 +160,6 @@ const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
.ro_unmap   = fmr_op_unmap,
.ro_maxpages= fmr_op_maxpages,
+   .ro_init= fmr_op_init,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 05b5761..975372c 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -17,6 +17,35 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+static int
+__frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, struct ib_device *device,
+   unsigned int depth)
+{
+   struct rpcrdma_frmr *f = &r->r.frmr;
+   int rc;
+
+   f->fr_mr = ib_alloc_fast_reg_mr(pd, depth);
+   if (IS_ERR(f->fr_mr))
+   goto out_mr_err;
+   f->fr_pgl = ib_alloc_fast_reg_page_list(device, depth);
+   if (IS_ERR(f->fr_pgl))
+   goto out_list_err;
+   return 0;
+
+out_mr_err:
+   rc = PTR_ERR(f->fr_mr);
+   dprintk("RPC:   %s: ib_alloc_fast_reg_mr status %i\n",
+   __func__, rc);
+   return rc;
+
+out_list_err:
+   rc = PTR_ERR(f->fr_pgl);
+   dprintk("RPC:   %s: ib_alloc_fast_reg_page_list status %i\n",
+   __func__, rc);
+   ib_dereg_mr(f->fr_mr);
+   return rc;
+}
+
 /* Post a LOCAL_INV Work Request to prevent further remote access
  * via RDMA READ or RDMA WRITE.
  */
@@ -64,6 +93,42 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
 }
 
+static int
+frwr_op_init(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   struct ib_device *device = r_xprt->rx_ia.ri_id->device;
+   unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
+   struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
+   int i;
+
+   INIT_LIST_HEAD(&buf->rb_mws);
+   INIT_LIST_HEAD(&buf->rb_all);
+
+   i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
+   dprintk("RPC:   %s: initializing %d FRMRs\n", __func__, i);
+
+   while (i--) {
+   struct rpcrdma_mw *r;
+   int rc;
+
+   r = kzalloc(sizeof(*r), GFP_KERNEL);
+   if (!r)
+   retur

[PATCH v1 16/16] xprtrdma: Split rb_lock

2015-03-13 Thread Chuck Lever
/proc/lock_stat showed contention between rpcrdma_buffer_get/put
and the MR allocation functions during I/O intensive workloads.

Now that MRs are no longer allocated in rpcrdma_buffer_get(),
there's no reason the rb_mws list has to be managed using the
same lock as the send/receive buffers. Split that lock. The
new lock does not need to disable interrupts because buffer
get/put is never called in an interrupt context.

struct rpcrdma_buffer is re-arranged to ensure rb_mwlock and
rb_mws are always in a different cacheline than rb_lock and the
buffer pointers.
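
A rough sketch of the resulting layout; field placement is
reconstructed from the description above, since the xprt_rdma.h hunk
is not quoted in full in this message:

	struct rpcrdma_buffer {
		/* MW allocation state: hot during chunk registration,
		 * protected by rb_mwlock, kept apart from rb_lock. */
		spinlock_t		rb_mwlock;
		struct list_head	rb_mws;
		struct list_head	rb_all;

		/* Send/receive buffer state, protected by rb_lock,
		 * deliberately in a different cacheline than
		 * rb_mwlock and rb_mws. */
		spinlock_t		rb_lock;
		u32			rb_max_requests;
		/* ... buffer pointer arrays ... */
	};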

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c   |6 +++---
 net/sunrpc/xprtrdma/frwr_ops.c  |   28 
 net/sunrpc/xprtrdma/verbs.c |5 ++---
 net/sunrpc/xprtrdma/xprt_rdma.h |   16 +---
 4 files changed, 26 insertions(+), 29 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 1404f20..00d362d 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -96,6 +96,7 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
struct rpcrdma_mw *r;
int i, rc;
 
+   spin_lock_init(&buf->rb_mwlock);
INIT_LIST_HEAD(&buf->rb_mws);
INIT_LIST_HEAD(&buf->rb_all);
 
@@ -128,9 +129,8 @@ __fmr_get_mw(struct rpcrdma_xprt *r_xprt)
 {
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
struct rpcrdma_mw *mw = NULL;
-   unsigned long flags;
 
-   spin_lock_irqsave(&buf->rb_lock, flags);
+   spin_lock(&buf->rb_mwlock);
 
if (!list_empty(&buf->rb_mws)) {
mw = list_entry(buf->rb_mws.next,
@@ -140,7 +140,7 @@ __fmr_get_mw(struct rpcrdma_xprt *r_xprt)
pr_err("RPC:   %s: no MWs available\n", __func__);
}
 
-   spin_unlock_irqrestore(&buf->rb_lock, flags);
+   spin_unlock(&buf->rb_mwlock);
return mw;
 }
 
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 4494668..7973b94 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -196,6 +196,7 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
int i;
 
+   spin_lock_init(&buf->rb_mwlock);
INIT_LIST_HEAD(&buf->rb_mws);
INIT_LIST_HEAD(&buf->rb_all);
 
@@ -228,10 +229,9 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
  * Redo only the ib_post_send().
  */
 static void
-__retry_local_inv(struct rpcrdma_ia *ia, struct rpcrdma_mw *r)
+__retry_local_inv(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mw *r)
 {
-   struct rpcrdma_xprt *r_xprt =
-   container_of(ia, struct rpcrdma_xprt, rx_ia);
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct ib_send_wr invalidate_wr, *bad_wr;
int rc;
 
@@ -258,30 +258,27 @@ __retry_local_inv(struct rpcrdma_ia *ia, struct rpcrdma_mw *r)
dprintk("RPC:   %s: ib_post_send status %i\n",
__func__, rc);
}
+
+   rpcrdma_put_mw(r_xprt, r);
 }
 
 static void
-__retry_flushed_linv(struct rpcrdma_buffer *buf, struct list_head *stale)
+__retry_flushed_linv(struct rpcrdma_xprt *r_xprt, struct list_head *stale)
 {
-   struct rpcrdma_ia *ia = rdmab_to_ia(buf);
struct list_head *pos;
struct rpcrdma_mw *r;
-   unsigned long flags;
unsigned count;
 
count = 0;
list_for_each(pos, stale) {
r = list_entry(pos, struct rpcrdma_mw, mw_list);
-   __retry_local_inv(ia, r);
+   list_del_init(&r->mw_list);
+   __retry_local_inv(r_xprt, r);
++count;
}
 
-   pr_warn("RPC:   %s: adding %u MRs to rb_mws\n",
+   pr_warn("RPC:   %s: added %u MRs to rb_mws\n",
__func__, count);
-
-   spin_lock_irqsave(&buf->rb_lock, flags);
-   list_splice_tail(stale, &buf->rb_mws);
-   spin_unlock_irqrestore(&buf->rb_lock, flags);
 }
 
 static struct rpcrdma_mw *
@@ -289,11 +286,10 @@ __frwr_get_mw(struct rpcrdma_xprt *r_xprt)
 {
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
struct rpcrdma_mw *mw;
-   unsigned long flags;
LIST_HEAD(stale);
int count;
 
-   spin_lock_irqsave(&buf->rb_lock, flags);
+   spin_lock(&buf->rb_mwlock);
count = 0;
while (!list_empty(&buf->rb_mws)) {
mw = list_entry(buf->rb_mws.next, struct rpcrdma_mw, mw_list);
@@ -308,11 +304,11 @@ __frwr_get_mw(struct rpcrdma_xprt *r_xprt)
mw = NULL;
 
 out_unlock:
-   spin_unlock_irqrestore(&buf->rb_lock, flags);
+   spin_unlock(&buf->rb_mwlock);
if (!list_empty(&stale)) {
dprintk("RPC:   %s: found %d unsuitable MWs\n"

[PATCH v1 09/16] xprtrdma: Add "destroy MRs" memreg op

2015-03-13 Thread Chuck Lever
Memory Region objects associated with a transport instance are
destroyed before the instance is shutdown and destroyed.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   21 +++
 net/sunrpc/xprtrdma/frwr_ops.c |   17 
 net/sunrpc/xprtrdma/physical_ops.c |6 
 net/sunrpc/xprtrdma/verbs.c|   52 +---
 net/sunrpc/xprtrdma/xprt_rdma.h|1 +
 5 files changed, 46 insertions(+), 51 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 1ccb3de..3115e4b 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -178,11 +178,32 @@ fmr_op_reset(struct rpcrdma_xprt *r_xprt)
__func__, rc);
 }
 
+static void
+fmr_op_destroy(struct rpcrdma_buffer *buf)
+{
+   struct rpcrdma_mw *r;
+   int rc;
+
+   while (!list_empty(&buf->rb_all)) {
+   r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
+   list_del(&r->mw_all);
+   list_del(&r->mw_list);
+
+   rc = ib_dealloc_fmr(r->r.fmr);
+   if (rc)
+   dprintk("RPC:   %s: ib_dealloc_fmr failed %i\n",
+   __func__, rc);
+
+   kfree(r);
+   }
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
.ro_unmap   = fmr_op_unmap,
.ro_maxpages= fmr_op_maxpages,
.ro_init= fmr_op_init,
.ro_reset   = fmr_op_reset,
+   .ro_destroy = fmr_op_destroy,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index b4ce0e5..fc3a228 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -271,11 +271,28 @@ frwr_op_reset(struct rpcrdma_xprt *r_xprt)
}
 }
 
+static void
+frwr_op_destroy(struct rpcrdma_buffer *buf)
+{
+   struct rpcrdma_mw *r;
+
+   while (!list_empty(&buf->rb_all)) {
+   r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
+   list_del(&r->mw_all);
+   list_del(&r->mw_list);
+
+   __frwr_release(r);
+
+   kfree(r);
+   }
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
.ro_unmap   = frwr_op_unmap,
.ro_maxpages= frwr_op_maxpages,
.ro_init= frwr_op_init,
.ro_reset   = frwr_op_reset,
+   .ro_destroy = frwr_op_destroy,
.ro_displayname = "frwr",
 };
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index 0afc691..f8da8c4 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -67,11 +67,17 @@ physical_op_reset(struct rpcrdma_xprt *r_xprt)
 {
 }
 
+static void
+physical_op_destroy(struct rpcrdma_buffer *buf)
+{
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
.ro_map = physical_op_map,
.ro_unmap   = physical_op_unmap,
.ro_maxpages= physical_op_maxpages,
.ro_init= physical_op_init,
.ro_reset   = physical_op_reset,
+   .ro_destroy = physical_op_destroy,
.ro_displayname = "physical",
 };
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index e17d91a..dcbc736 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1198,47 +1198,6 @@ rpcrdma_destroy_req(struct rpcrdma_ia *ia, struct 
rpcrdma_req *req)
kfree(req);
 }
 
-static void
-rpcrdma_destroy_fmrs(struct rpcrdma_buffer *buf)
-{
-   struct rpcrdma_mw *r;
-   int rc;
-
-   while (!list_empty(&buf->rb_all)) {
-   r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
-   list_del(&r->mw_all);
-   list_del(&r->mw_list);
-
-   rc = ib_dealloc_fmr(r->r.fmr);
-   if (rc)
-   dprintk("RPC:   %s: ib_dealloc_fmr failed %i\n",
-   __func__, rc);
-
-   kfree(r);
-   }
-}
-
-static void
-rpcrdma_destroy_frmrs(struct rpcrdma_buffer *buf)
-{
-   struct rpcrdma_mw *r;
-   int rc;
-
-   while (!list_empty(&buf->rb_all)) {
-   r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
-   list_del(&r->mw_all);
-   list_del(&r->mw_list);

[PATCH v1 12/16] xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()

2015-03-13 Thread Chuck Lever
Acquiring 64 FMRs in rpcrdma_buffer_get() while holding the buffer
pool lock is expensive, and unnecessary because FMR mode can
transfer up to a 1MB payload using just a single ib_fmr.

Instead, acquire ib_fmrs one-at-a-time as chunks are registered, and
return them to rb_mws immediately during deregistration.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c   |   41 ---
 net/sunrpc/xprtrdma/verbs.c |   41 ++-
 net/sunrpc/xprtrdma/xprt_rdma.h |1 +
 3 files changed, 54 insertions(+), 29 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 96e6cd3..9c6c2e8 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -28,10 +28,11 @@ __fmr_unmap(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg)
 {
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mr_seg *seg1 = seg;
+   struct rpcrdma_mw *mw = seg1->rl_mw;
int rc, nsegs = seg->mr_nsegs;
LIST_HEAD(l);
 
-   list_add(&seg1->rl_mw->r.fmr->list, &l);
+   list_add(&mw->r.fmr->list, &l);
rc = ib_unmap_fmr(&l);
read_lock(&ia->ri_qplock);
while (seg1->mr_nsegs--)
@@ -39,11 +40,14 @@ __fmr_unmap(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg)
read_unlock(&ia->ri_qplock);
if (rc)
goto out_err;
+out:
+   seg1->rl_mw = NULL;
+   rpcrdma_put_mw(r_xprt, mw);
return nsegs;
 
 out_err:
dprintk("RPC:   %s: ib_unmap_fmr status %i\n", __func__, rc);
-   return nsegs;
+   goto out;
 }
 
 static int
@@ -117,6 +121,27 @@ out_fmr_err:
return rc;
 }
 
+static struct rpcrdma_mw *
+__fmr_get_mw(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   struct rpcrdma_mw *mw = NULL;
+   unsigned long flags;
+
+   spin_lock_irqsave(&buf->rb_lock, flags);
+
+   if (!list_empty(&buf->rb_mws)) {
+   mw = list_entry(buf->rb_mws.next,
+   struct rpcrdma_mw, mw_list);
+   list_del_init(&mw->mw_list);
+   } else {
+   pr_err("RPC:   %s: no MWs available\n", __func__);
+   }
+
+   spin_unlock_irqrestore(&buf->rb_lock, flags);
+   return mw;
+}
+
 /* Use the ib_map_phys_fmr() verb to register a memory region
  * for remote access via RDMA READ or RDMA WRITE.
  */
@@ -126,10 +151,18 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
 {
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mr_seg *seg1 = seg;
-   struct rpcrdma_mw *mw = seg1->rl_mw;
+   struct rpcrdma_mw *mw;
u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
int len, pageoff, i, rc;
 
+   mw = __fmr_get_mw(r_xprt);
+   if (!mw)
+   return -ENOMEM;
+   if (seg1->rl_mw) {
+   rpcrdma_put_mw(r_xprt, seg1->rl_mw);
+   seg1->rl_mw = NULL;
+   }
+
pageoff = offset_in_page(seg1->mr_offset);
seg1->mr_offset -= pageoff; /* start of page */
seg1->mr_len += pageoff;
@@ -152,6 +185,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
if (rc)
goto out_maperr;
 
+   seg1->rl_mw = mw;
seg1->mr_rkey = mw->r.fmr->rkey;
seg1->mr_base = seg1->mr_dma + pageoff;
seg1->mr_nsegs = i;
@@ -164,6 +198,7 @@ out_maperr:
pageoff, i, rc);
while (i--)
rpcrdma_unmap_one(ia, --seg);
+   rpcrdma_put_mw(r_xprt, mw);
return rc;
 }
 
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index f108a57..f2316d8 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1171,6 +1171,21 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
kfree(buf->rb_pool);
 }
 
+void
+rpcrdma_put_mw(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mw *mw)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   unsigned long flags;
+
+   if (!list_empty(&mw->mw_list))
+   pr_warn("RPC:   %s: mw %p still on a list!\n",
+   __func__, mw);
+
+   spin_lock_irqsave(&buf->rb_lock, flags);
+   list_add_tail(&mw->mw_list, &buf->rb_mws);
+   spin_unlock_irqrestore(&buf->rb_lock, flags);
+}
+
 /* "*mw" can be NULL when rpcrdma_buffer_get_mrs() fails, leaving
  * some req segments uninitialized.
  */
@@ -1292,28 +1307,6 @@ rpcrdma_buffer_get_frmrs(struct rpcrdma_req *req, struct 
rpcrdma_buffer *buf,
return NULL;
 }
 
-static struct rpcrdma_req *
-rpcrdma_buffer_get_fmrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf)
-{
-   struct rpcrdm

[PATCH v1 11/16] xprtrdma: Handle non-SEND completions via a callout

2015-03-13 Thread Chuck Lever
Allow each memory registration mode to plug in a callout that handles
the completion of a memory registration operation.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/frwr_ops.c  |   17 +
 net/sunrpc/xprtrdma/verbs.c |   14 +-
 net/sunrpc/xprtrdma/xprt_rdma.h |5 +
 3 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 9bb4b2d..6e6d8ba 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -164,6 +164,22 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
 }
 
+/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs to be reset. */
+static void
+frwr_sendcompletion(struct ib_wc *wc)
+{
+   struct rpcrdma_mw *r;
+
+   if (likely(wc->status == IB_WC_SUCCESS))
+   return;
+
+   /* WARNING: Only wr_id and status are reliable at this point */
+   r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+   dprintk("RPC:   %s: frmr %p (stale), status %d\n",
+   __func__, r, wc->status);
+   r->r.frmr.fr_state = FRMR_IS_STALE;
+}
+
 static int
 frwr_op_init(struct rpcrdma_xprt *r_xprt)
 {
@@ -195,6 +211,7 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
 
list_add(&r->mw_list, &buf->rb_mws);
list_add(&r->mw_all, &buf->rb_all);
+   r->mw_sendcompletion = frwr_sendcompletion;
}
 
return 0;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 17b2a29..f108a57 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -204,21 +204,17 @@ static const char * const wc_status[] = {
 static void
 rpcrdma_sendcq_process_wc(struct ib_wc *wc)
 {
-   if (likely(wc->status == IB_WC_SUCCESS))
-   return;
-
/* WARNING: Only wr_id and status are reliable at this point */
-   if (wc->wr_id == 0ULL) {
-   if (wc->status != IB_WC_WR_FLUSH_ERR)
+   if (wc->wr_id == RPCRDMA_IGNORE_COMPLETION) {
+   if (wc->status != IB_WC_SUCCESS &&
+   wc->status != IB_WC_WR_FLUSH_ERR)
pr_err("RPC:   %s: SEND: %s\n",
   __func__, COMPLETION_MSG(wc->status));
} else {
struct rpcrdma_mw *r;
 
r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
-   r->r.frmr.fr_state = FRMR_IS_STALE;
-   pr_err("RPC:   %s: frmr %p (stale): %s\n",
-  __func__, r, COMPLETION_MSG(wc->status));
+   r->mw_sendcompletion(wc);
}
 }
 
@@ -1616,7 +1612,7 @@ rpcrdma_ep_post(struct rpcrdma_ia *ia,
}
 
send_wr.next = NULL;
-   send_wr.wr_id = 0ULL;   /* no send cookie */
+   send_wr.wr_id = RPCRDMA_IGNORE_COMPLETION;
send_wr.sg_list = req->rl_send_iov;
send_wr.num_sge = req->rl_niovs;
send_wr.opcode = IB_WR_SEND;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index a53a564..40a0ee8 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -106,6 +106,10 @@ struct rpcrdma_ep {
 #define INIT_CQCOUNT(ep) atomic_set(&(ep)->rep_cqcount, (ep)->rep_cqinit)
 #define DECR_CQCOUNT(ep) atomic_sub_return(1, &(ep)->rep_cqcount)
 
+/* Force completion handler to ignore the signal
+ */
+#define RPCRDMA_IGNORE_COMPLETION  (0ULL)
+
 /* Registered buffer -- registered kmalloc'd memory for RDMA SEND/RECV
  *
  * The below structure appears at the front of a large region of kmalloc'd
@@ -206,6 +210,7 @@ struct rpcrdma_mw {
struct ib_fmr   *fmr;
struct rpcrdma_frmr frmr;
} r;
+   void (*mw_sendcompletion)(struct ib_wc *);
	struct list_head mw_list;
	struct list_head mw_all;
 };

[PATCH v1 13/16] xprtrdma: Acquire MRs in rpcrdma_register_external()

2015-03-13 Thread Chuck Lever
Acquiring 64 MRs in rpcrdma_buffer_get() while holding the buffer
pool lock is expensive, and unnecessary because most modern adapters
can transfer 100s of KBs of payload using just a single MR.

Instead, acquire MRs one-at-a-time as chunks are registered, and
return them to rb_mws immediately during deregistration.

Note: commit 539431a437d2 ("xprtrdma: Don't invalidate FRMRs if
registration fails") is reverted: There is now a valid case where
registration can fail (with -ENOMEM) but the QP is still in RTS.
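
Roughly, the failure path that motivates the revert looks like this
(a sketch; __frwr_get_mw() appears in the diff below):

	mw = __frwr_get_mw(r_xprt);
	if (!mw)
		return -ENOMEM;	/* no WR was posted and the QP is still
				 * in RTS, so there is nothing for the
				 * caller to invalidate */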

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/frwr_ops.c |  126 ---
 net/sunrpc/xprtrdma/rpc_rdma.c |3 -
 net/sunrpc/xprtrdma/verbs.c|  130 
 3 files changed, 120 insertions(+), 139 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 6e6d8ba..d23e064 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -54,15 +54,16 @@ __frwr_unmap(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg)
 {
struct rpcrdma_mr_seg *seg1 = seg;
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_mw *mw = seg1->rl_mw;
struct ib_send_wr invalidate_wr, *bad_wr;
int rc, nsegs = seg->mr_nsegs;
 
-   seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
+   mw->r.frmr.fr_state = FRMR_IS_INVALID;
 
memset(&invalidate_wr, 0, sizeof(invalidate_wr));
-   invalidate_wr.wr_id = (unsigned long)(void *)seg1->rl_mw;
+   invalidate_wr.wr_id = (unsigned long)(void *)mw;
invalidate_wr.opcode = IB_WR_LOCAL_INV;
-   invalidate_wr.ex.invalidate_rkey = seg1->rl_mw->r.frmr.fr_mr->rkey;
+   invalidate_wr.ex.invalidate_rkey = mw->r.frmr.fr_mr->rkey;
DECR_CQCOUNT(&r_xprt->rx_ep);
 
read_lock(&ia->ri_qplock);
@@ -72,13 +73,17 @@ __frwr_unmap(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg)
read_unlock(&ia->ri_qplock);
if (rc)
goto out_err;
+
+out:
+   seg1->rl_mw = NULL;
+   rpcrdma_put_mw(r_xprt, mw);
return nsegs;
 
 out_err:
/* Force rpcrdma_buffer_get() to retry */
-   seg1->rl_mw->r.frmr.fr_state = FRMR_IS_STALE;
+   mw->r.frmr.fr_state = FRMR_IS_STALE;
dprintk("RPC:   %s: ib_post_send status %i\n", __func__, rc);
-   return nsegs;
+   goto out;
 }
 
 static void
@@ -217,6 +222,99 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
return 0;
 }
 
+/* rpcrdma_unmap_one() was already done by rpcrdma_frwr_releasesge().
+ * Redo only the ib_post_send().
+ */
+static void
+__retry_local_inv(struct rpcrdma_ia *ia, struct rpcrdma_mw *r)
+{
+   struct rpcrdma_xprt *r_xprt =
+   container_of(ia, struct rpcrdma_xprt, rx_ia);
+   struct ib_send_wr invalidate_wr, *bad_wr;
+   int rc;
+
+   pr_warn("RPC:   %s: FRMR %p is stale\n", __func__, r);
+
+   /* When this FRMR is re-inserted into rb_mws, it is no longer stale */
+   r->r.frmr.fr_state = FRMR_IS_INVALID;
+
+   memset(&invalidate_wr, 0, sizeof(invalidate_wr));
+   invalidate_wr.wr_id = (unsigned long)(void *)r;
+   invalidate_wr.opcode = IB_WR_LOCAL_INV;
+   invalidate_wr.ex.invalidate_rkey = r->r.frmr.fr_mr->rkey;
+   DECR_CQCOUNT(&r_xprt->rx_ep);
+
+   pr_warn("RPC:   %s: frmr %p invalidating rkey %08x\n",
+   __func__, r, r->r.frmr.fr_mr->rkey);
+
+   read_lock(&ia->ri_qplock);
+   rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
+   read_unlock(&ia->ri_qplock);
+   if (rc) {
+   /* Force __frwr_get_mw() to retry */
+   r->r.frmr.fr_state = FRMR_IS_STALE;
+   dprintk("RPC:   %s: ib_post_send status %i\n",
+   __func__, rc);
+   }
+}
+
+static void
+__retry_flushed_linv(struct rpcrdma_buffer *buf, struct list_head *stale)
+{
+   struct rpcrdma_ia *ia = rdmab_to_ia(buf);
+   struct list_head *pos;
+   struct rpcrdma_mw *r;
+   unsigned long flags;
+   unsigned count;
+
+   count = 0;
+   list_for_each(pos, stale) {
+   r = list_entry(pos, struct rpcrdma_mw, mw_list);
+   __retry_local_inv(ia, r);
+   ++count;
+   }
+
+   pr_warn("RPC:   %s: adding %u MRs to rb_mws\n",
+   __func__, count);
+
+   spin_lock_irqsave(&buf->rb_lock, flags);
+   list_splice_tail(stale, &buf->rb_mws);
+   spin_unlock_irqrestore(&buf->rb_lock, flags);
+}
+
+static struct rpcrdma_mw *
+__frwr_get_mw(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   struct rpcrdma_mw *mw;
+   unsigned long flags;
+   

[PATCH v1 10/16] xprtrdma: Add "open" memreg op

2015-03-13 Thread Chuck Lever
The open op determines the size of various transport data structures
based on device capabilities and memory registration mode.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   22 +
 net/sunrpc/xprtrdma/frwr_ops.c |   60 
 net/sunrpc/xprtrdma/physical_ops.c |   22 +
 net/sunrpc/xprtrdma/verbs.c|   54 ++--
 net/sunrpc/xprtrdma/xprt_rdma.h|3 ++
 5 files changed, 110 insertions(+), 51 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 3115e4b..96e6cd3 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -46,6 +46,27 @@ out_err:
return nsegs;
 }
 
+static int
+fmr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
+   struct rpcrdma_create_data_internal *cdata)
+{
+   struct ib_device_attr *devattr = &ia->ri_devattr;
+   unsigned int wrs, max_wrs;
+
+   max_wrs = devattr->max_qp_wr;
+   if (cdata->max_requests > max_wrs)
+   cdata->max_requests = max_wrs;
+
+   wrs = cdata->max_requests;
+   ep->rep_attr.cap.max_send_wr = wrs;
+   ep->rep_attr.cap.max_recv_wr = wrs;
+
+   dprintk("RPC:   %s: pre-allocating %u send WRs, %u recv WRs\n",
+   __func__, ep->rep_attr.cap.max_send_wr,
+   ep->rep_attr.cap.max_recv_wr);
+   return 0;
+}
+
 /* FMR mode conveys up to 64 pages of payload per chunk segment.
  */
 static size_t
@@ -201,6 +222,7 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
.ro_unmap   = fmr_op_unmap,
+   .ro_open= fmr_op_open,
.ro_maxpages= fmr_op_maxpages,
.ro_init= fmr_op_init,
.ro_reset   = fmr_op_reset,
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index fc3a228..9bb4b2d 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -93,6 +93,65 @@ __frwr_release(struct rpcrdma_mw *r)
ib_free_fast_reg_page_list(r->r.frmr.fr_pgl);
 }
 
+static int
+frwr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
+struct rpcrdma_create_data_internal *cdata)
+{
+   struct ib_device_attr *devattr = &ia->ri_devattr;
+   unsigned int wrs, max_wrs;
+   int depth = 7;
+
+   max_wrs = devattr->max_qp_wr;
+   if (cdata->max_requests > max_wrs)
+   cdata->max_requests = max_wrs;
+
+   wrs = cdata->max_requests;
+   ep->rep_attr.cap.max_send_wr = wrs;
+   ep->rep_attr.cap.max_recv_wr = wrs;
+
+   ia->ri_max_frmr_depth =
+   min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
+ devattr->max_fast_reg_page_list_len);
+   dprintk("RPC:   %s: device's max FR page list len = %u\n",
+   __func__, ia->ri_max_frmr_depth);
+
+   /* Add room for frmr register and invalidate WRs.
+* 1. FRMR reg WR for head
+* 2. FRMR invalidate WR for head
+* 3. N FRMR reg WRs for pagelist
+* 4. N FRMR invalidate WRs for pagelist
+* 5. FRMR reg WR for tail
+* 6. FRMR invalidate WR for tail
+* 7. The RDMA_SEND WR
+*/
+
+   /* Calculate N if the device max FRMR depth is smaller than
+* RPCRDMA_MAX_DATA_SEGS.
+*/
+   if (ia->ri_max_frmr_depth < RPCRDMA_MAX_DATA_SEGS) {
+   int delta = RPCRDMA_MAX_DATA_SEGS - ia->ri_max_frmr_depth;
+
+   do {
+   depth += 2; /* FRMR reg + invalidate */
+   delta -= ia->ri_max_frmr_depth;
+   } while (delta > 0);
+   }
+
+   ep->rep_attr.cap.max_send_wr *= depth;
+   if (ep->rep_attr.cap.max_send_wr > max_wrs) {
+   cdata->max_requests = max_wrs / depth;
+   if (!cdata->max_requests)
+   return -EINVAL;
+   ep->rep_attr.cap.max_send_wr = cdata->max_requests *
+  depth;
+   }
+
+   dprintk("RPC:   %s: pre-allocating %u send WRs, %u recv WRs\n",
+   __func__, ep->rep_attr.cap.max_send_wr,
+   ep->rep_attr.cap.max_recv_wr);
+   return 0;
+}
+
 /* FRWR mode conveys a list of pages per chunk segment. The
  * maximum length of that list is the FRWR page list depth.
  */
@@ -290,6 +349,7 @@ frwr_op_destroy(struct rpcrdma_buffer *buf)
 const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
.ro_unmap   = frwr_op_unmap,
+   .ro_open= frwr_op_open,
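
As a worked example of the WR budget above (assumed numbers, not taken
from any particular adapter): suppose RPCRDMA_MAX_DATA_SEGS is 256 and
the device reports max_fast_reg_page_list_len = 64, so
ri_max_frmr_depth is 64. Then:

	delta = 256 - 64 = 192
	depth = 7 + 2 + 2 + 2 = 13	/* three extra reg/invalidate pairs */

With cdata->max_requests = 32 and max_qp_wr = 256, 32 * 13 = 416
exceeds the device limit, so max_requests is clamped to 256 / 13 = 19
and max_send_wr becomes 19 * 13 = 247.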

[PATCH v1 15/16] xprtrdma: Make rpcrdma_{un}map_one() into inline functions

2015-03-13 Thread Chuck Lever
These functions are called in a loop for each page transferred via
RDMA READ or WRITE. Extract loop invariants and inline them to
reduce CPU overhead.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   10 ++--
 net/sunrpc/xprtrdma/frwr_ops.c |   10 ++--
 net/sunrpc/xprtrdma/physical_ops.c |   11 ++---
 net/sunrpc/xprtrdma/verbs.c|   44 ++-
 net/sunrpc/xprtrdma/xprt_rdma.h|   45 ++--
 5 files changed, 73 insertions(+), 47 deletions(-)
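
For context, the moved helpers might look something like this (a sketch
only: the xprt_rdma.h hunk is truncated below, and the mr_dir/mr_dmalen
field names are assumed). The ib_device pointer and the DMA direction
become arguments so both can be computed once, outside the per-page loop:

	static inline enum dma_data_direction
	rpcrdma_data_dir(bool writing)
	{
		/* assumption: "writing" means the server will RDMA WRITE
		 * into this memory, so map it for device-to-CPU DMA */
		return writing ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
	}

	static inline void
	rpcrdma_map_one(struct ib_device *device, struct rpcrdma_mr_seg *seg,
			enum dma_data_direction direction)
	{
		seg->mr_dir = direction;
		seg->mr_dmalen = seg->mr_len;
		seg->mr_dma = ib_dma_map_single(device, seg->mr_offset,
						seg->mr_dmalen, seg->mr_dir);
	}

	static inline void
	rpcrdma_unmap_one(struct ib_device *device, struct rpcrdma_mr_seg *seg)
	{
		ib_dma_unmap_single(device, seg->mr_dma,
				    seg->mr_dmalen, seg->mr_dir);
	}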

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 9c6c2e8..1404f20 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -29,14 +29,16 @@ __fmr_unmap(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mr_seg *seg1 = seg;
struct rpcrdma_mw *mw = seg1->rl_mw;
+   struct ib_device *device;
int rc, nsegs = seg->mr_nsegs;
LIST_HEAD(l);
 
list_add(&mw->r.fmr->list, &l);
rc = ib_unmap_fmr(&l);
read_lock(&ia->ri_qplock);
+   device = ia->ri_id->device;
while (seg1->mr_nsegs--)
-   rpcrdma_unmap_one(ia, seg++);
+   rpcrdma_unmap_one(device, seg++);
read_unlock(&ia->ri_qplock);
if (rc)
goto out_err;
@@ -150,6 +152,8 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
   int nsegs, bool writing)
 {
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct ib_device *device = ia->ri_id->device;
+   enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
struct rpcrdma_mw *mw;
u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
@@ -170,7 +174,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
if (nsegs > RPCRDMA_MAX_FMR_SGES)
nsegs = RPCRDMA_MAX_FMR_SGES;
for (i = 0; i < nsegs;) {
-   rpcrdma_map_one(ia, seg, writing);
+   rpcrdma_map_one(device, seg, direction);
physaddrs[i] = seg->mr_dma;
len += seg->mr_len;
++seg;
@@ -197,7 +201,7 @@ out_maperr:
__func__, len, (unsigned long long)seg1->mr_dma,
pageoff, i, rc);
while (i--)
-   rpcrdma_unmap_one(ia, --seg);
+   rpcrdma_unmap_one(device, --seg);
rpcrdma_put_mw(r_xprt, mw);
return rc;
 }
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index d23e064..4494668 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -55,6 +55,7 @@ __frwr_unmap(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg)
struct rpcrdma_mr_seg *seg1 = seg;
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mw *mw = seg1->rl_mw;
+   struct ib_device *device;
struct ib_send_wr invalidate_wr, *bad_wr;
int rc, nsegs = seg->mr_nsegs;
 
@@ -67,8 +68,9 @@ __frwr_unmap(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg)
DECR_CQCOUNT(&r_xprt->rx_ep);
 
read_lock(&ia->ri_qplock);
+   device = ia->ri_id->device;
while (seg1->mr_nsegs--)
-   rpcrdma_unmap_one(ia, seg++);
+   rpcrdma_unmap_one(device, seg++);
rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
read_unlock(&ia->ri_qplock);
if (rc)
@@ -323,6 +325,8 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
int nsegs, bool writing)
 {
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct ib_device *device = ia->ri_id->device;
+   enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
struct rpcrdma_frmr *frmr;
struct rpcrdma_mw *mw;
@@ -351,7 +355,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
nsegs = ia->ri_max_frmr_depth;
frmr = &mw->r.frmr;
for (page_no = i = 0; i < nsegs;) {
-   rpcrdma_map_one(ia, seg, writing);
+   rpcrdma_map_one(device, seg, direction);
pa = seg->mr_dma;
for (seg_len = seg->mr_len; seg_len > 0; seg_len -= PAGE_SIZE) {
frmr->fr_pgl->page_list[page_no++] = pa;
@@ -409,7 +413,7 @@ out_senderr:
 out_err:
frmr->fr_state = FRMR_IS_INVALID;
while (i--)
-   rpcrdma_unmap_one(ia, --seg);
+   rpcrdma_unmap_one(device, --seg);
rpcrdma_put_mw(r_xprt, mw);
return rc;
 }
diff --git a/net/sunrpc/xprtrdma/physical_ops.c 
b/net/sunrpc/xprtrdma/physical_ops.c
index 0998f4f..9ce7

Re: [PATCH v1 01/16] xprtrdma: Display IPv6 addresses and port numbers correctly

2015-03-16 Thread Chuck Lever

On Mar 14, 2015, at 7:50 PM, Sagi Grimberg  wrote:

> On 3/13/2015 11:26 PM, Chuck Lever wrote:
>> Signed-off-by: Chuck Lever 
>> ---
>>  net/sunrpc/xprtrdma/transport.c |   47 
>> ---
>>  net/sunrpc/xprtrdma/verbs.c |   21 +++--
>>  2 files changed, 47 insertions(+), 21 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/transport.c 
>> b/net/sunrpc/xprtrdma/transport.c
>> index 2e192ba..26a62e7 100644
>> --- a/net/sunrpc/xprtrdma/transport.c
>> +++ b/net/sunrpc/xprtrdma/transport.c
>> @@ -157,12 +157,47 @@ static struct ctl_table sunrpc_table[] = {
>>  static struct rpc_xprt_ops xprt_rdma_procs; /* forward reference */
>> 
>>  static void
>> +xprt_rdma_format_addresses4(struct rpc_xprt *xprt, struct sockaddr *sap)
>> +{
>> +struct sockaddr_in *sin = (struct sockaddr_in *)sap;
>> +char buf[20];
>> +
>> +snprintf(buf, sizeof(buf), "%08x", ntohl(sin->sin_addr.s_addr));
>> +xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);
>> +
>> +xprt->address_strings[RPC_DISPLAY_NETID] = "rdma";
>> +}
>> +
>> +static void
>> +xprt_rdma_format_addresses6(struct rpc_xprt *xprt, struct sockaddr *sap)
>> +{
>> +struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)sap;
>> +char buf[40];
>> +
>> +snprintf(buf, sizeof(buf), "%pi6", &sin6->sin6_addr);
> 
> Wouldn't you prefer %pIS, which can handle both IPv4/v6, instead of two
> different routines? Or maybe even %pISp would be better?
> 
>> +xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);

The string does not contain a presentation address, which is what
we’d get from %pIS. These are non-standard hexadecimal representations
of the addresses, so we need separate functions for IPv4 and IPv6.
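
For example (a quick sketch of the difference):

	struct sockaddr_in sin = {
		.sin_family = AF_INET,
		.sin_addr.s_addr = htonl(0xc0a80101),	/* 192.168.1.1 */
	};
	char buf[20];

	/* RPC_DISPLAY_HEX_ADDR must be "c0a80101" here; "%pIS" would
	 * produce the presentation form "192.168.1.1" instead */
	snprintf(buf, sizeof(buf), "%08x", ntohl(sin.sin_addr.s_addr));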

>> +
>> +xprt->address_strings[RPC_DISPLAY_NETID] = "rdma6";
> 
> Was RPC_DISPLAY_NETID "rdma6" before this patch?

Section 12 of RFC 5666 defines NETIDs for RPC/RDMA with IPv4 and with
IPv6 addressing. I can cite that section in a comment.

Actually, I should define these in include/linux/sunrpc/msg_prot.h
with the other netids, and use the macro instead.
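
Something like this, following the existing naming pattern in
msg_prot.h (the exact macro names here are my assumption):

	/* RFC 5666, Section 12: netids for RPC-over-RDMA */
	#define RPCBIND_NETID_RDMA	"rdma"
	#define RPCBIND_NETID_RDMA6	"rdma6"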

> 
>> +}
>> +
>> +static void
>>  xprt_rdma_format_addresses(struct rpc_xprt *xprt)
>>  {
>>  struct sockaddr *sap = (struct sockaddr *)
>>  &rpcx_to_rdmad(xprt).addr;
>> -struct sockaddr_in *sin = (struct sockaddr_in *)sap;
>> -char buf[64];
>> +char buf[128];
>> +
>> +switch (sap->sa_family) {
>> +case AF_INET:
>> +xprt_rdma_format_addresses4(xprt, sap);
>> +break;
>> +case AF_INET6:
>> +xprt_rdma_format_addresses6(xprt, sap);
>> +break;
>> +default:
>> +pr_err("rpcrdma: Unrecognized address family\n");
>> +return;
>> +}
>> 
>>  (void)rpc_ntop(sap, buf, sizeof(buf));
>>  xprt->address_strings[RPC_DISPLAY_ADDR] = kstrdup(buf, GFP_KERNEL);
>> @@ -170,16 +205,10 @@ xprt_rdma_format_addresses(struct rpc_xprt *xprt)
>>  snprintf(buf, sizeof(buf), "%u", rpc_get_port(sap));
>>  xprt->address_strings[RPC_DISPLAY_PORT] = kstrdup(buf, GFP_KERNEL);
>> 
>> -xprt->address_strings[RPC_DISPLAY_PROTO] = "rdma";
>> -
>> -snprintf(buf, sizeof(buf), "%08x", ntohl(sin->sin_addr.s_addr));
>> -xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);
>> -
>>  snprintf(buf, sizeof(buf), "%4hx", rpc_get_port(sap));
>>  xprt->address_strings[RPC_DISPLAY_HEX_PORT] = kstrdup(buf, GFP_KERNEL);
>> 
>> -/* netid */
>> -xprt->address_strings[RPC_DISPLAY_NETID] = "rdma";
>> +xprt->address_strings[RPC_DISPLAY_PROTO] = "rdma";
>>  }
>> 
>>  static void
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index 124676c..1aa55b7 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -50,6 +50,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>> 
>>  #include "xprt_rdma.h"
>> @@ -424,7 +425,7 @@ rpcrdma_conn_upcall(struct rdma_cm_id *id, struct 
>> rdma_cm_event *event)
>>  struct rpcrdma_ia *ia = &xprt->rx_ia;
>>  struct rpcrdma_ep *ep = &xprt->rx_ep;
>

Re: [PATCH v1 05/16] xprtrdma: Add a "register_external" op for each memreg mode

2015-03-16 Thread Chuck Lever

On Mar 16, 2015, at 3:28 AM, Sagi Grimberg  wrote:

> On 3/13/2015 11:27 PM, Chuck Lever wrote:
>> There is very little common processing among the different external
>> memory registration functions. Have rpcrdma_create_chunks() call
>> the registration method directly. This removes a stack frame and a
>> switch statement from the external registration path.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>>  net/sunrpc/xprtrdma/fmr_ops.c  |   51 +++
>>  net/sunrpc/xprtrdma/frwr_ops.c |   88 ++
>>  net/sunrpc/xprtrdma/physical_ops.c |   17 
>>  net/sunrpc/xprtrdma/rpc_rdma.c |5 +
>>  net/sunrpc/xprtrdma/verbs.c|  172 
>> +---
>>  net/sunrpc/xprtrdma/xprt_rdma.h|6 +
>>  6 files changed, 166 insertions(+), 173 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>> index eec2660..45fb646 100644
>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>> @@ -29,7 +29,58 @@ fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
>>   rpcrdma_max_segments(r_xprt) * RPCRDMA_MAX_FMR_SGES);
>>  }
>> 
>> +/* Use the ib_map_phys_fmr() verb to register a memory region
>> + * for remote access via RDMA READ or RDMA WRITE.
>> + */
>> +static int
>> +fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>> +   int nsegs, bool writing)
>> +{
>> +struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>> +struct rpcrdma_mr_seg *seg1 = seg;
>> +struct rpcrdma_mw *mw = seg1->rl_mw;
>> +u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
>> +int len, pageoff, i, rc;
>> +
>> +pageoff = offset_in_page(seg1->mr_offset);
>> +seg1->mr_offset -= pageoff; /* start of page */
>> +seg1->mr_len += pageoff;
>> +len = -pageoff;
>> +if (nsegs > RPCRDMA_MAX_FMR_SGES)
>> +nsegs = RPCRDMA_MAX_FMR_SGES;
>> +for (i = 0; i < nsegs;) {
>> +rpcrdma_map_one(ia, seg, writing);
>> +physaddrs[i] = seg->mr_dma;
>> +len += seg->mr_len;
>> +++seg;
>> +++i;
>> +/* Check for holes */
>> +if ((i < nsegs && offset_in_page(seg->mr_offset)) ||
>> +offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
>> +break;
>> +}
>> +
>> +rc = ib_map_phys_fmr(mw->r.fmr, physaddrs, i, seg1->mr_dma);
>> +if (rc)
>> +goto out_maperr;
>> +
>> +seg1->mr_rkey = mw->r.fmr->rkey;
>> +seg1->mr_base = seg1->mr_dma + pageoff;
>> +seg1->mr_nsegs = i;
>> +seg1->mr_len = len;
>> +return i;
>> +
>> +out_maperr:
>> +dprintk("RPC:   %s: ib_map_phys_fmr %u@0x%llx+%i (%d) status %i\n",
>> +__func__, len, (unsigned long long)seg1->mr_dma,
>> +pageoff, i, rc);
>> +while (i--)
>> +rpcrdma_unmap_one(ia, --seg);
>> +return rc;
>> +}
>> +
>>  const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
>> +.ro_map = fmr_op_map,
>>  .ro_maxpages= fmr_op_maxpages,
>>  .ro_displayname = "fmr",
>>  };
>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
>> index 73a5ac8..2b5ccb0 100644
>> --- a/net/sunrpc/xprtrdma/frwr_ops.c
>> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
>> @@ -29,7 +29,95 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
>>   rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
>>  }
>> 
>> +/* Post a FAST_REG Work Request to register a memory region
>> + * for remote access via RDMA READ or RDMA WRITE.
>> + */
>> +static int
>> +frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>> +int nsegs, bool writing)
>> +{
>> +struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>> +struct rpcrdma_mr_seg *seg1 = seg;
>> +struct rpcrdma_mw *mw = seg1->rl_mw;
>> +struct rpcrdma_frmr *frmr = &mw->r.frmr;
>> +struct ib_mr *mr = frmr->fr_mr;
>> +struct ib_send_wr fastreg_wr, *bad_wr;
>> +u8 key;
>> +int len, pageoff;
>> +int i, rc;
>> +int seg_len;
>> +u64 pa;
>> +int page_no;
>> +
>> +pageoff = offset_in_page(seg1->mr_offset);
>> +seg1->mr_offset -= pageoff; /* start of page */
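
The rpc_rdma.c side of this change reduces to an indirect call through
the new ops vector; roughly, as a sketch with assumed local variable
names:

	/* rpcrdma_create_chunks(): replaces the old switch on the
	 * memory registration strategy */
	n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs, writing);
	if (n <= 0)
		goto out;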

Re: [PATCH v1 09/16] xprtrdma: Add "destroy MRs" memreg op

2015-03-16 Thread Chuck Lever

On Mar 16, 2015, at 3:46 AM, Sagi Grimberg  wrote:

> On 3/13/2015 11:27 PM, Chuck Lever wrote:
>> Memory Region objects associated with a transport instance are
>> destroyed before the instance is shut down and destroyed.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>>  net/sunrpc/xprtrdma/fmr_ops.c  |   21 +++
>>  net/sunrpc/xprtrdma/frwr_ops.c |   17 
>>  net/sunrpc/xprtrdma/physical_ops.c |6 
>>  net/sunrpc/xprtrdma/verbs.c|   52 
>> +---
>>  net/sunrpc/xprtrdma/xprt_rdma.h|1 +
>>  5 files changed, 46 insertions(+), 51 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>> index 1ccb3de..3115e4b 100644
>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>> @@ -178,11 +178,32 @@ fmr_op_reset(struct rpcrdma_xprt *r_xprt)
>>  __func__, rc);
>>  }
>> 
>> +static void
>> +fmr_op_destroy(struct rpcrdma_buffer *buf)
>> +{
>> +struct rpcrdma_mw *r;
>> +int rc;
>> +
>> +while (!list_empty(&buf->rb_all)) {
>> +r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
>> +list_del(&r->mw_all);
>> +list_del(&r->mw_list);
> 
> Again, I understand this patch is just moving code, but is it
> guaranteed that mw_list is in rb_mws at this point?

Good call, there probably is no such guarantee in the current code base.
Since the transport is being destroyed anyway, I think I can just remove
that list_del(&r->mw_list).
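
That is, the loop would become (sketch):

	while (!list_empty(&buf->rb_all)) {
		r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
		list_del(&r->mw_all);
		/* no list_del(&r->mw_list): the MW is not guaranteed
		 * to be on rb_mws at this point */

		rc = ib_dealloc_fmr(r->r.fmr);
		if (rc)
			dprintk("RPC:       %s: ib_dealloc_fmr failed %i\n",
				__func__, rc);

		kfree(r);
	}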


> 
>> +
>> +rc = ib_dealloc_fmr(r->r.fmr);
>> +if (rc)
>> +dprintk("RPC:   %s: ib_dealloc_fmr failed %i\n",
>> +__func__, rc);
>> +
>> +kfree(r);
>> +}
>> +}
>> +
>>  const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
>>  .ro_map = fmr_op_map,
>>  .ro_unmap   = fmr_op_unmap,
>>  .ro_maxpages= fmr_op_maxpages,
>>  .ro_init= fmr_op_init,
>>  .ro_reset   = fmr_op_reset,
>> +.ro_destroy = fmr_op_destroy,
>>  .ro_displayname = "fmr",
>>  };
>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
>> index b4ce0e5..fc3a228 100644
>> --- a/net/sunrpc/xprtrdma/frwr_ops.c
>> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
>> @@ -271,11 +271,28 @@ frwr_op_reset(struct rpcrdma_xprt *r_xprt)
>>  }
>>  }
>> 
>> +static void
>> +frwr_op_destroy(struct rpcrdma_buffer *buf)
>> +{
>> +struct rpcrdma_mw *r;
>> +
>> +while (!list_empty(&buf->rb_all)) {
>> +r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
>> +list_del(&r->mw_all);
>> +list_del(&r->mw_list);
>> +
>> +__frwr_release(r);
>> +
>> +kfree(r);
>> +}
>> +}
>> +
>>  const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
>>  .ro_map = frwr_op_map,
>>  .ro_unmap   = frwr_op_unmap,
>>  .ro_maxpages= frwr_op_maxpages,
>>  .ro_init= frwr_op_init,
>>  .ro_reset   = frwr_op_reset,
>> +.ro_destroy = frwr_op_destroy,
>>  .ro_displayname = "frwr",
>>  };
>> diff --git a/net/sunrpc/xprtrdma/physical_ops.c 
>> b/net/sunrpc/xprtrdma/physical_ops.c
>> index 0afc691..f8da8c4 100644
>> --- a/net/sunrpc/xprtrdma/physical_ops.c
>> +++ b/net/sunrpc/xprtrdma/physical_ops.c
>> @@ -67,11 +67,17 @@ physical_op_reset(struct rpcrdma_xprt *r_xprt)
>>  {
>>  }
>> 
>> +static void
>> +physical_op_destroy(struct rpcrdma_buffer *buf)
>> +{
>> +}
>> +
>>  const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
>>  .ro_map = physical_op_map,
>>  .ro_unmap   = physical_op_unmap,
>>  .ro_maxpages= physical_op_maxpages,
>>  .ro_init= physical_op_init,
>>  .ro_reset   = physical_op_reset,
>> +.ro_destroy = physical_op_destroy,
>>  .ro_displayname = "physical",

Re: [PATCH v1 06/16] xprtrdma: Add a "deregister_external" op for each memreg mode

2015-03-16 Thread Chuck Lever

On Mar 16, 2015, at 3:34 AM, Sagi Grimberg  wrote:

> On 3/13/2015 11:27 PM, Chuck Lever wrote:
>> There is very little common processing among the different external
>> memory deregistration functions.
>> 
>> In addition, instead of calling the deregistration function for each
>> segment, have one call release all segments for a request. This makes
>> the API a little asymmetrical, but a hair faster.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>>  net/sunrpc/xprtrdma/fmr_ops.c  |   37 
>>  net/sunrpc/xprtrdma/frwr_ops.c |   46 
>>  net/sunrpc/xprtrdma/physical_ops.c |   13 ++
>>  net/sunrpc/xprtrdma/rpc_rdma.c |7 +--
>>  net/sunrpc/xprtrdma/transport.c|8 +---
>>  net/sunrpc/xprtrdma/verbs.c|   81 
>> 
>>  net/sunrpc/xprtrdma/xprt_rdma.h|5 +-
>>  7 files changed, 103 insertions(+), 94 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>> index 45fb646..9b983b4 100644
>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>> @@ -20,6 +20,32 @@
>>  /* Maximum scatter/gather per FMR */
>> #define RPCRDMA_MAX_FMR_SGES (64)
>> 
>> +/* Use the ib_unmap_fmr() verb to prevent further remote
>> + * access via RDMA READ or RDMA WRITE.
>> + */
>> +static int
>> +__fmr_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
>> +{
>> +struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>> +struct rpcrdma_mr_seg *seg1 = seg;
>> +int rc, nsegs = seg->mr_nsegs;
>> +LIST_HEAD(l);
>> +
>> +list_add(&seg1->rl_mw->r.fmr->list, &l);
>> +rc = ib_unmap_fmr(&l);
>> +read_lock(&ia->ri_qplock);
>> +while (seg1->mr_nsegs--)
>> +rpcrdma_unmap_one(ia, seg++);
>> +read_unlock(&ia->ri_qplock);
> 
> So I know you are just moving things around here, but can you explain
> why the read_lock is taken here? Why do you need this protection?

The deregistration method is not serialized with transport reconnect.
That means ->ri_id and ->qp can be NULL in some corner cases when the
upper layer invokes this code path.

rpcrdma_unmap_one() dereferences ->ri_id, so it has to be protected.

The expedient fix in commit 73806c88 was to add the rwlock here,
ensuring ->qp can always be dereferenced safely.

The longer term fix is what we discussed last week: use a work queue
sentinel to co-ordinate the destruction of the QP with callers like
this one. That would also protect our upcall code, and we could remove
the rwlock.
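
In other words, the scheme from commit 73806c88 is (sketch):

	/* consumers pin ri_id and the QP across the verb call */
	read_lock(&ia->ri_qplock);
	rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
	read_unlock(&ia->ri_qplock);

	/* the connect worker swaps in a fresh cm_id/QP exclusively */
	write_lock(&ia->ri_qplock);
	old_id = ia->ri_id;
	ia->ri_id = new_id;
	write_unlock(&ia->ri_qplock);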

> 
>> +if (rc)
>> +goto out_err;
>> +return nsegs;
>> +
>> +out_err:
>> +dprintk("RPC:   %s: ib_unmap_fmr status %i\n", __func__, rc);
>> +return nsegs;
>> +}
>> +
>>  /* FMR mode conveys up to 64 pages of payload per chunk segment.
>>   */
>>  static size_t
>> @@ -79,8 +105,19 @@ out_maperr:
>>  return rc;
>>  }
>> 
>> +static void
>> +fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
>> + unsigned int count)
>> +{
>> +unsigned int i;
>> +
>> +for (i = 0; count--;)
>> +i += __fmr_unmap(r_xprt, &req->rl_segments[i]);
>> +}
>> +
>>  const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
>>  .ro_map = fmr_op_map,
>> +.ro_unmap   = fmr_op_unmap,
>>  .ro_maxpages= fmr_op_maxpages,
>>  .ro_displayname = "fmr",
>>  };
>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
>> index 2b5ccb0..05b5761 100644
>> --- a/net/sunrpc/xprtrdma/frwr_ops.c
>> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
>> @@ -17,6 +17,41 @@
>> # define RPCDBG_FACILITY RPCDBG_TRANS
>>  #endif
>> 
>> +/* Post a LOCAL_INV Work Request to prevent further remote access
>> + * via RDMA READ or RDMA WRITE.
>> + */
>> +static int
>> +__frwr_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
>> +{
>> +struct rpcrdma_mr_seg *seg1 = seg;
>> +struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>> +struct ib_send_wr invalidate_wr, *bad_wr;
>> +int rc, nsegs = seg->mr_nsegs;
>> +
>> +seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
>> +
>> +memset(&invalidate_wr, 0, sizeof(invalidate_wr));
>> +invalidate_wr.wr_id = (unsigned long)(void *)seg1->rl_mw;

Re: [PATCH v1 12/16] xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()

2015-03-16 Thread Chuck Lever

On Mar 16, 2015, at 5:28 AM, Sagi Grimberg  wrote:

> On 3/13/2015 11:28 PM, Chuck Lever wrote:
>> Acquiring 64 FMRs in rpcrdma_buffer_get() while holding the buffer
>> pool lock is expensive, and unnecessary because FMR mode can
>> transfer up to a 1MB payload using just a single ib_fmr.
>> 
>> Instead, acquire ib_fmrs one-at-a-time as chunks are registered, and
>> return them to rb_mws immediately during deregistration.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>>  net/sunrpc/xprtrdma/fmr_ops.c   |   41 
>> ---
>>  net/sunrpc/xprtrdma/verbs.c |   41 
>> ++-
>>  net/sunrpc/xprtrdma/xprt_rdma.h |1 +
>>  3 files changed, 54 insertions(+), 29 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>> index 96e6cd3..9c6c2e8 100644
>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>> @@ -28,10 +28,11 @@ __fmr_unmap(struct rpcrdma_xprt *r_xprt, struct 
>> rpcrdma_mr_seg *seg)
>>  {
>>  struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>>  struct rpcrdma_mr_seg *seg1 = seg;
>> +struct rpcrdma_mw *mw = seg1->rl_mw;
>>  int rc, nsegs = seg->mr_nsegs;
>>  LIST_HEAD(l);
>> 
>> -list_add(&seg1->rl_mw->r.fmr->list, &l);
>> +list_add(&mw->r.fmr->list, &l);
>>  rc = ib_unmap_fmr(&l);
>>  read_lock(&ia->ri_qplock);
>>  while (seg1->mr_nsegs--)
>> @@ -39,11 +40,14 @@ __fmr_unmap(struct rpcrdma_xprt *r_xprt, struct 
>> rpcrdma_mr_seg *seg)
>>  read_unlock(&ia->ri_qplock);
>>  if (rc)
>>  goto out_err;
>> +out:
>> +seg1->rl_mw = NULL;
>> +rpcrdma_put_mw(r_xprt, mw);
>>  return nsegs;
>> 
>>  out_err:
>>  dprintk("RPC:   %s: ib_unmap_fmr status %i\n", __func__, rc);
>> -return nsegs;
>> +goto out;
>>  }
>> 
>>  static int
>> @@ -117,6 +121,27 @@ out_fmr_err:
>>  return rc;
>>  }
>> 
>> +static struct rpcrdma_mw *
>> +__fmr_get_mw(struct rpcrdma_xprt *r_xprt)
> 
> This introduces an asymmetric approach where you have fmr/frwr get_mw
> routines but have a single rpcrdma_put_mw. I noticed that the
> frwr_get_mw (next patch) is almost completely different - but I wonder
> if that should really be that different?

FMR doesn’t need to deal with asynchronous LOCAL_INV getting flushed
when the transport disconnects.

I will explain this further in response to 13/16.
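
Concretely, __frwr_get_mw() has to do extra work that __fmr_get_mw()
can skip; roughly (a sketch based on the hunk in 13/16):

	mw = NULL;
	while (!list_empty(&buf->rb_mws)) {
		mw = list_entry(buf->rb_mws.next,
				struct rpcrdma_mw, mw_list);
		if (mw->r.frmr.fr_state != FRMR_IS_STALE) {
			list_del_init(&mw->mw_list);
			break;		/* found a usable FRMR */
		}
		/* its LOCAL_INV was flushed by a disconnect: set it
		 * aside to be invalidated again before reuse */
		list_move_tail(&mw->mw_list, &stale);
		count++;
		mw = NULL;
	}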

> Just raising the question.
> 
>> +{
>> +struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>> +struct rpcrdma_mw *mw = NULL;
>> +unsigned long flags;
>> +
>> +spin_lock_irqsave(&buf->rb_lock, flags);
>> +
>> +if (!list_empty(&buf->rb_mws)) {
>> +mw = list_entry(buf->rb_mws.next,
>> +struct rpcrdma_mw, mw_list);
>> +list_del_init(&mw->mw_list);
>> +} else {
>> +pr_err("RPC:   %s: no MWs available\n", __func__);
>> +}
>> +
>> +spin_unlock_irqrestore(&buf->rb_lock, flags);
>> +return mw;
>> +}
>> +
>>  /* Use the ib_map_phys_fmr() verb to register a memory region
>>   * for remote access via RDMA READ or RDMA WRITE.
>>   */
>> @@ -126,10 +151,18 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct 
>> rpcrdma_mr_seg *seg,
>>  {
>>  struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>>  struct rpcrdma_mr_seg *seg1 = seg;
>> -struct rpcrdma_mw *mw = seg1->rl_mw;
>> +struct rpcrdma_mw *mw;
>>  u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
>>  int len, pageoff, i, rc;
>> 
>> +mw = __fmr_get_mw(r_xprt);
>> +if (!mw)
>> +return -ENOMEM;
>> +if (seg1->rl_mw) {
>> +rpcrdma_put_mw(r_xprt, seg1->rl_mw);
>> +seg1->rl_mw = NULL;
>> +}
>> +
> 
> How can this happen? Getting to op_map with an existing rl_mw? And
> wouldn't it be better to use rl_mw instead of getting a new mw and
> putting seg1->rl_mw?
> 
>>  pageoff = offset_in_page(seg1->mr_offset);
>>  seg1->mr_offset -= pageoff; /* start of page */
>>  seg1->mr_len += pageoff;
>> @@ -152,6 +185,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct 
>> rpcrdma_mr_seg *seg,
>>  if (rc)
>>  goto out_maperr;

Re: [PATCH v1 13/16] xprtrdma: Acquire MRs in rpcrdma_register_external()

2015-03-16 Thread Chuck Lever

On Mar 16, 2015, at 5:44 AM, Sagi Grimberg  wrote:

> On 3/13/2015 11:28 PM, Chuck Lever wrote:
>> Acquiring 64 MRs in rpcrdma_buffer_get() while holding the buffer
>> pool lock is expensive, and unnecessary because most modern adapters
>> can transfer 100s of KBs of payload using just a single MR.
>> 
>> Instead, acquire MRs one-at-a-time as chunks are registered, and
>> return them to rb_mws immediately during deregistration.
>> 
>> Note: commit 539431a437d2 ("xprtrdma: Don't invalidate FRMRs if
>> registration fails") is reverted: There is now a valid case where
>> registration can fail (with -ENOMEM) but the QP is still in RTS.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>>  net/sunrpc/xprtrdma/frwr_ops.c |  126 
>> ---
>>  net/sunrpc/xprtrdma/rpc_rdma.c |3 -
>>  net/sunrpc/xprtrdma/verbs.c|  130 
>> 
>>  3 files changed, 120 insertions(+), 139 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
>> index 6e6d8ba..d23e064 100644
>> --- a/net/sunrpc/xprtrdma/frwr_ops.c
>> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
>> @@ -54,15 +54,16 @@ __frwr_unmap(struct rpcrdma_xprt *r_xprt, struct 
>> rpcrdma_mr_seg *seg)
>>  {
>>  struct rpcrdma_mr_seg *seg1 = seg;
>>  struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>> +struct rpcrdma_mw *mw = seg1->rl_mw;
>>  struct ib_send_wr invalidate_wr, *bad_wr;
>>  int rc, nsegs = seg->mr_nsegs;
>> 
>> -seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
>> +mw->r.frmr.fr_state = FRMR_IS_INVALID;
>> 
>>  memset(&invalidate_wr, 0, sizeof(invalidate_wr));
>> -invalidate_wr.wr_id = (unsigned long)(void *)seg1->rl_mw;
>> +invalidate_wr.wr_id = (unsigned long)(void *)mw;
>>  invalidate_wr.opcode = IB_WR_LOCAL_INV;
>> -invalidate_wr.ex.invalidate_rkey = seg1->rl_mw->r.frmr.fr_mr->rkey;
>> +invalidate_wr.ex.invalidate_rkey = mw->r.frmr.fr_mr->rkey;
>>  DECR_CQCOUNT(&r_xprt->rx_ep);
>> 
>>  read_lock(&ia->ri_qplock);
>> @@ -72,13 +73,17 @@ __frwr_unmap(struct rpcrdma_xprt *r_xprt, struct 
>> rpcrdma_mr_seg *seg)
>>  read_unlock(&ia->ri_qplock);
>>  if (rc)
>>  goto out_err;
>> +
>> +out:
>> +seg1->rl_mw = NULL;
>> +rpcrdma_put_mw(r_xprt, mw);
>>  return nsegs;
>> 
>>  out_err:
>>  /* Force rpcrdma_buffer_get() to retry */
>> -seg1->rl_mw->r.frmr.fr_state = FRMR_IS_STALE;
>> +mw->r.frmr.fr_state = FRMR_IS_STALE;
>>  dprintk("RPC:   %s: ib_post_send status %i\n", __func__, rc);
>> -return nsegs;
>> +goto out;
>>  }
>> 
>>  static void
>> @@ -217,6 +222,99 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
>>  return 0;
>>  }
>> 
>> +/* rpcrdma_unmap_one() was already done by rpcrdma_frwr_releasesge().
>> + * Redo only the ib_post_send().
>> + */
>> +static void
>> +__retry_local_inv(struct rpcrdma_ia *ia, struct rpcrdma_mw *r)
>> +{
>> +struct rpcrdma_xprt *r_xprt =
>> +container_of(ia, struct rpcrdma_xprt, rx_ia);
>> +struct ib_send_wr invalidate_wr, *bad_wr;
>> +int rc;
>> +
>> +pr_warn("RPC:   %s: FRMR %p is stale\n", __func__, r);
>> +
>> +/* When this FRMR is re-inserted into rb_mws, it is no longer stale */
>> +r->r.frmr.fr_state = FRMR_IS_INVALID;
>> +
>> +memset(&invalidate_wr, 0, sizeof(invalidate_wr));
>> +invalidate_wr.wr_id = (unsigned long)(void *)r;
>> +invalidate_wr.opcode = IB_WR_LOCAL_INV;
>> +invalidate_wr.ex.invalidate_rkey = r->r.frmr.fr_mr->rkey;
>> +DECR_CQCOUNT(&r_xprt->rx_ep);
>> +
>> +pr_warn("RPC:   %s: frmr %p invalidating rkey %08x\n",
>> +__func__, r, r->r.frmr.fr_mr->rkey);
>> +
>> +read_lock(&ia->ri_qplock);
>> +rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
>> +read_unlock(&ia->ri_qplock);
>> +if (rc) {
>> +/* Force __frwr_get_mw() to retry */
>> +r->r.frmr.fr_state = FRMR_IS_STALE;
> 
> Why should you retry this? Why should it succeed next time around?
> I would think that if you didn't succeed in post_send you'd just destroy and
> recreate the mr (in another context or j

Re: [PATCH v1 05/16] xprtrdma: Add a "register_external" op for each memreg mode

2015-03-16 Thread Chuck Lever

On Mar 16, 2015, at 1:13 PM, Steve Wise  wrote:

> On 3/16/2015 1:15 PM, Sagi Grimberg wrote:
>> On 3/16/2015 6:48 PM, Chuck Lever wrote:
>>> 
>>> On Mar 16, 2015, at 3:28 AM, Sagi Grimberg  wrote:
>>> 
>>>> On 3/13/2015 11:27 PM, Chuck Lever wrote:
>>>>> There is very little common processing among the different external
>>>>> memory registration functions. Have rpcrdma_create_chunks() call
>>>>> the registration method directly. This removes a stack frame and a
>>>>> switch statement from the external registration path.
>>>>> 
>>>>> Signed-off-by: Chuck Lever 
>>>>> ---
>>>>>  net/sunrpc/xprtrdma/fmr_ops.c  |   51 +++
>>>>>  net/sunrpc/xprtrdma/frwr_ops.c |   88 ++
>>>>>  net/sunrpc/xprtrdma/physical_ops.c |   17 
>>>>>  net/sunrpc/xprtrdma/rpc_rdma.c |5 +
>>>>>  net/sunrpc/xprtrdma/verbs.c|  172 
>>>>> +---
>>>>>  net/sunrpc/xprtrdma/xprt_rdma.h|6 +
>>>>>  6 files changed, 166 insertions(+), 173 deletions(-)
>>>>> 
>>>>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>>>>> index eec2660..45fb646 100644
>>>>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>>>>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>>>>> @@ -29,7 +29,58 @@ fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
>>>>>   rpcrdma_max_segments(r_xprt) * RPCRDMA_MAX_FMR_SGES);
>>>>>  }
>>>>> 
>>>>> +/* Use the ib_map_phys_fmr() verb to register a memory region
>>>>> + * for remote access via RDMA READ or RDMA WRITE.
>>>>> + */
>>>>> +static int
>>>>> +fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>>>>> +   int nsegs, bool writing)
>>>>> +{
>>>>> +struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>>>>> +struct rpcrdma_mr_seg *seg1 = seg;
>>>>> +struct rpcrdma_mw *mw = seg1->rl_mw;
>>>>> +u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
>>>>> +int len, pageoff, i, rc;
>>>>> +
>>>>> +pageoff = offset_in_page(seg1->mr_offset);
>>>>> +seg1->mr_offset -= pageoff;/* start of page */
>>>>> +seg1->mr_len += pageoff;
>>>>> +len = -pageoff;
>>>>> +if (nsegs > RPCRDMA_MAX_FMR_SGES)
>>>>> +nsegs = RPCRDMA_MAX_FMR_SGES;
>>>>> +for (i = 0; i < nsegs;) {
>>>>> +rpcrdma_map_one(ia, seg, writing);
>>>>> +physaddrs[i] = seg->mr_dma;
>>>>> +len += seg->mr_len;
>>>>> +++seg;
>>>>> +++i;
>>>>> +/* Check for holes */
>>>>> +if ((i < nsegs && offset_in_page(seg->mr_offset)) ||
>>>>> +offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
>>>>> +break;
>>>>> +}
>>>>> +
>>>>> +rc = ib_map_phys_fmr(mw->r.fmr, physaddrs, i, seg1->mr_dma);
>>>>> +if (rc)
>>>>> +goto out_maperr;
>>>>> +
>>>>> +seg1->mr_rkey = mw->r.fmr->rkey;
>>>>> +seg1->mr_base = seg1->mr_dma + pageoff;
>>>>> +seg1->mr_nsegs = i;
>>>>> +seg1->mr_len = len;
>>>>> +return i;
>>>>> +
>>>>> +out_maperr:
>>>>> +dprintk("RPC:   %s: ib_map_phys_fmr %u@0x%llx+%i (%d) status 
>>>>> %i\n",
>>>>> +__func__, len, (unsigned long long)seg1->mr_dma,
>>>>> +pageoff, i, rc);
>>>>> +while (i--)
>>>>> +rpcrdma_unmap_one(ia, --seg);
>>>>> +return rc;
>>>>> +}
>>>>> +
>>>>>  const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
>>>>> +.ro_map= fmr_op_map,
>>>>>  .ro_maxpages= fmr_op_maxpages,
>>>>>  .ro_displayname= "fmr",
>>>>>  };
>>>>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c 
>>>>> b/net/sunrpc/xprtrdma/frwr_ops.c
>>>>> index 73a5ac8..2b5ccb0 100644
>>>>>

Re: [PATCH v1 01/16] xprtrdma: Display IPv6 addresses and port numbers correctly

2015-03-24 Thread Chuck Lever

On Mar 24, 2015, at 4:27 AM, Devesh Sharma  wrote:

> I see that in the svcrdma code there is a big check that aborts creating the
> listener if the AF is not AF_INET.
> Do we have plans to address this on the server as well?

No specific plans yet, but yes, the server needs IPv6 updates as well.


> static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
>struct net *net,
>struct sockaddr *sa, int salen,
>int flags)
> {
>struct rdma_cm_id *listen_id;
>struct svcxprt_rdma *cma_xprt;
>int ret;
> 
>dprintk("svcrdma: Creating RDMA socket\n");
>if (sa->sa_family != AF_INET) {
>dprintk("svcrdma: Address family %d is not supported.\n", 
> sa->sa_family);
>return ERR_PTR(-EAFNOSUPPORT);
> 
> -Regards
> Devesh
> 
>> -Original Message-
>> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
>> ow...@vger.kernel.org] On Behalf Of Chuck Lever
>> Sent: Saturday, March 14, 2015 2:57 AM
>> To: linux-rdma@vger.kernel.org
>> Subject: [PATCH v1 01/16] xprtrdma: Display IPv6 addresses and port numbers
>> correctly
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>> net/sunrpc/xprtrdma/transport.c |   47
>> ---
>> net/sunrpc/xprtrdma/verbs.c |   21 +++--
>> 2 files changed, 47 insertions(+), 21 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/transport.c 
>> b/net/sunrpc/xprtrdma/transport.c
>> index 2e192ba..26a62e7 100644
>> --- a/net/sunrpc/xprtrdma/transport.c
>> +++ b/net/sunrpc/xprtrdma/transport.c
>> @@ -157,12 +157,47 @@ static struct ctl_table sunrpc_table[] = {
>> static struct rpc_xprt_ops xprt_rdma_procs;  /* forward reference */
>> 
>> static void
>> +xprt_rdma_format_addresses4(struct rpc_xprt *xprt, struct sockaddr
>> +*sap) {
>> +struct sockaddr_in *sin = (struct sockaddr_in *)sap;
>> +char buf[20];
>> +
>> +snprintf(buf, sizeof(buf), "%08x", ntohl(sin->sin_addr.s_addr));
>> +xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf,
>> +GFP_KERNEL);
>> +
>> +xprt->address_strings[RPC_DISPLAY_NETID] = "rdma"; }
>> +
>> +static void
>> +xprt_rdma_format_addresses6(struct rpc_xprt *xprt, struct sockaddr
>> +*sap) {
>> +struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)sap;
>> +char buf[40];
>> +
>> +snprintf(buf, sizeof(buf), "%pi6", &sin6->sin6_addr);
>> +xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf,
>> +GFP_KERNEL);
>> +
>> +xprt->address_strings[RPC_DISPLAY_NETID] = "rdma6"; }
>> +
>> +static void
>> xprt_rdma_format_addresses(struct rpc_xprt *xprt)  {
>>  struct sockaddr *sap = (struct sockaddr *)
>>  &rpcx_to_rdmad(xprt).addr;
>> -struct sockaddr_in *sin = (struct sockaddr_in *)sap;
>> -char buf[64];
>> +char buf[128];
>> +
>> +switch (sap->sa_family) {
>> +case AF_INET:
>> +xprt_rdma_format_addresses4(xprt, sap);
>> +break;
>> +case AF_INET6:
>> +xprt_rdma_format_addresses6(xprt, sap);
>> +break;
>> +default:
>> +pr_err("rpcrdma: Unrecognized address family\n");
>> +return;
>> +}
>> 
>>  (void)rpc_ntop(sap, buf, sizeof(buf));
>>  xprt->address_strings[RPC_DISPLAY_ADDR] = kstrdup(buf,
>> GFP_KERNEL); @@ -170,16 +205,10 @@ xprt_rdma_format_addresses(struct
>> rpc_xprt *xprt)
>>  snprintf(buf, sizeof(buf), "%u", rpc_get_port(sap));
>>  xprt->address_strings[RPC_DISPLAY_PORT] = kstrdup(buf,
>> GFP_KERNEL);
>> 
>> -xprt->address_strings[RPC_DISPLAY_PROTO] = "rdma";
>> -
>> -snprintf(buf, sizeof(buf), "%08x", ntohl(sin->sin_addr.s_addr));
>> -xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf,
>> GFP_KERNEL);
>> -
>>  snprintf(buf, sizeof(buf), "%4hx", rpc_get_port(sap));
>>  xprt->address_strings[RPC_DISPLAY_HEX_PORT] = kstrdup(buf,
>> GFP_KERNEL);
>> 
>> -/* netid */
>> -xprt->address_strings[RPC_DISPLAY_NETID] = "rdma";
>> +xprt->address_strings[RPC_DISPLAY_PROTO] = "rdma";
>> }
>> 
>> static void
>> diff --git a/ne

Re: [PATCH v1 06/16] xprtrdma: Add a "deregister_external" op for each memreg mode

2015-03-24 Thread Chuck Lever

On Mar 24, 2015, at 7:12 AM, Devesh Sharma  wrote:

>> -Original Message-
>> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
>> ow...@vger.kernel.org] On Behalf Of Chuck Lever
>> Sent: Saturday, March 14, 2015 2:57 AM
>> To: linux-rdma@vger.kernel.org
>> Subject: [PATCH v1 06/16] xprtrdma: Add a "deregister_external" op for each
>> memreg mode
>> 
>> There is very little common processing among the different external memory
>> deregistration functions.
>> 
>> In addition, instead of calling the deregistration function for each segment,
>> have one call release all segments for a request. This makes the API a little
>> asymmetrical, but a hair faster.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>> net/sunrpc/xprtrdma/fmr_ops.c  |   37 
>> net/sunrpc/xprtrdma/frwr_ops.c |   46 
>> net/sunrpc/xprtrdma/physical_ops.c |   13 ++
>> net/sunrpc/xprtrdma/rpc_rdma.c |7 +--
>> net/sunrpc/xprtrdma/transport.c|8 +---
>> net/sunrpc/xprtrdma/verbs.c|   81 
>> 
>> net/sunrpc/xprtrdma/xprt_rdma.h|5 +-
>> 7 files changed, 103 insertions(+), 94 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>> index 45fb646..9b983b4 100644
>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>> @@ -20,6 +20,32 @@
>> /* Maximum scatter/gather per FMR */
>> #define RPCRDMA_MAX_FMR_SGES (64)
>> 
>> +/* Use the ib_unmap_fmr() verb to prevent further remote
>> + * access via RDMA READ or RDMA WRITE.
>> + */
>> +static int
>> +__fmr_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg) {
>> +struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>> +struct rpcrdma_mr_seg *seg1 = seg;
>> +int rc, nsegs = seg->mr_nsegs;
>> +LIST_HEAD(l);
>> +
>> +list_add(&seg1->rl_mw->r.fmr->list, &l);
>> +rc = ib_unmap_fmr(&l);
>> +read_lock(&ia->ri_qplock);
>> +while (seg1->mr_nsegs--)
>> +rpcrdma_unmap_one(ia, seg++);
>> +read_unlock(&ia->ri_qplock);
>> +if (rc)
>> +goto out_err;
>> +return nsegs;
>> +
>> +out_err:
>> +dprintk("RPC:   %s: ib_unmap_fmr status %i\n", __func__, rc);
>> +return nsegs;
>> +}
>> +
>> /* FMR mode conveys up to 64 pages of payload per chunk segment.
>>  */
>> static size_t
>> @@ -79,8 +105,19 @@ out_maperr:
>>  return rc;
>> }
>> 
>> +static void
>> +fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
>> + unsigned int count)
>> +{
>> +unsigned int i;
>> +
>> +for (i = 0; count--;)
>> +i += __fmr_unmap(r_xprt, &req->rl_segments[i]); }
>> +
>> const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
>>  .ro_map = fmr_op_map,
>> +.ro_unmap   = fmr_op_unmap,
>>  .ro_maxpages= fmr_op_maxpages,
>>  .ro_displayname = "fmr",
>> };
>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
>> index 2b5ccb0..05b5761 100644
>> --- a/net/sunrpc/xprtrdma/frwr_ops.c
>> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
>> @@ -17,6 +17,41 @@
>> # define RPCDBG_FACILITY RPCDBG_TRANS
>> #endif
>> 
>> +/* Post a LOCAL_INV Work Request to prevent further remote access
>> + * via RDMA READ or RDMA WRITE.
>> + */
>> +static int
>> +__frwr_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg) {
>> +struct rpcrdma_mr_seg *seg1 = seg;
>> +struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>> +struct ib_send_wr invalidate_wr, *bad_wr;
>> +int rc, nsegs = seg->mr_nsegs;
>> +
>> +seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
>> +
>> +memset(&invalidate_wr, 0, sizeof(invalidate_wr));
>> +invalidate_wr.wr_id = (unsigned long)(void *)seg1->rl_mw;
>> +invalidate_wr.opcode = IB_WR_LOCAL_INV;
>> +invalidate_wr.ex.invalidate_rkey = seg1->rl_mw->r.frmr.fr_mr->rkey;
>> +DECR_CQCOUNT(&r_xprt->rx_ep);
>> +
>> +read_lock(&ia->ri_qplock);
>> +while (seg1->mr_nsegs--)
>> +rpcrdma_unmap_one(ia, seg++);
>> +rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);

Re: [PATCH v1 08/16] xprtrdma: Add "reset MRs" memreg op

2015-03-24 Thread Chuck Lever

On Mar 24, 2015, at 7:27 AM, Devesh Sharma  wrote:

>> -Original Message-
>> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
>> ow...@vger.kernel.org] On Behalf Of Chuck Lever
>> Sent: Saturday, March 14, 2015 2:58 AM
>> To: linux-rdma@vger.kernel.org
>> Subject: [PATCH v1 08/16] xprtrdma: Add "reset MRs" memreg op
>> 
>> This method is invoked when a transport instance is about to be reconnected.
>> Each Memory Region object is reset to its initial state.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>> net/sunrpc/xprtrdma/fmr_ops.c  |   23 
>> net/sunrpc/xprtrdma/frwr_ops.c |   46 
>> net/sunrpc/xprtrdma/physical_ops.c |6 ++
>> net/sunrpc/xprtrdma/verbs.c|  103 
>> +---
>> net/sunrpc/xprtrdma/xprt_rdma.h|1
>> 5 files changed, 78 insertions(+), 101 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>> index 1501db0..1ccb3de 100644
>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>> @@ -156,10 +156,33 @@ fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
>>  i += __fmr_unmap(r_xprt, &req->rl_segments[i]);
>>  }
>> 
>> +/* After a disconnect, unmap all FMRs.
>> + *
>> + * This is invoked only in the transport connect worker in order
>> + * to serialize with rpcrdma_register_fmr_external().
>> + */
>> +static void
>> +fmr_op_reset(struct rpcrdma_xprt *r_xprt)
>> +{
>> +struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>> +struct rpcrdma_mw *r;
>> +LIST_HEAD(list);
>> +int rc;
>> +
>> +list_for_each_entry(r, &buf->rb_all, mw_all)
>> +list_add(&r->r.fmr->list, &list);
>> +
>> +rc = ib_unmap_fmr(&list);
>> +if (rc)
>> +dprintk("RPC:   %s: ib_unmap_fmr failed %i\n",
>> +__func__, rc);
>> +}
>> +
>> const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
>>  .ro_map = fmr_op_map,
>>  .ro_unmap   = fmr_op_unmap,
>>  .ro_maxpages= fmr_op_maxpages,
>>  .ro_init= fmr_op_init,
>> +.ro_reset   = fmr_op_reset,
>>  .ro_displayname = "fmr",
>> };
>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
>> index 975372c..b4ce0e5 100644
>> --- a/net/sunrpc/xprtrdma/frwr_ops.c
>> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
>> @@ -81,6 +81,18 @@ out_err:
>>  return nsegs;
>> }
>> 
>> +static void
>> +__frwr_release(struct rpcrdma_mw *r)
>> +{
>> +int rc;
>> +
>> +rc = ib_dereg_mr(r->r.frmr.fr_mr);
>> +if (rc)
>> +dprintk("RPC:   %s: ib_dereg_mr status %i\n",
>> +__func__, rc);
>> +ib_free_fast_reg_page_list(r->r.frmr.fr_pgl);
>> +}
>> +
>> /* FRWR mode conveys a list of pages per chunk segment. The
>>  * maximum length of that list is the FRWR page list depth.
>>  */
>> @@ -226,10 +238,44 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req,
>>  i += __frwr_unmap(r_xprt, &req->rl_segments[i]);
>>  }
>> 
>> +/* After a disconnect, a flushed FAST_REG_MR can leave an FRMR in
>> + * an unusable state. Find FRMRs in this state and dereg / reg
>> + * each.  FRMRs that are VALID and attached to an rpcrdma_req are
>> + * also torn down.
>> + *
>> + * This gives all in-use FRMRs a fresh rkey and leaves them INVALID.
>> + *
>> + * This is invoked only in the transport connect worker in order
>> + * to serialize with rpcrdma_register_frmr_external().
>> + */
>> +static void
>> +frwr_op_reset(struct rpcrdma_xprt *r_xprt)
>> +{
>> +struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>> +struct ib_device *device = r_xprt->rx_ia.ri_id->device;
>> +unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
>> +struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
>> +struct rpcrdma_mw *r;
>> +int rc;
>> +
>> +list_for_each_entry(r, &buf->rb_all, mw_all) {
>> +if (r->r.frmr.fr_state == FRMR_IS_INVALID)
>> +continue;
>> +
>> +__frwr_release(r);
>> +rc = __frwr_init(r, pd, device, depth);
>> + 

Re: [PATCH v1 10/16] xprtrdma: Add "open" memreg op

2015-03-24 Thread Chuck Lever

On Mar 24, 2015, at 7:34 AM, Devesh Sharma  wrote:

>> -Original Message-
>> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
>> ow...@vger.kernel.org] On Behalf Of Chuck Lever
>> Sent: Saturday, March 14, 2015 2:58 AM
>> To: linux-rdma@vger.kernel.org
>> Subject: [PATCH v1 10/16] xprtrdma: Add "open" memreg op
>> 
>> The open op determines the size of various transport data structures based on
>> device capabilities and memory registration mode.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>> net/sunrpc/xprtrdma/fmr_ops.c  |   22 +
>> net/sunrpc/xprtrdma/frwr_ops.c |   60
>> 
>> net/sunrpc/xprtrdma/physical_ops.c |   22 +
>> net/sunrpc/xprtrdma/verbs.c|   54 ++--
>> net/sunrpc/xprtrdma/xprt_rdma.h|3 ++
>> 5 files changed, 110 insertions(+), 51 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
>> index 3115e4b..96e6cd3 100644
>> --- a/net/sunrpc/xprtrdma/fmr_ops.c
>> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
>> @@ -46,6 +46,27 @@ out_err:
>>  return nsegs;
>> }
>> 
>> +static int
>> +fmr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
>> +struct rpcrdma_create_data_internal *cdata)
>> +{
>> +struct ib_device_attr *devattr = &ia->ri_devattr;
>> +unsigned int wrs, max_wrs;
>> +
>> +max_wrs = devattr->max_qp_wr;
>> +if (cdata->max_requests > max_wrs)
>> +cdata->max_requests = max_wrs;
>> +
>> +wrs = cdata->max_requests;
>> +ep->rep_attr.cap.max_send_wr = wrs;
>> +ep->rep_attr.cap.max_recv_wr = wrs;
>> +
>> +dprintk("RPC:   %s: pre-allocating %u send WRs, %u recv WRs\n",
>> +__func__, ep->rep_attr.cap.max_send_wr,
>> +ep->rep_attr.cap.max_recv_wr);
>> +return 0;
>> +}
>> +
>> /* FMR mode conveys up to 64 pages of payload per chunk segment.
>>  */
>> static size_t
>> @@ -201,6 +222,7 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)
>>  const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
>>  .ro_map = fmr_op_map,
>>  .ro_unmap   = fmr_op_unmap,
>> +.ro_open= fmr_op_open,
>>  .ro_maxpages= fmr_op_maxpages,
>>  .ro_init= fmr_op_init,
>>  .ro_reset   = fmr_op_reset,
>> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
>> index fc3a228..9bb4b2d 100644
>> --- a/net/sunrpc/xprtrdma/frwr_ops.c
>> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
>> @@ -93,6 +93,65 @@ __frwr_release(struct rpcrdma_mw *r)
>>  ib_free_fast_reg_page_list(r->r.frmr.fr_pgl);
>> }
>> 
>> +static int
>> +frwr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
>> + struct rpcrdma_create_data_internal *cdata)
>> +{
>> +struct ib_device_attr *devattr = &ia->ri_devattr;
>> +unsigned int wrs, max_wrs;
>> +int depth = 7;
>> +
>> +max_wrs = devattr->max_qp_wr;
>> +if (cdata->max_requests > max_wrs)
>> +cdata->max_requests = max_wrs;
>> +
>> +wrs = cdata->max_requests;
>> +ep->rep_attr.cap.max_send_wr = wrs;
>> +ep->rep_attr.cap.max_recv_wr = wrs;
>> +
>> +ia->ri_max_frmr_depth =
>> +min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
>> +  devattr->max_fast_reg_page_list_len);
>> +dprintk("RPC:   %s: device's max FR page list len = %u\n",
>> +__func__, ia->ri_max_frmr_depth);
>> +
>> +/* Add room for frmr register and invalidate WRs.
>> + * 1. FRMR reg WR for head
>> + * 2. FRMR invalidate WR for head
>> + * 3. N FRMR reg WRs for pagelist
>> + * 4. N FRMR invalidate WRs for pagelist
>> + * 5. FRMR reg WR for tail
>> + * 6. FRMR invalidate WR for tail
>> + * 7. The RDMA_SEND WR
>> + */
>> +
>> +/* Calculate N if the device max FRMR depth is smaller than
>> + * RPCRDMA_MAX_DATA_SEGS.
>> + */
>> +if (ia->ri_max_frmr_depth < RPCRDMA_MAX_DATA_SEGS) {
>> +int delta = RPCRDMA_MAX_DATA_SEGS - ia->ri_max_frmr_depth;
>> +
>> +do {
>> +depth += 2; /* FRMR reg + inva

[PATCH v2 00/15] NFS/RDMA patches proposed for 4.1

2015-03-24 Thread Chuck Lever
This is a series of client-side patches for NFS/RDMA. In preparation
for increasing the transport credit limit and maximum rsize/wsize,
I've re-factored the memory registration logic into separate files,
invoked via a method API.

The two main optimizations in v1 of this series have been dropped.
Sagi Grimberg didn't like the complexity of the solution, and there
isn't enough time to rework it, test the new version, and get it
reviewed before the 4.1 merge window opens. I'm going to prepare
these for 4.2.

Fixes suggested by reviewers have been included before the
refactoring patches to make it easier to backport them to previous
kernels.

The series is available in the nfs-rdma-for-4.1 topic branch at

git://linux-nfs.org/projects/cel/cel-2.6.git

Changes since v1:
- Rebased on 4.0-rc5
- Main optimizations postponed to 4.2
- Addressed review comments from Anna, Sagi, and Devesh

---

Chuck Lever (15):
  SUNRPC: Introduce missing well-known netids
  xprtrdma: Display IPv6 addresses and port numbers correctly
  xprtrdma: Perform a full marshal on retransmit
  xprtrdma: Byte-align FRWR registration
  xprtrdma: Prevent infinite loop in rpcrdma_ep_create()
  xprtrdma: Add vector of ops for each memory registration strategy
  xprtrdma: Add a "max_payload" op for each memreg mode
  xprtrdma: Add a "register_external" op for each memreg mode
  xprtrdma: Add a "deregister_external" op for each memreg mode
  xprtrdma: Add "init MRs" memreg op
  xprtrdma: Add "reset MRs" memreg op
  xprtrdma: Add "destroy MRs" memreg op
  xprtrdma: Add "open" memreg op
  xprtrdma: Handle non-SEND completions via a callout
  xprtrdma: Make rpcrdma_{un}map_one() into inline functions


 include/linux/sunrpc/msg_prot.h|8 
 net/sunrpc/xprtrdma/Makefile   |3 
 net/sunrpc/xprtrdma/fmr_ops.c  |  208 +++
 net/sunrpc/xprtrdma/frwr_ops.c |  353 ++
 net/sunrpc/xprtrdma/physical_ops.c |   94 +
 net/sunrpc/xprtrdma/rpc_rdma.c |   87 ++--
 net/sunrpc/xprtrdma/transport.c|   61 ++-
 net/sunrpc/xprtrdma/verbs.c|  699 +++-
 net/sunrpc/xprtrdma/xprt_rdma.h|   90 -
 9 files changed, 882 insertions(+), 721 deletions(-)
 create mode 100644 net/sunrpc/xprtrdma/fmr_ops.c
 create mode 100644 net/sunrpc/xprtrdma/frwr_ops.c
 create mode 100644 net/sunrpc/xprtrdma/physical_ops.c

--
Chuck Lever


[PATCH v2 06/15] xprtrdma: Add vector of ops for each memory registration strategy

2015-03-24 Thread Chuck Lever
Instead of employing switch() statements, let's use the typical
Linux kernel idiom for handling behavioral variation: virtual
functions.

Start by defining a vector of operations for each supported memory
registration mode, and by adding a source file for each mode.
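
For readers unfamiliar with the idiom, a minimal, self-contained sketch
of dispatching through an ops vector follows (names and numbers are
illustrative only; the real xprtrdma structure is in the diff below):

#include <stdio.h>
#include <stddef.h>

/* One vector of function pointers per strategy. A vector is selected
 * once at setup time and dispatched through thereafter, so the hot
 * path contains no switch() on the registration mode.
 */
struct memreg_ops {
	size_t (*ro_maxpages)(void);
	const char *ro_displayname;
};

static size_t fmr_maxpages(void)  { return 64; }
static size_t frwr_maxpages(void) { return 256; }

static const struct memreg_ops fmr_ops  = { fmr_maxpages, "fmr" };
static const struct memreg_ops frwr_ops = { frwr_maxpages, "frwr" };

int main(void)
{
	const struct memreg_ops *ops = &frwr_ops;	/* chosen per device */

	printf("%s: %zu pages\n", ops->ro_displayname, ops->ro_maxpages());
	return 0;
}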

Signed-off-by: Chuck Lever 
Reviewed-by: Sagi Grimberg 
---
 net/sunrpc/xprtrdma/Makefile   |3 ++-
 net/sunrpc/xprtrdma/fmr_ops.c  |   22 ++
 net/sunrpc/xprtrdma/frwr_ops.c |   22 ++
 net/sunrpc/xprtrdma/physical_ops.c |   24 
 net/sunrpc/xprtrdma/verbs.c|   11 +++
 net/sunrpc/xprtrdma/xprt_rdma.h|   12 
 6 files changed, 89 insertions(+), 5 deletions(-)
 create mode 100644 net/sunrpc/xprtrdma/fmr_ops.c
 create mode 100644 net/sunrpc/xprtrdma/frwr_ops.c
 create mode 100644 net/sunrpc/xprtrdma/physical_ops.c

diff --git a/net/sunrpc/xprtrdma/Makefile b/net/sunrpc/xprtrdma/Makefile
index da5136f..579f72b 100644
--- a/net/sunrpc/xprtrdma/Makefile
+++ b/net/sunrpc/xprtrdma/Makefile
@@ -1,6 +1,7 @@
 obj-$(CONFIG_SUNRPC_XPRT_RDMA_CLIENT) += xprtrdma.o
 
-xprtrdma-y := transport.o rpc_rdma.o verbs.o
+xprtrdma-y := transport.o rpc_rdma.o verbs.o \
+   fmr_ops.o frwr_ops.o physical_ops.o
 
 obj-$(CONFIG_SUNRPC_XPRT_RDMA_SERVER) += svcrdma.o
 
diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
new file mode 100644
index 000..ffb7d93
--- /dev/null
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -0,0 +1,22 @@
+/*
+ * Copyright (c) 2015 Oracle.  All rights reserved.
+ * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
+ */
+
+/* Lightweight memory registration using Fast Memory Regions (FMR).
+ * Referred to sometimes as MTHCAFMR mode.
+ *
+ * FMR uses synchronous memory registration and deregistration.
+ * FMR registration is known to be fast, but FMR deregistration
+ * can take tens of usecs to complete.
+ */
+
+#include "xprt_rdma.h"
+
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+# define RPCDBG_FACILITY   RPCDBG_TRANS
+#endif
+
+const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
+   .ro_displayname = "fmr",
+};
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
new file mode 100644
index 000..79173f9
--- /dev/null
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -0,0 +1,22 @@
+/*
+ * Copyright (c) 2015 Oracle.  All rights reserved.
+ * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
+ */
+
+/* Lightweight memory registration using Fast Registration Work
+ * Requests (FRWR). Also referred to sometimes as FRMR mode.
+ *
+ * FRWR features ordered asynchronous registration and deregistration
+ * of arbitrarily sized memory regions. This is the fastest and safest
+ * but most complex memory registration mode.
+ */
+
+#include "xprt_rdma.h"
+
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+# define RPCDBG_FACILITY   RPCDBG_TRANS
+#endif
+
+const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
+   .ro_displayname = "frwr",
+};
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
new file mode 100644
index 000..b0922ac
--- /dev/null
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -0,0 +1,24 @@
+/*
+ * Copyright (c) 2015 Oracle.  All rights reserved.
+ * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
+ */
+
+/* No-op chunk preparation. All client memory is pre-registered.
+ * Sometimes referred to as ALLPHYSICAL mode.
+ *
+ * Physical registration is simple because all client memory is
+ * pre-registered and never deregistered. This mode is good for
+ * adapter bring up, but is considered not safe: the server is
+ * trusted not to abuse its access to client memory not involved
+ * in RDMA I/O.
+ */
+
+#include "xprt_rdma.h"
+
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+# define RPCDBG_FACILITY   RPCDBG_TRANS
+#endif
+
+const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
+   .ro_displayname = "physical",
+};
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 99752b5..c3319e1 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -492,10 +492,10 @@ connected:
int ird = attr->max_dest_rd_atomic;
int tird = ep->rep_remote_cma.responder_resources;
 
-   pr_info("rpcrdma: connection to %pIS:%u on %s, memreg %d slots 
%d ird %d%s\n",
+   pr_info("rpcrdma: connection to %pIS:%u on %s, memreg '%s', %d 
credits, %d responders%s\n",
sap, rpc_get_port(sap),
ia->ri_id->device->name,
-   ia->ri_memreg_strategy,
+   ia->ri_ops->ro_displayname,
xprt->rx_buf.rb_max_requests,
ird, ird < 4 &

[PATCH v2 05/15] xprtrdma: Prevent infinite loop in rpcrdma_ep_create()

2015-03-24 Thread Chuck Lever
If a provider advertises a zero max_fast_reg_page_list_len, FRWR
depth detection loops forever. Instead of just failing the mount,
try other memory registration modes.
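
The failure mode, reconstructed from the FRMR depth-sizing loop that
appears later in this series (a sketch, not the exact hunk):

	/* With max_fast_reg_page_list_len == 0, ri_max_frmr_depth is
	 * min(RPCRDMA_MAX_DATA_SEGS, 0) == 0, and then:
	 */
	delta = RPCRDMA_MAX_DATA_SEGS - ia->ri_max_frmr_depth;
	do {
		depth += 2;
		delta -= ia->ri_max_frmr_depth;	/* subtracts zero... */
	} while (delta > 0);			/* ...never terminates */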

Fixes: 0fc6c4e7bb28 ("xprtrdma: mind the device's max fast . . .")
Reported-by: Devesh Sharma 
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 60f3317..99752b5 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -618,9 +618,10 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
 
if (memreg == RPCRDMA_FRMR) {
/* Requires both frmr reg and local dma lkey */
-   if ((devattr->device_cap_flags &
+   if (((devattr->device_cap_flags &
 (IB_DEVICE_MEM_MGT_EXTENSIONS|IB_DEVICE_LOCAL_DMA_LKEY)) !=
-   (IB_DEVICE_MEM_MGT_EXTENSIONS|IB_DEVICE_LOCAL_DMA_LKEY)) {
+   (IB_DEVICE_MEM_MGT_EXTENSIONS|IB_DEVICE_LOCAL_DMA_LKEY)) ||
+ (devattr->max_fast_reg_page_list_len == 0)) {
dprintk("RPC:   %s: FRMR registration "
"not supported by HCA\n", __func__);
memreg = RPCRDMA_MTHCAFMR;



[PATCH v2 07/15] xprtrdma: Add a "max_payload" op for each memreg mode

2015-03-24 Thread Chuck Lever
The max_payload computation is generalized to ensure that the
payload maximum is the lesser of RPCRDMA_MAX_DATA_SEGS and the number
of data segments that can be transmitted in an inline buffer.
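
As a worked illustration (every number below is an assumption chosen
for the example, not a value taken from the patch):

	/* Suppose the smaller inline buffer is 1024 bytes, the minimum
	 * RPC/RDMA header is 28 bytes, and each struct rpcrdma_segment
	 * occupies 16 bytes:
	 *
	 *	rpcrdma_max_segments() = (1024 - 28) / 16 = 62
	 *
	 * With RPCRDMA_MAX_DATA_SEGS = 64 and 4KB pages:
	 *
	 *	physical: min(64, 62 * 1)  = 62 pages -> 248KB max payload
	 *	fmr:	  min(64, 62 * 64) = 64 pages -> 256KB max payload
	 */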

Signed-off-by: Chuck Lever 
Reviewed-by: Sagi Grimberg 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   13 ++
 net/sunrpc/xprtrdma/frwr_ops.c |   13 ++
 net/sunrpc/xprtrdma/physical_ops.c |   10 +++
 net/sunrpc/xprtrdma/transport.c|5 +++-
 net/sunrpc/xprtrdma/verbs.c|   49 +++-
 net/sunrpc/xprtrdma/xprt_rdma.h|5 +++-
 6 files changed, 59 insertions(+), 36 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index ffb7d93..eec2660 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -17,6 +17,19 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+/* Maximum scatter/gather per FMR */
+#define RPCRDMA_MAX_FMR_SGES   (64)
+
+/* FMR mode conveys up to 64 pages of payload per chunk segment.
+ */
+static size_t
+fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
+{
+   return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
+rpcrdma_max_segments(r_xprt) * RPCRDMA_MAX_FMR_SGES);
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
+   .ro_maxpages= fmr_op_maxpages,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 79173f9..73a5ac8 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -17,6 +17,19 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+/* FRWR mode conveys a list of pages per chunk segment. The
+ * maximum length of that list is the FRWR page list depth.
+ */
+static size_t
+frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+
+   return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
+rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
+   .ro_maxpages= frwr_op_maxpages,
.ro_displayname = "frwr",
 };
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index b0922ac..28ade19 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -19,6 +19,16 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+/* PHYSICAL memory registration conveys one page per chunk segment.
+ */
+static size_t
+physical_op_maxpages(struct rpcrdma_xprt *r_xprt)
+{
+   return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
+rpcrdma_max_segments(r_xprt));
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
+   .ro_maxpages= physical_op_maxpages,
.ro_displayname = "physical",
 };
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 97f6562..da71a24 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -406,7 +406,10 @@ xprt_setup_rdma(struct xprt_create *args)
  xprt_rdma_connect_worker);
 
xprt_rdma_format_addresses(xprt);
-   xprt->max_payload = rpcrdma_max_payload(new_xprt);
+   xprt->max_payload = new_xprt->rx_ia.ri_ops->ro_maxpages(new_xprt);
+   if (xprt->max_payload == 0)
+   goto out4;
+   xprt->max_payload <<= PAGE_SHIFT;
dprintk("RPC:   %s: transport data payload maximum: %zu bytes\n",
__func__, xprt->max_payload);
 
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index c3319e1..da55cda 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -2212,43 +2212,24 @@ rpcrdma_ep_post_recv(struct rpcrdma_ia *ia,
return rc;
 }
 
-/* Physical mapping means one Read/Write list entry per-page.
- * All list entries must fit within an inline buffer
- *
- * NB: The server must return a Write list for NFS READ,
- * which has the same constraint. Factor in the inline
- * rsize as well.
+/* How many chunk list items fit within our inline buffers?
  */
-static size_t
-rpcrdma_physical_max_payload(struct rpcrdma_xprt *r_xprt)
+unsigned int
+rpcrdma_max_segments(struct rpcrdma_xprt *r_xprt)
 {
struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
-   unsigned int inline_size, pages;
-
-   inline_size = min_t(unsigned int,
-   cdata->inline_wsize, cdata->inline_rsize);
-   inline_size -= RPCRDMA_HDRLEN_MIN;
-   pages = inline_size / sizeof(struct rpcrdma_segment);
-   return pages << PAGE_SHIFT;
-}
+   int bytes, segments;
 
-static size_t
-rpcrdma_mr_max_payload(struct rpcrdma_xprt *r_xprt)
-{
-   return RPCRDMA_MAX_DATA_SEGS <&l

[PATCH v2 01/15] SUNRPC: Introduce missing well-known netids

2015-03-24 Thread Chuck Lever
Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/msg_prot.h |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/msg_prot.h b/include/linux/sunrpc/msg_prot.h
index aadc6a0..8073713 100644
--- a/include/linux/sunrpc/msg_prot.h
+++ b/include/linux/sunrpc/msg_prot.h
@@ -142,12 +142,18 @@ typedef __be32rpc_fraghdr;
(RPC_REPHDRSIZE + (2 + RPC_MAX_AUTH_SIZE/4))
 
 /*
- * RFC1833/RFC3530 rpcbind (v3+) well-known netid's.
+ * Well-known netids. See:
+ *
+ *   http://www.iana.org/assignments/rpc-netids/rpc-netids.xhtml
  */
 #define RPCBIND_NETID_UDP  "udp"
 #define RPCBIND_NETID_TCP  "tcp"
+#define RPCBIND_NETID_RDMA "rdma"
+#define RPCBIND_NETID_SCTP "sctp"
 #define RPCBIND_NETID_UDP6 "udp6"
 #define RPCBIND_NETID_TCP6 "tcp6"
+#define RPCBIND_NETID_RDMA6"rdma6"
+#define RPCBIND_NETID_SCTP6"sctp6"
 #define RPCBIND_NETID_LOCAL"local"
 
 /*



[PATCH v2 03/15] xprtrdma: Perform a full marshal on retransmit

2015-03-24 Thread Chuck Lever
Commit 6ab59945f292 ("xprtrdma: Update rkeys after transport
reconnect") added logic in the ->send_request path to update the
chunk list when an RPC/RDMA request is retransmitted.

Note that rpc_xdr_encode() resets and re-encodes the entire RPC
send buffer for each retransmit of an RPC. The RPC send buffer
is not preserved from the previous transmission of an RPC.

Revert 6ab59945f292, and instead, just force each request to be
fully marshaled every time through ->send_request. This should
preserve the fix from 6ab59945f292, while also performing pullup
during retransmits.

Signed-off-by: Chuck Lever 
Acked-by: Sagi Grimberg 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |   71 ++-
 net/sunrpc/xprtrdma/transport.c |5 +--
 net/sunrpc/xprtrdma/xprt_rdma.h |   10 -
 3 files changed, 34 insertions(+), 52 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 91ffde8..41456d9 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -53,6 +53,14 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+enum rpcrdma_chunktype {
+   rpcrdma_noch = 0,
+   rpcrdma_readch,
+   rpcrdma_areadch,
+   rpcrdma_writech,
+   rpcrdma_replych
+};
+
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
 static const char transfertypes[][12] = {
"pure inline",  /* no chunks */
@@ -284,28 +292,6 @@ out:
 }
 
 /*
- * Marshal chunks. This routine returns the header length
- * consumed by marshaling.
- *
- * Returns positive RPC/RDMA header size, or negative errno.
- */
-
-ssize_t
-rpcrdma_marshal_chunks(struct rpc_rqst *rqst, ssize_t result)
-{
-   struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
-   struct rpcrdma_msg *headerp = rdmab_to_msg(req->rl_rdmabuf);
-
-   if (req->rl_rtype != rpcrdma_noch)
-   result = rpcrdma_create_chunks(rqst, &rqst->rq_snd_buf,
-  headerp, req->rl_rtype);
-   else if (req->rl_wtype != rpcrdma_noch)
-   result = rpcrdma_create_chunks(rqst, &rqst->rq_rcv_buf,
-  headerp, req->rl_wtype);
-   return result;
-}
-
-/*
  * Copy write data inline.
  * This function is used for "small" requests. Data which is passed
  * to RPC via iovecs (or page list) is copied directly into the
@@ -397,6 +383,7 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
char *base;
size_t rpclen, padlen;
ssize_t hdrlen;
+   enum rpcrdma_chunktype rtype, wtype;
struct rpcrdma_msg *headerp;
 
/*
@@ -433,13 +420,13 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 * into pages; otherwise use reply chunks.
 */
if (rqst->rq_rcv_buf.buflen <= RPCRDMA_INLINE_READ_THRESHOLD(rqst))
-   req->rl_wtype = rpcrdma_noch;
+   wtype = rpcrdma_noch;
else if (rqst->rq_rcv_buf.page_len == 0)
-   req->rl_wtype = rpcrdma_replych;
+   wtype = rpcrdma_replych;
else if (rqst->rq_rcv_buf.flags & XDRBUF_READ)
-   req->rl_wtype = rpcrdma_writech;
+   wtype = rpcrdma_writech;
else
-   req->rl_wtype = rpcrdma_replych;
+   wtype = rpcrdma_replych;
 
/*
 * Chunks needed for arguments?
@@ -456,16 +443,16 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 * TBD check NFSv4 setacl
 */
if (rqst->rq_snd_buf.len <= RPCRDMA_INLINE_WRITE_THRESHOLD(rqst))
-   req->rl_rtype = rpcrdma_noch;
+   rtype = rpcrdma_noch;
else if (rqst->rq_snd_buf.page_len == 0)
-   req->rl_rtype = rpcrdma_areadch;
+   rtype = rpcrdma_areadch;
else
-   req->rl_rtype = rpcrdma_readch;
+   rtype = rpcrdma_readch;
 
/* The following simplification is not true forever */
-   if (req->rl_rtype != rpcrdma_noch && req->rl_wtype == rpcrdma_replych)
-   req->rl_wtype = rpcrdma_noch;
-   if (req->rl_rtype != rpcrdma_noch && req->rl_wtype != rpcrdma_noch) {
+   if (rtype != rpcrdma_noch && wtype == rpcrdma_replych)
+   wtype = rpcrdma_noch;
+   if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC:   %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
@@ -479,7 +466,7 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 * When padding is in use and applies to the transfer, insert
 * it and change the message type.
 */
-   if (req->rl_rtype == rpcrdma_noch) {
+   if (rtype == rpcrdma_noch) {
 
padlen = rpcrdma_inline_pullup(rqst,
RPCRDMA_INLINE_PAD_VALUE(rqst));

[PATCH v2 02/15] xprtrdma: Display IPv6 addresses and port numbers correctly

2015-03-24 Thread Chuck Lever
Signed-off-by: Chuck Lever 
Reviewed-by: Sagi Grimberg 
---
 net/sunrpc/xprtrdma/transport.c |   47 ---
 net/sunrpc/xprtrdma/verbs.c |   21 +++--
 2 files changed, 47 insertions(+), 21 deletions(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 2e192ba..9be7f97 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -157,12 +157,47 @@ static struct ctl_table sunrpc_table[] = {
 static struct rpc_xprt_ops xprt_rdma_procs;/* forward reference */
 
 static void
+xprt_rdma_format_addresses4(struct rpc_xprt *xprt, struct sockaddr *sap)
+{
+   struct sockaddr_in *sin = (struct sockaddr_in *)sap;
+   char buf[20];
+
+   snprintf(buf, sizeof(buf), "%08x", ntohl(sin->sin_addr.s_addr));
+   xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);
+
+   xprt->address_strings[RPC_DISPLAY_NETID] = RPCBIND_NETID_RDMA;
+}
+
+static void
+xprt_rdma_format_addresses6(struct rpc_xprt *xprt, struct sockaddr *sap)
+{
+   struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)sap;
+   char buf[40];
+
+   snprintf(buf, sizeof(buf), "%pi6", &sin6->sin6_addr);
+   xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);
+
+   xprt->address_strings[RPC_DISPLAY_NETID] = RPCBIND_NETID_RDMA6;
+}
+
+static void
 xprt_rdma_format_addresses(struct rpc_xprt *xprt)
 {
struct sockaddr *sap = (struct sockaddr *)
&rpcx_to_rdmad(xprt).addr;
-   struct sockaddr_in *sin = (struct sockaddr_in *)sap;
-   char buf[64];
+   char buf[128];
+
+   switch (sap->sa_family) {
+   case AF_INET:
+   xprt_rdma_format_addresses4(xprt, sap);
+   break;
+   case AF_INET6:
+   xprt_rdma_format_addresses6(xprt, sap);
+   break;
+   default:
+   pr_err("rpcrdma: Unrecognized address family\n");
+   return;
+   }
 
(void)rpc_ntop(sap, buf, sizeof(buf));
xprt->address_strings[RPC_DISPLAY_ADDR] = kstrdup(buf, GFP_KERNEL);
@@ -170,16 +205,10 @@ xprt_rdma_format_addresses(struct rpc_xprt *xprt)
snprintf(buf, sizeof(buf), "%u", rpc_get_port(sap));
xprt->address_strings[RPC_DISPLAY_PORT] = kstrdup(buf, GFP_KERNEL);
 
-   xprt->address_strings[RPC_DISPLAY_PROTO] = "rdma";
-
-   snprintf(buf, sizeof(buf), "%08x", ntohl(sin->sin_addr.s_addr));
-   xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);
-
snprintf(buf, sizeof(buf), "%4hx", rpc_get_port(sap));
xprt->address_strings[RPC_DISPLAY_HEX_PORT] = kstrdup(buf, GFP_KERNEL);
 
-   /* netid */
-   xprt->address_strings[RPC_DISPLAY_NETID] = "rdma";
+   xprt->address_strings[RPC_DISPLAY_PROTO] = "rdma";
 }
 
 static void
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 124676c..1aa55b7 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "xprt_rdma.h"
@@ -424,7 +425,7 @@ rpcrdma_conn_upcall(struct rdma_cm_id *id, struct rdma_cm_event *event)
struct rpcrdma_ia *ia = &xprt->rx_ia;
struct rpcrdma_ep *ep = &xprt->rx_ep;
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
-   struct sockaddr_in *addr = (struct sockaddr_in *) &ep->rep_remote_addr;
+   struct sockaddr *sap = (struct sockaddr *)&ep->rep_remote_addr;
 #endif
struct ib_qp_attr *attr = &ia->ri_qp_attr;
struct ib_qp_init_attr *iattr = &ia->ri_qp_init_attr;
@@ -480,9 +481,8 @@ connected:
wake_up_all(&ep->rep_connect_wait);
/*FALLTHROUGH*/
default:
-   dprintk("RPC:   %s: %pI4:%u (ep 0x%p): %s\n",
-   __func__, &addr->sin_addr.s_addr,
-   ntohs(addr->sin_port), ep,
+   dprintk("RPC:   %s: %pIS:%u (ep 0x%p): %s\n",
+   __func__, sap, rpc_get_port(sap), ep,
CONNECTION_MSG(event->event));
break;
}
@@ -491,19 +491,16 @@ connected:
if (connstate == 1) {
int ird = attr->max_dest_rd_atomic;
int tird = ep->rep_remote_cma.responder_resources;
-   printk(KERN_INFO "rpcrdma: connection to %pI4:%u "
-   "on %s, memreg %d slots %d ird %d%s\n",
-   &addr->sin_addr.s_addr,
-   ntohs(addr->sin_port),
+
+   pr_info("rpcrdma: connection to %pIS:%u on %s, memreg %d slots 
%d ird %d%s\n",
+   

[PATCH v2 04/15] xprtrdma: Byte-align FRWR registration

2015-03-24 Thread Chuck Lever
The RPC/RDMA transport's FRWR registration logic registers whole
pages. This means areas in the first and last pages that are not
involved in the RDMA I/O are needlessly exposed to the server.

Buffered I/O is typically page-aligned, so not a problem there. But
for direct I/O, which can be byte-aligned, and for reply chunks,
which are nearly always smaller than a page, the transport could
expose memory outside the I/O buffer.

FRWR allows byte-aligned memory registration, so let's use it as
it was intended.
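
A concrete (hypothetical) direct-I/O case illustrates the exposure:

	/* A 700-byte buffer starting 300 bytes into a 4KB page:
	 * pageoff = 300, len = 700.
	 *
	 * Before: whole pages are registered --
	 *	iova_start = seg1->mr_dma;		(page start)
	 *	length	   = page_no << PAGE_SHIFT;	(4096 bytes)
	 * so 3396 bytes outside the I/O buffer are exposed.
	 *
	 * After: the registration is byte-aligned --
	 *	iova_start = seg1->mr_dma + pageoff;	(data start)
	 *	length	   = len;			(700 bytes)
	 */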

Reported-by: Sagi Grimberg 
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |   12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 1aa55b7..60f3317 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1924,23 +1924,19 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
break;
}
-   dprintk("RPC:   %s: Using frmr %p to map %d segments\n",
-   __func__, mw, i);
+   dprintk("RPC:   %s: Using frmr %p to map %d segments (%d bytes)\n",
+   __func__, mw, i, len);
 
frmr->fr_state = FRMR_IS_VALID;
 
memset(&fastreg_wr, 0, sizeof(fastreg_wr));
fastreg_wr.wr_id = (unsigned long)(void *)mw;
fastreg_wr.opcode = IB_WR_FAST_REG_MR;
-   fastreg_wr.wr.fast_reg.iova_start = seg1->mr_dma;
+   fastreg_wr.wr.fast_reg.iova_start = seg1->mr_dma + pageoff;
fastreg_wr.wr.fast_reg.page_list = frmr->fr_pgl;
fastreg_wr.wr.fast_reg.page_list_len = page_no;
fastreg_wr.wr.fast_reg.page_shift = PAGE_SHIFT;
-   fastreg_wr.wr.fast_reg.length = page_no << PAGE_SHIFT;
-   if (fastreg_wr.wr.fast_reg.length < len) {
-   rc = -EIO;
-   goto out_err;
-   }
+   fastreg_wr.wr.fast_reg.length = len;
 
/* Bump the key */
key = (u8)(mr->rkey & 0x00FF);



[PATCH v2 08/15] xprtrdma: Add a "register_external" op for each memreg mode

2015-03-24 Thread Chuck Lever
There is very little common processing among the different external
memory registration functions. Have rpcrdma_create_chunks() call
the registration method directly. This removes a stack frame and a
switch statement from the external registration path.
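
Schematically, the call-site change looks like this (simplified;
argument lists are abbreviated):

	/* Before: a helper containing a switch on the memreg mode */
	switch (ia->ri_memreg_strategy) {
	case RPCRDMA_FRMR:
		rc = rpcrdma_register_frmr_external(seg, ...);
		break;
	case RPCRDMA_MTHCAFMR:
		rc = rpcrdma_register_fmr_external(seg, ...);
		break;
	/* ... */
	}

	/* After: rpcrdma_create_chunks() invokes the method directly */
	n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs, writing);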

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   51 +++
 net/sunrpc/xprtrdma/frwr_ops.c |   82 ++
 net/sunrpc/xprtrdma/physical_ops.c |   17 
 net/sunrpc/xprtrdma/rpc_rdma.c |5 +
 net/sunrpc/xprtrdma/verbs.c|  168 +---
 net/sunrpc/xprtrdma/xprt_rdma.h|6 +
 6 files changed, 160 insertions(+), 169 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index eec2660..45fb646 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -29,7 +29,58 @@ fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * RPCRDMA_MAX_FMR_SGES);
 }
 
+/* Use the ib_map_phys_fmr() verb to register a memory region
+ * for remote access via RDMA READ or RDMA WRITE.
+ */
+static int
+fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
+  int nsegs, bool writing)
+{
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_mr_seg *seg1 = seg;
+   struct rpcrdma_mw *mw = seg1->rl_mw;
+   u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
+   int len, pageoff, i, rc;
+
+   pageoff = offset_in_page(seg1->mr_offset);
+   seg1->mr_offset -= pageoff; /* start of page */
+   seg1->mr_len += pageoff;
+   len = -pageoff;
+   if (nsegs > RPCRDMA_MAX_FMR_SGES)
+   nsegs = RPCRDMA_MAX_FMR_SGES;
+   for (i = 0; i < nsegs;) {
+   rpcrdma_map_one(ia, seg, writing);
+   physaddrs[i] = seg->mr_dma;
+   len += seg->mr_len;
+   ++seg;
+   ++i;
+   /* Check for holes */
+   if ((i < nsegs && offset_in_page(seg->mr_offset)) ||
+   offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
+   break;
+   }
+
+   rc = ib_map_phys_fmr(mw->r.fmr, physaddrs, i, seg1->mr_dma);
+   if (rc)
+   goto out_maperr;
+
+   seg1->mr_rkey = mw->r.fmr->rkey;
+   seg1->mr_base = seg1->mr_dma + pageoff;
+   seg1->mr_nsegs = i;
+   seg1->mr_len = len;
+   return i;
+
+out_maperr:
+   dprintk("RPC:   %s: ib_map_phys_fmr %u@0x%llx+%i (%d) status %i\n",
+   __func__, len, (unsigned long long)seg1->mr_dma,
+   pageoff, i, rc);
+   while (i--)
+   rpcrdma_unmap_one(ia, --seg);
+   return rc;
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
+   .ro_map = fmr_op_map,
.ro_maxpages= fmr_op_maxpages,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 73a5ac8..23e4d99 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -29,7 +29,89 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
 }
 
+/* Post a FAST_REG Work Request to register a memory region
+ * for remote access via RDMA READ or RDMA WRITE.
+ */
+static int
+frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
+   int nsegs, bool writing)
+{
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_mr_seg *seg1 = seg;
+   struct rpcrdma_mw *mw = seg1->rl_mw;
+   struct rpcrdma_frmr *frmr = &mw->r.frmr;
+   struct ib_mr *mr = frmr->fr_mr;
+   struct ib_send_wr fastreg_wr, *bad_wr;
+   u8 key;
+   int len, pageoff;
+   int i, rc;
+   int seg_len;
+   u64 pa;
+   int page_no;
+
+   pageoff = offset_in_page(seg1->mr_offset);
+   seg1->mr_offset -= pageoff; /* start of page */
+   seg1->mr_len += pageoff;
+   len = -pageoff;
+   if (nsegs > ia->ri_max_frmr_depth)
+   nsegs = ia->ri_max_frmr_depth;
+   for (page_no = i = 0; i < nsegs;) {
+   rpcrdma_map_one(ia, seg, writing);
+   pa = seg->mr_dma;
+   for (seg_len = seg->mr_len; seg_len > 0; seg_len -= PAGE_SIZE) {
+   frmr->fr_pgl->page_list[page_no++] = pa;
+   pa += PAGE_SIZE;
+   }
+   len += seg->mr_len;
+   ++seg;
+   ++i;
+   /* Check for holes */
+   if ((i < nsegs && offset_in_page(seg->mr_offset)) ||
+   offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
+   break;
+   }
+   dprintk(&

[PATCH v2 09/15] xprtrdma: Add a "deregister_external" op for each memreg mode

2015-03-24 Thread Chuck Lever
There is very little common processing among the different external
memory deregistration functions.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   27 
 net/sunrpc/xprtrdma/frwr_ops.c |   36 
 net/sunrpc/xprtrdma/physical_ops.c |   10 
 net/sunrpc/xprtrdma/rpc_rdma.c |   11 +++--
 net/sunrpc/xprtrdma/transport.c|4 +-
 net/sunrpc/xprtrdma/verbs.c|   81 
 net/sunrpc/xprtrdma/xprt_rdma.h|5 +-
 7 files changed, 84 insertions(+), 90 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 45fb646..888aa10 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -79,8 +79,35 @@ out_maperr:
return rc;
 }
 
+/* Use the ib_unmap_fmr() verb to prevent further remote
+ * access via RDMA READ or RDMA WRITE.
+ */
+static int
+fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_mr_seg *seg1 = seg;
+   int rc, nsegs = seg->mr_nsegs;
+   LIST_HEAD(l);
+
+   list_add(&seg1->rl_mw->r.fmr->list, &l);
+   rc = ib_unmap_fmr(&l);
+   read_lock(&ia->ri_qplock);
+   while (seg1->mr_nsegs--)
+   rpcrdma_unmap_one(ia, seg++);
+   read_unlock(&ia->ri_qplock);
+   if (rc)
+   goto out_err;
+   return nsegs;
+
+out_err:
+   dprintk("RPC:   %s: ib_unmap_fmr status %i\n", __func__, rc);
+   return nsegs;
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
+   .ro_unmap   = fmr_op_unmap,
.ro_maxpages= fmr_op_maxpages,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 23e4d99..35b725b 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -110,8 +110,44 @@ out_senderr:
return rc;
 }
 
+/* Post a LOCAL_INV Work Request to prevent further remote access
+ * via RDMA READ or RDMA WRITE.
+ */
+static int
+frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+   struct rpcrdma_mr_seg *seg1 = seg;
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct ib_send_wr invalidate_wr, *bad_wr;
+   int rc, nsegs = seg->mr_nsegs;
+
+   seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
+
+   memset(&invalidate_wr, 0, sizeof(invalidate_wr));
+   invalidate_wr.wr_id = (unsigned long)(void *)seg1->rl_mw;
+   invalidate_wr.opcode = IB_WR_LOCAL_INV;
+   invalidate_wr.ex.invalidate_rkey = seg1->rl_mw->r.frmr.fr_mr->rkey;
+   DECR_CQCOUNT(&r_xprt->rx_ep);
+
+   read_lock(&ia->ri_qplock);
+   while (seg1->mr_nsegs--)
+   rpcrdma_unmap_one(ia, seg++);
+   rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
+   read_unlock(&ia->ri_qplock);
+   if (rc)
+   goto out_err;
+   return nsegs;
+
+out_err:
+   /* Force rpcrdma_buffer_get() to retry */
+   seg1->rl_mw->r.frmr.fr_state = FRMR_IS_STALE;
+   dprintk("RPC:   %s: ib_post_send status %i\n", __func__, rc);
+   return nsegs;
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
+   .ro_unmap   = frwr_op_unmap,
.ro_maxpages= frwr_op_maxpages,
.ro_displayname = "frwr",
 };
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index 5a284ee..5b5a63a 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -44,8 +44,18 @@ physical_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
return 1;
 }
 
+/* Unmap a memory region, but leave it registered.
+ */
+static int
+physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+   rpcrdma_unmap_one(&r_xprt->rx_ia, seg);
+   return 1;
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
.ro_map = physical_op_map,
+   .ro_unmap   = physical_op_unmap,
.ro_maxpages= physical_op_maxpages,
.ro_displayname = "physical",
 };
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 6ab1d03..2c53ea9 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -284,11 +284,12 @@ rpcrdma_create_chunks(struct rpc_rqst *rqst, struct xdr_buf *target,
return (unsigned char *)iptr - (unsigned char *)headerp;
 
 out:
-   if (r_xprt->rx_ia.

[PATCH v2 13/15] xprtrdma: Add "open" memreg op

2015-03-24 Thread Chuck Lever
The open op determines the size of various transport data structures
based on device capabilities and memory registration mode.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |8 ++
 net/sunrpc/xprtrdma/frwr_ops.c |   48 +++
 net/sunrpc/xprtrdma/physical_ops.c |8 ++
 net/sunrpc/xprtrdma/verbs.c|   49 ++--
 net/sunrpc/xprtrdma/xprt_rdma.h|3 ++
 5 files changed, 70 insertions(+), 46 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index e9ca594..e8a9837 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -20,6 +20,13 @@
 /* Maximum scatter/gather per FMR */
 #define RPCRDMA_MAX_FMR_SGES   (64)
 
+static int
+fmr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
+   struct rpcrdma_create_data_internal *cdata)
+{
+   return 0;
+}
+
 /* FMR mode conveys up to 64 pages of payload per chunk segment.
  */
 static size_t
@@ -188,6 +195,7 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
.ro_unmap   = fmr_op_unmap,
+   .ro_open= fmr_op_open,
.ro_maxpages= fmr_op_maxpages,
.ro_init= fmr_op_init,
.ro_reset   = fmr_op_reset,
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 121e400..e17d54d 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -58,6 +58,53 @@ __frwr_release(struct rpcrdma_mw *r)
ib_free_fast_reg_page_list(r->r.frmr.fr_pgl);
 }
 
+static int
+frwr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
+struct rpcrdma_create_data_internal *cdata)
+{
+   struct ib_device_attr *devattr = &ia->ri_devattr;
+   int depth, delta;
+
+   ia->ri_max_frmr_depth =
+   min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
+ devattr->max_fast_reg_page_list_len);
+   dprintk("RPC:   %s: device's max FR page list len = %u\n",
+   __func__, ia->ri_max_frmr_depth);
+
+   /* Add room for frmr register and invalidate WRs.
+* 1. FRMR reg WR for head
+* 2. FRMR invalidate WR for head
+* 3. N FRMR reg WRs for pagelist
+* 4. N FRMR invalidate WRs for pagelist
+* 5. FRMR reg WR for tail
+* 6. FRMR invalidate WR for tail
+* 7. The RDMA_SEND WR
+*/
+   depth = 7;
+
+   /* Calculate N if the device max FRMR depth is smaller than
+* RPCRDMA_MAX_DATA_SEGS.
+*/
+   if (ia->ri_max_frmr_depth < RPCRDMA_MAX_DATA_SEGS) {
+   delta = RPCRDMA_MAX_DATA_SEGS - ia->ri_max_frmr_depth;
+   do {
+   depth += 2; /* FRMR reg + invalidate */
+   delta -= ia->ri_max_frmr_depth;
+   } while (delta > 0);
+   }
+
+   ep->rep_attr.cap.max_send_wr *= depth;
+   if (ep->rep_attr.cap.max_send_wr > devattr->max_qp_wr) {
+   cdata->max_requests = devattr->max_qp_wr / depth;
+   if (!cdata->max_requests)
+   return -EINVAL;
+   ep->rep_attr.cap.max_send_wr = cdata->max_requests *
+  depth;
+   }
+
+   return 0;
+}
+
 /* FRWR mode conveys a list of pages per chunk segment. The
  * maximum length of that list is the FRWR page list depth.
  */
@@ -276,6 +323,7 @@ frwr_op_destroy(struct rpcrdma_buffer *buf)
 const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
.ro_unmap   = frwr_op_unmap,
+   .ro_open= frwr_op_open,
.ro_maxpages= frwr_op_maxpages,
.ro_init= frwr_op_init,
.ro_reset   = frwr_op_reset,
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index eb39011..0ba130b 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -19,6 +19,13 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+static int
+physical_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
+struct rpcrdma_create_data_internal *cdata)
+{
+   return 0;
+}
+
 /* PHYSICAL memory registration conveys one page per chunk segment.
  */
 static size_t
@@ -72,6 +79,7 @@ physical_op_destroy(struct rpcrdma_buffer *buf)
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
.ro_map = physical_op_map,
.ro_unmap   = physical_op_unmap,
+   
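
A worked example of the frwr_op_open() depth calculation above (the
device limits are assumptions):

	/* Assume RPCRDMA_MAX_DATA_SEGS = 64 and a device reporting
	 * max_fast_reg_page_list_len = 16 and max_qp_wr = 16384.
	 *
	 *	ri_max_frmr_depth = min(64, 16) = 16
	 *	delta = 64 - 16 = 48
	 *	loop:	depth  7 ->  9, delta 48 -> 32
	 *		depth  9 -> 11, delta 32 -> 16
	 *		depth 11 -> 13, delta 16 ->  0
	 *
	 * Each RPC may consume up to 13 send WRs, so max_send_wr is
	 * multiplied by 13; if that exceeds max_qp_wr, max_requests
	 * is scaled back to 16384 / 13 = 1260.
	 */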

[PATCH v2 10/15] xprtrdma: Add "init MRs" memreg op

2015-03-24 Thread Chuck Lever
This method is used when setting up a new transport instance to
create a pool of Memory Region objects that will be used to register
memory during operation.

Memory Regions are not needed for "physical" registration, since
->prepare and ->release are no-ops for that mode.

Signed-off-by: Chuck Lever 
Reviewed-by: Sagi Grimberg 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   42 +++
 net/sunrpc/xprtrdma/frwr_ops.c |   66 +++
 net/sunrpc/xprtrdma/physical_ops.c |7 ++
 net/sunrpc/xprtrdma/verbs.c|  104 +---
 net/sunrpc/xprtrdma/xprt_rdma.h|1 
 5 files changed, 119 insertions(+), 101 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 888aa10..825ce96 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -29,6 +29,47 @@ fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * RPCRDMA_MAX_FMR_SGES);
 }
 
+static int
+fmr_op_init(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   int mr_access_flags = IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ;
+   struct ib_fmr_attr fmr_attr = {
+   .max_pages  = RPCRDMA_MAX_FMR_SGES,
+   .max_maps   = 1,
+   .page_shift = PAGE_SHIFT
+   };
+   struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
+   struct rpcrdma_mw *r;
+   int i, rc;
+
+   INIT_LIST_HEAD(&buf->rb_mws);
+   INIT_LIST_HEAD(&buf->rb_all);
+
+   i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
+   dprintk("RPC:   %s: initalizing %d FMRs\n", __func__, i);
+
+   while (i--) {
+   r = kzalloc(sizeof(*r), GFP_KERNEL);
+   if (!r)
+   return -ENOMEM;
+
+   r->r.fmr = ib_alloc_fmr(pd, mr_access_flags, &fmr_attr);
+   if (IS_ERR(r->r.fmr))
+   goto out_fmr_err;
+
+   list_add(&r->mw_list, &buf->rb_mws);
+   list_add(&r->mw_all, &buf->rb_all);
+   }
+   return 0;
+
+out_fmr_err:
+   rc = PTR_ERR(r->r.fmr);
+   dprintk("RPC:   %s: ib_alloc_fmr status %i\n", __func__, rc);
+   kfree(r);
+   return rc;
+}
+
 /* Use the ib_map_phys_fmr() verb to register a memory region
  * for remote access via RDMA READ or RDMA WRITE.
  */
@@ -109,5 +150,6 @@ const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
.ro_unmap   = fmr_op_unmap,
.ro_maxpages= fmr_op_maxpages,
+   .ro_init= fmr_op_init,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 35b725b..9168c15 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -17,6 +17,35 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+static int
+__frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, struct ib_device *device,
+   unsigned int depth)
+{
+   struct rpcrdma_frmr *f = &r->r.frmr;
+   int rc;
+
+   f->fr_mr = ib_alloc_fast_reg_mr(pd, depth);
+   if (IS_ERR(f->fr_mr))
+   goto out_mr_err;
+   f->fr_pgl = ib_alloc_fast_reg_page_list(device, depth);
+   if (IS_ERR(f->fr_pgl))
+   goto out_list_err;
+   return 0;
+
+out_mr_err:
+   rc = PTR_ERR(f->fr_mr);
+   dprintk("RPC:   %s: ib_alloc_fast_reg_mr status %i\n",
+   __func__, rc);
+   return rc;
+
+out_list_err:
+   rc = PTR_ERR(f->fr_pgl);
+   dprintk("RPC:   %s: ib_alloc_fast_reg_page_list status %i\n",
+   __func__, rc);
+   ib_dereg_mr(f->fr_mr);
+   return rc;
+}
+
 /* FRWR mode conveys a list of pages per chunk segment. The
  * maximum length of that list is the FRWR page list depth.
  */
@@ -29,6 +58,42 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
 }
 
+static int
+frwr_op_init(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   struct ib_device *device = r_xprt->rx_ia.ri_id->device;
+   unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
+   struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
+   int i;
+
+   INIT_LIST_HEAD(&buf->rb_mws);
+   INIT_LIST_HEAD(&buf->rb_all);
+
+   i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
+   dprintk("RPC:   %s: initalizing %d FRMRs\n", __func__, i);
+
+   while (i--) {
+   struct rpcrdma_mw *r;
+   int rc;
+
+   r = kzalloc(sizeof(*r), GFP_KERNEL);
+   

[PATCH v2 11/15] xprtrdma: Add "reset MRs" memreg op

2015-03-24 Thread Chuck Lever
This method is invoked when a transport instance is about to be
reconnected. Each Memory Region object is reset to its initial
state.

Signed-off-by: Chuck Lever 
Reviewed-by: Sagi Grimberg 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   23 
 net/sunrpc/xprtrdma/frwr_ops.c |   51 ++
 net/sunrpc/xprtrdma/physical_ops.c |6 ++
 net/sunrpc/xprtrdma/verbs.c|  103 +---
 net/sunrpc/xprtrdma/xprt_rdma.h|1 
 5 files changed, 83 insertions(+), 101 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 825ce96..93261b0 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -146,10 +146,33 @@ out_err:
return nsegs;
 }
 
+/* After a disconnect, unmap all FMRs.
+ *
+ * This is invoked only in the transport connect worker in order
+ * to serialize with rpcrdma_register_fmr_external().
+ */
+static void
+fmr_op_reset(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   struct rpcrdma_mw *r;
+   LIST_HEAD(list);
+   int rc;
+
+   list_for_each_entry(r, &buf->rb_all, mw_all)
+   list_add(&r->r.fmr->list, &list);
+
+   rc = ib_unmap_fmr(&list);
+   if (rc)
+   dprintk("RPC:   %s: ib_unmap_fmr failed %i\n",
+   __func__, rc);
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
.ro_unmap   = fmr_op_unmap,
.ro_maxpages= fmr_op_maxpages,
.ro_init= fmr_op_init,
+   .ro_reset   = fmr_op_reset,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 9168c15..c2bb29d 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -46,6 +46,18 @@ out_list_err:
return rc;
 }
 
+static void
+__frwr_release(struct rpcrdma_mw *r)
+{
+   int rc;
+
+   rc = ib_dereg_mr(r->r.frmr.fr_mr);
+   if (rc)
+   dprintk("RPC:   %s: ib_dereg_mr status %i\n",
+   __func__, rc);
+   ib_free_fast_reg_page_list(r->r.frmr.fr_pgl);
+}
+
 /* FRWR mode conveys a list of pages per chunk segment. The
  * maximum length of that list is the FRWR page list depth.
  */
@@ -210,10 +222,49 @@ out_err:
return nsegs;
 }
 
+/* After a disconnect, a flushed FAST_REG_MR can leave an FRMR in
+ * an unusable state. Find FRMRs in this state and dereg / reg
+ * each.  FRMRs that are VALID and attached to an rpcrdma_req are
+ * also torn down.
+ *
+ * This gives all in-use FRMRs a fresh rkey and leaves them INVALID.
+ *
+ * This is invoked only in the transport connect worker in order
+ * to serialize with rpcrdma_register_frmr_external().
+ */
+static void
+frwr_op_reset(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   struct ib_device *device = r_xprt->rx_ia.ri_id->device;
+   unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
+   struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
+   struct rpcrdma_mw *r;
+   int rc;
+
+   list_for_each_entry(r, &buf->rb_all, mw_all) {
+   if (r->r.frmr.fr_state == FRMR_IS_INVALID)
+   continue;
+
+   __frwr_release(r);
+   rc = __frwr_init(r, pd, device, depth);
+   if (rc) {
+   dprintk("RPC:   %s: mw %p left %s\n",
+   __func__, r,
+   (r->r.frmr.fr_state == FRMR_IS_STALE ?
+   "stale" : "valid"));
+   continue;
+   }
+
+   r->r.frmr.fr_state = FRMR_IS_INVALID;
+   }
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
.ro_unmap   = frwr_op_unmap,
.ro_maxpages= frwr_op_maxpages,
.ro_init= frwr_op_init,
+   .ro_reset   = frwr_op_reset,
.ro_displayname = "frwr",
 };
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index c372051..e060713 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -59,10 +59,16 @@ physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
return 1;
 }
 
+static void
+physical_op_reset(struct rpcrdma_xprt *r_xprt)
+{
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
.ro_map = physical_op_map,
.ro_unmap   

[PATCH v2 14/15] xprtrdma: Handle non-SEND completions via a callout

2015-03-24 Thread Chuck Lever
Allow each memory registration mode to plug in a callout that handles
the completion of a memory registration operation.

Signed-off-by: Chuck Lever 
Reviewed-by: Sagi Grimberg 
---
 net/sunrpc/xprtrdma/frwr_ops.c  |   17 +
 net/sunrpc/xprtrdma/verbs.c |   16 ++--
 net/sunrpc/xprtrdma/xprt_rdma.h |5 +
 3 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index e17d54d..ea59c1b 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -117,6 +117,22 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
 }
 
+/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs to be reset. */
+static void
+frwr_sendcompletion(struct ib_wc *wc)
+{
+   struct rpcrdma_mw *r;
+
+   if (likely(wc->status == IB_WC_SUCCESS))
+   return;
+
+   /* WARNING: Only wr_id and status are reliable at this point */
+   r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+   dprintk("RPC:   %s: frmr %p (stale), status %d\n",
+   __func__, r, wc->status);
+   r->r.frmr.fr_state = FRMR_IS_STALE;
+}
+
 static int
 frwr_op_init(struct rpcrdma_xprt *r_xprt)
 {
@@ -148,6 +164,7 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
 
list_add(&r->mw_list, &buf->rb_mws);
list_add(&r->mw_all, &buf->rb_all);
+   r->mw_sendcompletion = frwr_sendcompletion;
}
 
return 0;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index b697b3e..cac06f2 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -186,7 +186,7 @@ static const char * const wc_status[] = {
"remote access error",
"remote operation error",
"transport retry counter exceeded",
-   "RNR retrycounter exceeded",
+   "RNR retry counter exceeded",
"local RDD violation error",
"remove invalid RD request",
"operation aborted",
@@ -204,21 +204,17 @@ static const char * const wc_status[] = {
 static void
 rpcrdma_sendcq_process_wc(struct ib_wc *wc)
 {
-   if (likely(wc->status == IB_WC_SUCCESS))
-   return;
-
/* WARNING: Only wr_id and status are reliable at this point */
-   if (wc->wr_id == 0ULL) {
-   if (wc->status != IB_WC_WR_FLUSH_ERR)
+   if (wc->wr_id == RPCRDMA_IGNORE_COMPLETION) {
+   if (wc->status != IB_WC_SUCCESS &&
+   wc->status != IB_WC_WR_FLUSH_ERR)
pr_err("RPC:   %s: SEND: %s\n",
   __func__, COMPLETION_MSG(wc->status));
} else {
struct rpcrdma_mw *r;
 
r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
-   r->r.frmr.fr_state = FRMR_IS_STALE;
-   pr_err("RPC:   %s: frmr %p (stale): %s\n",
-  __func__, r, COMPLETION_MSG(wc->status));
+   r->mw_sendcompletion(wc);
}
 }
 
@@ -1622,7 +1618,7 @@ rpcrdma_ep_post(struct rpcrdma_ia *ia,
}
 
send_wr.next = NULL;
-   send_wr.wr_id = 0ULL;   /* no send cookie */
+   send_wr.wr_id = RPCRDMA_IGNORE_COMPLETION;
send_wr.sg_list = req->rl_send_iov;
send_wr.num_sge = req->rl_niovs;
send_wr.opcode = IB_WR_SEND;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 9036fb4..54bcbe4 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -106,6 +106,10 @@ struct rpcrdma_ep {
 #define INIT_CQCOUNT(ep) atomic_set(&(ep)->rep_cqcount, (ep)->rep_cqinit)
 #define DECR_CQCOUNT(ep) atomic_sub_return(1, &(ep)->rep_cqcount)
 
+/* Force completion handler to ignore the signal
+ */
+#define RPCRDMA_IGNORE_COMPLETION  (0ULL)
+
 /* Registered buffer -- registered kmalloc'd memory for RDMA SEND/RECV
  *
  * The below structure appears at the front of a large region of kmalloc'd
@@ -206,6 +210,7 @@ struct rpcrdma_mw {
struct ib_fmr   *fmr;
struct rpcrdma_frmr frmr;
} r;
+   void(*mw_sendcompletion)(struct ib_wc *);
struct list_headmw_list;
struct list_headmw_all;
 };



[PATCH v2 12/15] xprtrdma: Add "destroy MRs" memreg op

2015-03-24 Thread Chuck Lever
Memory Region objects associated with a transport instance are
destroyed before the instance is shut down and destroyed.

Signed-off-by: Chuck Lever 
Reviewed-by: Sagi Grimberg 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   18 
 net/sunrpc/xprtrdma/frwr_ops.c |   14 ++
 net/sunrpc/xprtrdma/physical_ops.c |6 
 net/sunrpc/xprtrdma/verbs.c|   52 +---
 net/sunrpc/xprtrdma/xprt_rdma.h|1 +
 5 files changed, 40 insertions(+), 51 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 93261b0..e9ca594 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -168,11 +168,29 @@ fmr_op_reset(struct rpcrdma_xprt *r_xprt)
__func__, rc);
 }
 
+static void
+fmr_op_destroy(struct rpcrdma_buffer *buf)
+{
+   struct rpcrdma_mw *r;
+   int rc;
+
+   while (!list_empty(&buf->rb_all)) {
+   r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
+   list_del(&r->mw_all);
+   rc = ib_dealloc_fmr(r->r.fmr);
+   if (rc)
+   dprintk("RPC:   %s: ib_dealloc_fmr failed %i\n",
+   __func__, rc);
+   kfree(r);
+   }
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
.ro_unmap   = fmr_op_unmap,
.ro_maxpages= fmr_op_maxpages,
.ro_init= fmr_op_init,
.ro_reset   = fmr_op_reset,
+   .ro_destroy = fmr_op_destroy,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index c2bb29d..121e400 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -260,11 +260,25 @@ frwr_op_reset(struct rpcrdma_xprt *r_xprt)
}
 }
 
+static void
+frwr_op_destroy(struct rpcrdma_buffer *buf)
+{
+   struct rpcrdma_mw *r;
+
+   while (!list_empty(&buf->rb_all)) {
+   r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
+   list_del(&r->mw_all);
+   __frwr_release(r);
+   kfree(r);
+   }
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
.ro_unmap   = frwr_op_unmap,
.ro_maxpages= frwr_op_maxpages,
.ro_init= frwr_op_init,
.ro_reset   = frwr_op_reset,
+   .ro_destroy = frwr_op_destroy,
.ro_displayname = "frwr",
 };
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index e060713..eb39011 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -64,11 +64,17 @@ physical_op_reset(struct rpcrdma_xprt *r_xprt)
 {
 }
 
+static void
+physical_op_destroy(struct rpcrdma_buffer *buf)
+{
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
.ro_map = physical_op_map,
.ro_unmap   = physical_op_unmap,
.ro_maxpages= physical_op_maxpages,
.ro_init= physical_op_init,
.ro_reset   = physical_op_reset,
+   .ro_destroy = physical_op_destroy,
.ro_displayname = "physical",
 };
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 1b2c1f4..a7fb314 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1199,47 +1199,6 @@ rpcrdma_destroy_req(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
kfree(req);
 }
 
-static void
-rpcrdma_destroy_fmrs(struct rpcrdma_buffer *buf)
-{
-   struct rpcrdma_mw *r;
-   int rc;
-
-   while (!list_empty(&buf->rb_all)) {
-   r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
-   list_del(&r->mw_all);
-   list_del(&r->mw_list);
-
-   rc = ib_dealloc_fmr(r->r.fmr);
-   if (rc)
-   dprintk("RPC:   %s: ib_dealloc_fmr failed %i\n",
-   __func__, rc);
-
-   kfree(r);
-   }
-}
-
-static void
-rpcrdma_destroy_frmrs(struct rpcrdma_buffer *buf)
-{
-   struct rpcrdma_mw *r;
-   int rc;
-
-   while (!list_empty(&buf->rb_all)) {
-   r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
-   list_del(&r->mw_all);
-   list_del(&r->mw_list);
-
-   rc = ib_dereg_mr(r->r.frmr.fr_mr);
-

[PATCH v2 15/15] xprtrdma: Make rpcrdma_{un}map_one() into inline functions

2015-03-24 Thread Chuck Lever
These functions are called in a loop for each page transferred via
RDMA READ or WRITE. Extract loop invariants and inline them to
reduce CPU overhead.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   10 ++--
 net/sunrpc/xprtrdma/frwr_ops.c |   10 ++--
 net/sunrpc/xprtrdma/physical_ops.c |   10 ++--
 net/sunrpc/xprtrdma/verbs.c|   44 ++-
 net/sunrpc/xprtrdma/xprt_rdma.h|   45 ++--
 5 files changed, 73 insertions(+), 46 deletions(-)
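
The recurring pattern in this patch: the ib_device pointer and the DMA
direction are loop invariants, so they are computed once before the
per-segment loop instead of being re-derived from the rpcrdma_ia on
every iteration. A condensed sketch of the before/after shape, using
the rpcrdma_map_one() signatures visible in the hunks below:

        /* before: each iteration re-derives device and direction */
        for (i = 0; i < nsegs; i++)
                rpcrdma_map_one(ia, seg + i, writing);

        /* after: invariants hoisted out of the hot path */
        device = ia->ri_id->device;
        direction = rpcrdma_data_dir(writing);
        for (i = 0; i < nsegs; i++)
                rpcrdma_map_one(device, seg + i, direction);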

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index e8a9837..a91ba2c 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -85,6 +85,8 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
   int nsegs, bool writing)
 {
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct ib_device *device = ia->ri_id->device;
+   enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
struct rpcrdma_mw *mw = seg1->rl_mw;
u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
@@ -97,7 +99,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
if (nsegs > RPCRDMA_MAX_FMR_SGES)
nsegs = RPCRDMA_MAX_FMR_SGES;
for (i = 0; i < nsegs;) {
-   rpcrdma_map_one(ia, seg, writing);
+   rpcrdma_map_one(device, seg, direction);
physaddrs[i] = seg->mr_dma;
len += seg->mr_len;
++seg;
@@ -123,7 +125,7 @@ out_maperr:
__func__, len, (unsigned long long)seg1->mr_dma,
pageoff, i, rc);
while (i--)
-   rpcrdma_unmap_one(ia, --seg);
+   rpcrdma_unmap_one(device, --seg);
return rc;
 }
 
@@ -135,14 +137,16 @@ fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
 {
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mr_seg *seg1 = seg;
+   struct ib_device *device;
int rc, nsegs = seg->mr_nsegs;
LIST_HEAD(l);
 
list_add(&seg1->rl_mw->r.fmr->list, &l);
rc = ib_unmap_fmr(&l);
read_lock(&ia->ri_qplock);
+   device = ia->ri_id->device;
while (seg1->mr_nsegs--)
-   rpcrdma_unmap_one(ia, seg++);
+   rpcrdma_unmap_one(device, seg++);
read_unlock(&ia->ri_qplock);
if (rc)
goto out_err;
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index ea59c1b..0a7b9df 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -178,6 +178,8 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
int nsegs, bool writing)
 {
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct ib_device *device = ia->ri_id->device;
+   enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
struct rpcrdma_mw *mw = seg1->rl_mw;
struct rpcrdma_frmr *frmr = &mw->r.frmr;
@@ -197,7 +199,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
if (nsegs > ia->ri_max_frmr_depth)
nsegs = ia->ri_max_frmr_depth;
for (page_no = i = 0; i < nsegs;) {
-   rpcrdma_map_one(ia, seg, writing);
+   rpcrdma_map_one(device, seg, direction);
pa = seg->mr_dma;
for (seg_len = seg->mr_len; seg_len > 0; seg_len -= PAGE_SIZE) {
frmr->fr_pgl->page_list[page_no++] = pa;
@@ -247,7 +249,7 @@ out_senderr:
ib_update_fast_reg_key(mr, --key);
frmr->fr_state = FRMR_IS_INVALID;
while (i--)
-   rpcrdma_unmap_one(ia, --seg);
+   rpcrdma_unmap_one(device, --seg);
return rc;
 }
 
@@ -261,6 +263,7 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct ib_send_wr invalidate_wr, *bad_wr;
int rc, nsegs = seg->mr_nsegs;
+   struct ib_device *device;
 
seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
 
@@ -271,8 +274,9 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
DECR_CQCOUNT(&r_xprt->rx_ep);
 
read_lock(&ia->ri_qplock);
+   device = ia->ri_id->device;
while (seg1->mr_nsegs--)
-   rpcrdma_unmap_one(ia, seg++);
+   rpcrdma_unmap_one(device, seg++);
rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
read_unlock(&ia->ri_qplock);
if (rc)
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index 0ba130b..ba518af 100644
--- a/net/sunrpc/x

Re: [PATCH v2 00/15] NFS/RDMA patches proposed for 4.1

2015-03-26 Thread Chuck Lever

On Mar 26, 2015, at 1:39 PM, Anna Schumaker  wrote:

> Hey Chuck,
> 
> I didn't see anything that needs to be fixed up in these patches.  Are they 
> ready for me?

Thanks for the review. IMO we can go one of two routes:

 - Wait for HCA vendors to test this latest version of the series, or

 - Merge it now, and simply apply any needed fixes on top before the 4.1
window opens.

What do you prefer? Is it possible to get this series in front of the
zero-day test folks before you merge?


> Anna
> 
> On 03/24/2015 04:30 PM, Chuck Lever wrote:
>> This is a series of client-side patches for NFS/RDMA. In preparation
>> for increasing the transport credit limit and maximum rsize/wsize,
>> I've re-factored the memory registration logic into separate files,
>> invoked via a method API.
>> 
>> The two main optimizations in v1 of this series have been dropped.
>> Sagi Grimberg didn't like the complexity of the solution, and there
>> isn't enough time to rework it, test the new version, and get it
>> reviewed before the 4.1 merge window opens. I'm going to prepare
>> these for 4.2.
>> 
>> Fixes suggested by reviewers have been included before the
>> refactoring patches to make it easier to backport them to previous
>> kernels.
>> 
>> The series is available in the nfs-rdma-for-4.1 topic branch at
>> 
>> git://linux-nfs.org/projects/cel/cel-2.6.git
>> 
>> Changes since v1:
>> - Rebased on 4.0-rc5
>> - Main optimizations postponed to 4.2
>> - Addressed review comments from Anna, Sagi, and Devesh
>> 
>> ---
>> 
>> Chuck Lever (15):
>>  SUNRPC: Introduce missing well-known netids
>>  xprtrdma: Display IPv6 addresses and port numbers correctly
>>  xprtrdma: Perform a full marshal on retransmit
>>  xprtrdma: Byte-align FRWR registration
>>  xprtrdma: Prevent infinite loop in rpcrdma_ep_create()
>>  xprtrdma: Add vector of ops for each memory registration strategy
>>  xprtrdma: Add a "max_payload" op for each memreg mode
>>  xprtrdma: Add a "register_external" op for each memreg mode
>>  xprtrdma: Add a "deregister_external" op for each memreg mode
>>  xprtrdma: Add "init MRs" memreg op
>>  xprtrdma: Add "reset MRs" memreg op
>>  xprtrdma: Add "destroy MRs" memreg op
>>  xprtrdma: Add "open" memreg op
>>  xprtrdma: Handle non-SEND completions via a callout
>>  xprtrdma: Make rpcrdma_{un}map_one() into inline functions
>> 
>> 
>> include/linux/sunrpc/msg_prot.h|8 
>> net/sunrpc/xprtrdma/Makefile   |3 
>> net/sunrpc/xprtrdma/fmr_ops.c  |  208 +++
>> net/sunrpc/xprtrdma/frwr_ops.c |  353 ++
>> net/sunrpc/xprtrdma/physical_ops.c |   94 +
>> net/sunrpc/xprtrdma/rpc_rdma.c |   87 ++--
>> net/sunrpc/xprtrdma/transport.c|   61 ++-
>> net/sunrpc/xprtrdma/verbs.c|  699 +++-
>> net/sunrpc/xprtrdma/xprt_rdma.h|   90 -
>> 9 files changed, 882 insertions(+), 721 deletions(-)
>> create mode 100644 net/sunrpc/xprtrdma/fmr_ops.c
>> create mode 100644 net/sunrpc/xprtrdma/frwr_ops.c
>> create mode 100644 net/sunrpc/xprtrdma/physical_ops.c
>> 
>> --
>> Chuck Lever
>> 
> 

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





Re: [PATCH v2 00/15] NFS/RDMA patches proposed for 4.1

2015-03-27 Thread Chuck Lever

On Mar 27, 2015, at 12:44 AM, Devesh Sharma  wrote:

>> -Original Message-
>> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
>> ow...@vger.kernel.org] On Behalf Of Devesh Sharma
>> Sent: Friday, March 27, 2015 11:13 AM
>> To: Anna Schumaker; Chuck Lever; linux-rdma@vger.kernel.org; linux-
>> n...@vger.kernel.org
>> Subject: RE: [PATCH v2 00/15] NFS/RDMA patches proposed for 4.1
>> 
>> Hi Chuck,
>> 
>> I have validated these set of patches with ocrdma device, iozone passes with
>> these.
> 
> 
> Thanks to Meghna.

Hi Devesh-

Is there a Tested-by tag that Anna can add to these patches?


>> 
>> -Regards
>> Devesh
>> 
>>> -Original Message-
>>> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
>>> ow...@vger.kernel.org] On Behalf Of Anna Schumaker
>>> Sent: Friday, March 27, 2015 12:10 AM
>>> To: Chuck Lever; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
>>> Subject: Re: [PATCH v2 00/15] NFS/RDMA patches proposed for 4.1
>>> 
>>> Hey Chuck,
>>> 
>>> I didn't see anything that needs to be fixed up in these patches.  Are
>>> they ready for me?
>>> 
>>> Anna
>>> 
>>> On 03/24/2015 04:30 PM, Chuck Lever wrote:
>>>> This is a series of client-side patches for NFS/RDMA. In preparation
>>>> for increasing the transport credit limit and maximum rsize/wsize,
>>>> I've re-factored the memory registration logic into separate files,
>>>> invoked via a method API.
>>>> 
>>>> The two main optimizations in v1 of this series have been dropped.
>>>> Sagi Grimberg didn't like the complexity of the solution, and there
>>>> isn't enough time to rework it, test the new version, and get it
>>>> reviewed before the 4.1 merge window opens. I'm going to prepare
>>>> these for 4.2.
>>>> 
>>>> Fixes suggested by reviewers have been included before the
>>>> refactoring patches to make it easier to backport them to previous kernels.
>>>> 
>>>> The series is available in the nfs-rdma-for-4.1 topic branch at
>>>> 
>>>> git://linux-nfs.org/projects/cel/cel-2.6.git
>>>> 
>>>> Changes since v1:
>>>> - Rebased on 4.0-rc5
>>>> - Main optimizations postponed to 4.2
>>>> - Addressed review comments from Anna, Sagi, and Devesh
>>>> 
>>>> ---
>>>> 
>>>> Chuck Lever (15):
>>>>  SUNRPC: Introduce missing well-known netids
>>>>  xprtrdma: Display IPv6 addresses and port numbers correctly
>>>>  xprtrdma: Perform a full marshal on retransmit
>>>>  xprtrdma: Byte-align FRWR registration
>>>>  xprtrdma: Prevent infinite loop in rpcrdma_ep_create()
>>>>  xprtrdma: Add vector of ops for each memory registration strategy
>>>>  xprtrdma: Add a "max_payload" op for each memreg mode
>>>>  xprtrdma: Add a "register_external" op for each memreg mode
>>>>  xprtrdma: Add a "deregister_external" op for each memreg mode
>>>>  xprtrdma: Add "init MRs" memreg op
>>>>  xprtrdma: Add "reset MRs" memreg op
>>>>  xprtrdma: Add "destroy MRs" memreg op
>>>>  xprtrdma: Add "open" memreg op
>>>>  xprtrdma: Handle non-SEND completions via a callout
>>>>  xprtrdma: Make rpcrdma_{un}map_one() into inline functions
>>>> 
>>>> 
>>>> include/linux/sunrpc/msg_prot.h|8
>>>> net/sunrpc/xprtrdma/Makefile   |3
>>>> net/sunrpc/xprtrdma/fmr_ops.c  |  208 +++
>>>> net/sunrpc/xprtrdma/frwr_ops.c |  353 ++
>>>> net/sunrpc/xprtrdma/physical_ops.c |   94 +
>>>> net/sunrpc/xprtrdma/rpc_rdma.c |   87 ++--
>>>> net/sunrpc/xprtrdma/transport.c|   61 ++-
>>>> net/sunrpc/xprtrdma/verbs.c|  699 +++-
>>>> net/sunrpc/xprtrdma/xprt_rdma.h|   90 -
>>>> 9 files changed, 882 insertions(+), 721 deletions(-)  create mode
>>>> 100644 net/sunrpc/xprtrdma/fmr_ops.c  create mode 100644
>>>> net/sunrpc/xprtrdma/frwr_ops.c  create mode 100644
>>>> net/sunrpc/xprtrdma/physical_ops.c
>>>> 
>>>> --
>>>> Chuck Lever
>>> 

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





Re: [PATCH v2 00/15] NFS/RDMA patches proposed for 4.1

2015-03-30 Thread Chuck Lever

On Mar 30, 2015, at 10:18 AM, Steve Wise  wrote:

> Hey Chuck,
> 
> Chelsio's QA regression tested this series on iw_cxgb4.  Tests out good.
> 
> Tests ran: spew, ffsb, xdd, fio, dbench, and cthon with both v3 and v4.

Thanks, Steve. Who should I credit in the Tested-by tag?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





[PATCH v3 00/15] NFS/RDMA patches proposed for 4.1

2015-03-30 Thread Chuck Lever
This is a series of client-side patches for NFS/RDMA. In preparation
for increasing the transport credit limit and maximum rsize/wsize,
I've re-factored the memory registration logic into separate files,
invoked via a method API.

The series is available in the nfs-rdma-for-4.1 topic branch at

git://linux-nfs.org/projects/cel/cel-2.6.git

Changes since v2:
- Rebased on 4.0-rc6
- One minor fix squashed into 01/15
- Tested-by tags added

Changes since v1:
- Rebased on 4.0-rc5
- Main optimizations postponed to 4.2
- Addressed review comments from Anna, Sagi, and Devesh

---

Chuck Lever (15):
  SUNRPC: Introduce missing well-known netids
  xprtrdma: Display IPv6 addresses and port numbers correctly
  xprtrdma: Perform a full marshal on retransmit
  xprtrdma: Byte-align FRWR registration
  xprtrdma: Prevent infinite loop in rpcrdma_ep_create()
  xprtrdma: Add vector of ops for each memory registration strategy
  xprtrdma: Add a "max_payload" op for each memreg mode
  xprtrdma: Add a "register_external" op for each memreg mode
  xprtrdma: Add a "deregister_external" op for each memreg mode
  xprtrdma: Add "init MRs" memreg op
  xprtrdma: Add "reset MRs" memreg op
  xprtrdma: Add "destroy MRs" memreg op
  xprtrdma: Add "open" memreg op
  xprtrdma: Handle non-SEND completions via a callout
  xprtrdma: Make rpcrdma_{un}map_one() into inline functions


 include/linux/sunrpc/msg_prot.h|8 
 include/linux/sunrpc/xprtrdma.h|5 
 net/sunrpc/xprtrdma/Makefile   |3 
 net/sunrpc/xprtrdma/fmr_ops.c  |  208 +++
 net/sunrpc/xprtrdma/frwr_ops.c |  353 ++
 net/sunrpc/xprtrdma/physical_ops.c |   94 +
 net/sunrpc/xprtrdma/rpc_rdma.c |   87 ++--
 net/sunrpc/xprtrdma/transport.c|   61 ++-
 net/sunrpc/xprtrdma/verbs.c|  699 +++-
 net/sunrpc/xprtrdma/xprt_rdma.h|   90 -
 10 files changed, 882 insertions(+), 726 deletions(-)
 create mode 100644 net/sunrpc/xprtrdma/fmr_ops.c
 create mode 100644 net/sunrpc/xprtrdma/frwr_ops.c
 create mode 100644 net/sunrpc/xprtrdma/physical_ops.c

--
Chuck Lever


[PATCH v3 02/15] xprtrdma: Display IPv6 addresses and port numbers correctly

2015-03-30 Thread Chuck Lever
Signed-off-by: Chuck Lever 
Reviewed-by: Sagi Grimberg 
Tested-by: Devesh Sharma 
Tested-by: Meghana Cheripady 
Tested-by: Veeresh U. Kokatnur 
---
 net/sunrpc/xprtrdma/transport.c |   47 ---
 net/sunrpc/xprtrdma/verbs.c |   21 +++--
 2 files changed, 47 insertions(+), 21 deletions(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 2e192ba..9be7f97 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -157,12 +157,47 @@ static struct ctl_table sunrpc_table[] = {
 static struct rpc_xprt_ops xprt_rdma_procs;/* forward reference */
 
 static void
+xprt_rdma_format_addresses4(struct rpc_xprt *xprt, struct sockaddr *sap)
+{
+   struct sockaddr_in *sin = (struct sockaddr_in *)sap;
+   char buf[20];
+
+   snprintf(buf, sizeof(buf), "%08x", ntohl(sin->sin_addr.s_addr));
+   xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);
+
+   xprt->address_strings[RPC_DISPLAY_NETID] = RPCBIND_NETID_RDMA;
+}
+
+static void
+xprt_rdma_format_addresses6(struct rpc_xprt *xprt, struct sockaddr *sap)
+{
+   struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)sap;
+   char buf[40];
+
+   snprintf(buf, sizeof(buf), "%pi6", &sin6->sin6_addr);
+   xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);
+
+   xprt->address_strings[RPC_DISPLAY_NETID] = RPCBIND_NETID_RDMA6;
+}
+
+static void
 xprt_rdma_format_addresses(struct rpc_xprt *xprt)
 {
struct sockaddr *sap = (struct sockaddr *)
&rpcx_to_rdmad(xprt).addr;
-   struct sockaddr_in *sin = (struct sockaddr_in *)sap;
-   char buf[64];
+   char buf[128];
+
+   switch (sap->sa_family) {
+   case AF_INET:
+   xprt_rdma_format_addresses4(xprt, sap);
+   break;
+   case AF_INET6:
+   xprt_rdma_format_addresses6(xprt, sap);
+   break;
+   default:
+   pr_err("rpcrdma: Unrecognized address family\n");
+   return;
+   }
 
(void)rpc_ntop(sap, buf, sizeof(buf));
xprt->address_strings[RPC_DISPLAY_ADDR] = kstrdup(buf, GFP_KERNEL);
@@ -170,16 +205,10 @@ xprt_rdma_format_addresses(struct rpc_xprt *xprt)
snprintf(buf, sizeof(buf), "%u", rpc_get_port(sap));
xprt->address_strings[RPC_DISPLAY_PORT] = kstrdup(buf, GFP_KERNEL);
 
-   xprt->address_strings[RPC_DISPLAY_PROTO] = "rdma";
-
-   snprintf(buf, sizeof(buf), "%08x", ntohl(sin->sin_addr.s_addr));
-   xprt->address_strings[RPC_DISPLAY_HEX_ADDR] = kstrdup(buf, GFP_KERNEL);
-
snprintf(buf, sizeof(buf), "%4hx", rpc_get_port(sap));
xprt->address_strings[RPC_DISPLAY_HEX_PORT] = kstrdup(buf, GFP_KERNEL);
 
-   /* netid */
-   xprt->address_strings[RPC_DISPLAY_NETID] = "rdma";
+   xprt->address_strings[RPC_DISPLAY_PROTO] = "rdma";
 }
 
 static void
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 124676c..1aa55b7 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "xprt_rdma.h"
@@ -424,7 +425,7 @@ rpcrdma_conn_upcall(struct rdma_cm_id *id, struct rdma_cm_event *event)
struct rpcrdma_ia *ia = &xprt->rx_ia;
struct rpcrdma_ep *ep = &xprt->rx_ep;
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
-   struct sockaddr_in *addr = (struct sockaddr_in *) &ep->rep_remote_addr;
+   struct sockaddr *sap = (struct sockaddr *)&ep->rep_remote_addr;
 #endif
struct ib_qp_attr *attr = &ia->ri_qp_attr;
struct ib_qp_init_attr *iattr = &ia->ri_qp_init_attr;
@@ -480,9 +481,8 @@ connected:
wake_up_all(&ep->rep_connect_wait);
/*FALLTHROUGH*/
default:
-   dprintk("RPC:   %s: %pI4:%u (ep 0x%p): %s\n",
-   __func__, &addr->sin_addr.s_addr,
-   ntohs(addr->sin_port), ep,
+   dprintk("RPC:   %s: %pIS:%u (ep 0x%p): %s\n",
+   __func__, sap, rpc_get_port(sap), ep,
CONNECTION_MSG(event->event));
break;
}
@@ -491,19 +491,16 @@ connected:
if (connstate == 1) {
int ird = attr->max_dest_rd_atomic;
int tird = ep->rep_remote_cma.responder_resources;
-   printk(KERN_INFO "rpcrdma: connection to %pI4:%u "
-   "on %s, memreg %d slots %d ird %d%s\n",
-   &addr->sin_addr.s_addr,
-   ntohs(addr->sin_port),
+
+   pr_info("rpcrdma: connection to %pIS:%u

[PATCH v3 01/15] SUNRPC: Introduce missing well-known netids

2015-03-30 Thread Chuck Lever
Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/msg_prot.h |8 +++-
 include/linux/sunrpc/xprtrdma.h |5 -
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/linux/sunrpc/msg_prot.h b/include/linux/sunrpc/msg_prot.h
index aadc6a0..8073713 100644
--- a/include/linux/sunrpc/msg_prot.h
+++ b/include/linux/sunrpc/msg_prot.h
@@ -142,12 +142,18 @@ typedef __be32rpc_fraghdr;
(RPC_REPHDRSIZE + (2 + RPC_MAX_AUTH_SIZE/4))
 
 /*
- * RFC1833/RFC3530 rpcbind (v3+) well-known netid's.
+ * Well-known netids. See:
+ *
+ *   http://www.iana.org/assignments/rpc-netids/rpc-netids.xhtml
  */
 #define RPCBIND_NETID_UDP  "udp"
 #define RPCBIND_NETID_TCP  "tcp"
+#define RPCBIND_NETID_RDMA "rdma"
+#define RPCBIND_NETID_SCTP "sctp"
 #define RPCBIND_NETID_UDP6 "udp6"
 #define RPCBIND_NETID_TCP6 "tcp6"
+#define RPCBIND_NETID_RDMA6"rdma6"
+#define RPCBIND_NETID_SCTP6"sctp6"
 #define RPCBIND_NETID_LOCAL"local"
 
 /*
diff --git a/include/linux/sunrpc/xprtrdma.h b/include/linux/sunrpc/xprtrdma.h
index 64a0a0a..c984c85 100644
--- a/include/linux/sunrpc/xprtrdma.h
+++ b/include/linux/sunrpc/xprtrdma.h
@@ -41,11 +41,6 @@
 #define _LINUX_SUNRPC_XPRTRDMA_H
 
 /*
- * rpcbind (v3+) RDMA netid.
- */
-#define RPCBIND_NETID_RDMA "rdma"
-
-/*
  * Constants. Max RPC/NFS header is big enough to account for
  * additional marshaling buffers passed down by Linux client.
  *



[PATCH v3 05/15] xprtrdma: Prevent infinite loop in rpcrdma_ep_create()

2015-03-30 Thread Chuck Lever
If a provider advertises a zero max_fast_reg_page_list_len, FRWR
depth detection loops forever. Instead of just failing the mount,
try other memory registration modes.

Fixes: 0fc6c4e7bb28 ("xprtrdma: mind the device's max fast . . .")
Reported-by: Devesh Sharma 
Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
Tested-by: Meghana Cheripady 
Tested-by: Veeresh U. Kokatnur 
---
 net/sunrpc/xprtrdma/verbs.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
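
To see the hang: the FRWR depth calculation subtracts the device's
maximum page list depth from the number of segments still to be
covered. A sketch of the loop shape (it appears in full in
frwr_op_open() later in this series):

        delta = RPCRDMA_MAX_DATA_SEGS - ia->ri_max_frmr_depth;
        do {
                depth += 2;                     /* FRMR reg + invalidate */
                delta -= ia->ri_max_frmr_depth; /* subtracts zero here */
        } while (delta > 0);

With max_fast_reg_page_list_len == 0, delta never decreases, so the
fix below falls back to FMR before the loop can be entered at all.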

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 60f3317..99752b5 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -618,9 +618,10 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
 
if (memreg == RPCRDMA_FRMR) {
/* Requires both frmr reg and local dma lkey */
-   if ((devattr->device_cap_flags &
+   if (((devattr->device_cap_flags &
 (IB_DEVICE_MEM_MGT_EXTENSIONS|IB_DEVICE_LOCAL_DMA_LKEY)) !=
-   (IB_DEVICE_MEM_MGT_EXTENSIONS|IB_DEVICE_LOCAL_DMA_LKEY)) {
+   (IB_DEVICE_MEM_MGT_EXTENSIONS|IB_DEVICE_LOCAL_DMA_LKEY)) ||
+ (devattr->max_fast_reg_page_list_len == 0)) {
dprintk("RPC:   %s: FRMR registration "
"not supported by HCA\n", __func__);
memreg = RPCRDMA_MTHCAFMR;



[PATCH v3 03/15] xprtrdma: Perform a full marshal on retransmit

2015-03-30 Thread Chuck Lever
Commit 6ab59945f292 ("xprtrdma: Update rkeys after transport
reconnect") added logic in the ->send_request path to update the
chunk list when an RPC/RDMA request is retransmitted.

Note that rpc_xdr_encode() resets and re-encodes the entire RPC
send buffer for each retransmit of an RPC. The RPC send buffer
is not preserved from the previous transmission of an RPC.

Revert 6ab59945f292, and instead, just force each request to be
fully marshaled every time through ->send_request. This should
preserve the fix from 6ab59945f292, while also performing pullup
during retransmits.

Signed-off-by: Chuck Lever 
Acked-by: Sagi Grimberg 
Tested-by: Devesh Sharma 
Tested-by: Meghana Cheripady 
Tested-by: Veeresh U. Kokatnur 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |   71 ++-
 net/sunrpc/xprtrdma/transport.c |5 +--
 net/sunrpc/xprtrdma/xprt_rdma.h |   10 -
 3 files changed, 34 insertions(+), 52 deletions(-)
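
The net effect in ->send_request (the transport.c hunk is abbreviated
below): the rpcrdma_marshal_chunks() special case for retransmits goes
away, and every pass performs the full marshal. A sketch of the
resulting shape:

        rc = rpcrdma_marshal_req(rqst);
        if (rc < 0)
                goto failed_marshal;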

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 91ffde8..41456d9 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -53,6 +53,14 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+enum rpcrdma_chunktype {
+   rpcrdma_noch = 0,
+   rpcrdma_readch,
+   rpcrdma_areadch,
+   rpcrdma_writech,
+   rpcrdma_replych
+};
+
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
 static const char transfertypes[][12] = {
"pure inline",  /* no chunks */
@@ -284,28 +292,6 @@ out:
 }
 
 /*
- * Marshal chunks. This routine returns the header length
- * consumed by marshaling.
- *
- * Returns positive RPC/RDMA header size, or negative errno.
- */
-
-ssize_t
-rpcrdma_marshal_chunks(struct rpc_rqst *rqst, ssize_t result)
-{
-   struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
-   struct rpcrdma_msg *headerp = rdmab_to_msg(req->rl_rdmabuf);
-
-   if (req->rl_rtype != rpcrdma_noch)
-   result = rpcrdma_create_chunks(rqst, &rqst->rq_snd_buf,
-  headerp, req->rl_rtype);
-   else if (req->rl_wtype != rpcrdma_noch)
-   result = rpcrdma_create_chunks(rqst, &rqst->rq_rcv_buf,
-  headerp, req->rl_wtype);
-   return result;
-}
-
-/*
  * Copy write data inline.
  * This function is used for "small" requests. Data which is passed
  * to RPC via iovecs (or page list) is copied directly into the
@@ -397,6 +383,7 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
char *base;
size_t rpclen, padlen;
ssize_t hdrlen;
+   enum rpcrdma_chunktype rtype, wtype;
struct rpcrdma_msg *headerp;
 
/*
@@ -433,13 +420,13 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 * into pages; otherwise use reply chunks.
 */
if (rqst->rq_rcv_buf.buflen <= RPCRDMA_INLINE_READ_THRESHOLD(rqst))
-   req->rl_wtype = rpcrdma_noch;
+   wtype = rpcrdma_noch;
else if (rqst->rq_rcv_buf.page_len == 0)
-   req->rl_wtype = rpcrdma_replych;
+   wtype = rpcrdma_replych;
else if (rqst->rq_rcv_buf.flags & XDRBUF_READ)
-   req->rl_wtype = rpcrdma_writech;
+   wtype = rpcrdma_writech;
else
-   req->rl_wtype = rpcrdma_replych;
+   wtype = rpcrdma_replych;
 
/*
 * Chunks needed for arguments?
@@ -456,16 +443,16 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 * TBD check NFSv4 setacl
 */
if (rqst->rq_snd_buf.len <= RPCRDMA_INLINE_WRITE_THRESHOLD(rqst))
-   req->rl_rtype = rpcrdma_noch;
+   rtype = rpcrdma_noch;
else if (rqst->rq_snd_buf.page_len == 0)
-   req->rl_rtype = rpcrdma_areadch;
+   rtype = rpcrdma_areadch;
else
-   req->rl_rtype = rpcrdma_readch;
+   rtype = rpcrdma_readch;
 
/* The following simplification is not true forever */
-   if (req->rl_rtype != rpcrdma_noch && req->rl_wtype == rpcrdma_replych)
-   req->rl_wtype = rpcrdma_noch;
-   if (req->rl_rtype != rpcrdma_noch && req->rl_wtype != rpcrdma_noch) {
+   if (rtype != rpcrdma_noch && wtype == rpcrdma_replych)
+   wtype = rpcrdma_noch;
+   if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
dprintk("RPC:   %s: cannot marshal multiple chunk lists\n",
__func__);
return -EIO;
@@ -479,7 +466,7 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
 * When padding is in use and applies to the transfer, insert
 * it and change the message type.
 */
-   if (req->rl_rtype == rpcrdma_noch) {
+   if (rtype == rpcrdma_noch) {
 
padlen = rpcrdma_inline_pullup(rqst,

[PATCH v3 06/15] xprtrdma: Add vector of ops for each memory registration strategy

2015-03-30 Thread Chuck Lever
Instead of employing switch() statements, let's use the typical
Linux kernel idiom for handling behavioral variation: virtual
functions.

Start by defining a vector of operations for each supported memory
registration mode, and by adding a source file for each mode.

Signed-off-by: Chuck Lever 
Reviewed-by: Sagi Grimberg 
Tested-by: Devesh Sharma 
Tested-by: Meghana Cheripady 
Tested-by: Veeresh U. Kokatnur 
---
 net/sunrpc/xprtrdma/Makefile   |3 ++-
 net/sunrpc/xprtrdma/fmr_ops.c  |   22 ++
 net/sunrpc/xprtrdma/frwr_ops.c |   22 ++
 net/sunrpc/xprtrdma/physical_ops.c |   24 
 net/sunrpc/xprtrdma/verbs.c|   11 +++
 net/sunrpc/xprtrdma/xprt_rdma.h|   12 
 6 files changed, 89 insertions(+), 5 deletions(-)
 create mode 100644 net/sunrpc/xprtrdma/fmr_ops.c
 create mode 100644 net/sunrpc/xprtrdma/frwr_ops.c
 create mode 100644 net/sunrpc/xprtrdma/physical_ops.c
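
For reference, a condensed sketch of the vector being introduced.
This patch adds only ro_displayname; the method fields shown here are
assumptions based on the ops added later in the series:

        struct rpcrdma_memreg_ops {
                int     (*ro_map)(struct rpcrdma_xprt *,
                                  struct rpcrdma_mr_seg *, int, bool);
                size_t  (*ro_maxpages)(struct rpcrdma_xprt *);
                const char *ro_displayname;
        };

Callers then dispatch through ia->ri_ops (for example,
ia->ri_ops->ro_maxpages(r_xprt)) instead of switching on
ia->ri_memreg_strategy.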

diff --git a/net/sunrpc/xprtrdma/Makefile b/net/sunrpc/xprtrdma/Makefile
index da5136f..579f72b 100644
--- a/net/sunrpc/xprtrdma/Makefile
+++ b/net/sunrpc/xprtrdma/Makefile
@@ -1,6 +1,7 @@
 obj-$(CONFIG_SUNRPC_XPRT_RDMA_CLIENT) += xprtrdma.o
 
-xprtrdma-y := transport.o rpc_rdma.o verbs.o
+xprtrdma-y := transport.o rpc_rdma.o verbs.o \
+   fmr_ops.o frwr_ops.o physical_ops.o
 
 obj-$(CONFIG_SUNRPC_XPRT_RDMA_SERVER) += svcrdma.o
 
diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
new file mode 100644
index 000..ffb7d93
--- /dev/null
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -0,0 +1,22 @@
+/*
+ * Copyright (c) 2015 Oracle.  All rights reserved.
+ * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
+ */
+
+/* Lightweight memory registration using Fast Memory Regions (FMR).
+ * Referred to sometimes as MTHCAFMR mode.
+ *
+ * FMR uses synchronous memory registration and deregistration.
+ * FMR registration is known to be fast, but FMR deregistration
+ * can take tens of usecs to complete.
+ */
+
+#include "xprt_rdma.h"
+
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+# define RPCDBG_FACILITY   RPCDBG_TRANS
+#endif
+
+const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
+   .ro_displayname = "fmr",
+};
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
new file mode 100644
index 000..79173f9
--- /dev/null
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -0,0 +1,22 @@
+/*
+ * Copyright (c) 2015 Oracle.  All rights reserved.
+ * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
+ */
+
+/* Lightweight memory registration using Fast Registration Work
+ * Requests (FRWR). Also referred to sometimes as FRMR mode.
+ *
+ * FRWR features ordered asynchronous registration and deregistration
+ * of arbitrarily sized memory regions. This is the fastest and safest
+ * but most complex memory registration mode.
+ */
+
+#include "xprt_rdma.h"
+
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+# define RPCDBG_FACILITY   RPCDBG_TRANS
+#endif
+
+const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
+   .ro_displayname = "frwr",
+};
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
new file mode 100644
index 000..b0922ac
--- /dev/null
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -0,0 +1,24 @@
+/*
+ * Copyright (c) 2015 Oracle.  All rights reserved.
+ * Copyright (c) 2003-2007 Network Appliance, Inc. All rights reserved.
+ */
+
+/* No-op chunk preparation. All client memory is pre-registered.
+ * Sometimes referred to as ALLPHYSICAL mode.
+ *
+ * Physical registration is simple because all client memory is
+ * pre-registered and never deregistered. This mode is good for
+ * adapter bring up, but is considered not safe: the server is
+ * trusted not to abuse its access to client memory not involved
+ * in RDMA I/O.
+ */
+
+#include "xprt_rdma.h"
+
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+# define RPCDBG_FACILITY   RPCDBG_TRANS
+#endif
+
+const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
+   .ro_displayname = "physical",
+};
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 99752b5..c3319e1 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -492,10 +492,10 @@ connected:
int ird = attr->max_dest_rd_atomic;
int tird = ep->rep_remote_cma.responder_resources;
 
-   pr_info("rpcrdma: connection to %pIS:%u on %s, memreg %d slots 
%d ird %d%s\n",
+   pr_info("rpcrdma: connection to %pIS:%u on %s, memreg '%s', %d 
credits, %d responders%s\n",
sap, rpc_get_port(sap),
ia->ri_id->device->name,
-   ia->ri_memreg_strategy,
+   ia->ri_ops->ro_displayname

[PATCH v3 04/15] xprtrdma: Byte-align FRWR registration

2015-03-30 Thread Chuck Lever
The RPC/RDMA transport's FRWR registration logic registers whole
pages. This means areas in the first and last pages that are not
involved in the RDMA I/O are needlessly exposed to the server.

Buffered I/O is typically page-aligned, so not a problem there. But
for direct I/O, which can be byte-aligned, and for reply chunks,
which are nearly always smaller than a page, the transport could
expose memory outside the I/O buffer.

FRWR allows byte-aligned memory registration, so let's use it as
it was intended.

Reported-by: Sagi Grimberg 
Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
Tested-by: Meghana Cheripady 
Tested-by: Veeresh U. Kokatnur 
---
 net/sunrpc/xprtrdma/verbs.c |   12 
 1 file changed, 4 insertions(+), 8 deletions(-)
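
A concrete example of the exposure being closed: registering a
100-byte reply chunk that starts 200 bytes into a page. Previously
iova_start was seg1->mr_dma (the page base) and length was
page_no << PAGE_SHIFT, i.e. the whole 4096-byte page. With this patch
iova_start becomes seg1->mr_dma + pageoff and length becomes len, so
the server can reach exactly the 100 bytes involved in the I/O.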

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 1aa55b7..60f3317 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1924,23 +1924,19 @@ rpcrdma_register_frmr_external(struct rpcrdma_mr_seg *seg,
offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
break;
}
-   dprintk("RPC:   %s: Using frmr %p to map %d segments\n",
-   __func__, mw, i);
+   dprintk("RPC:   %s: Using frmr %p to map %d segments (%d bytes)\n",
+   __func__, mw, i, len);
 
frmr->fr_state = FRMR_IS_VALID;
 
memset(&fastreg_wr, 0, sizeof(fastreg_wr));
fastreg_wr.wr_id = (unsigned long)(void *)mw;
fastreg_wr.opcode = IB_WR_FAST_REG_MR;
-   fastreg_wr.wr.fast_reg.iova_start = seg1->mr_dma;
+   fastreg_wr.wr.fast_reg.iova_start = seg1->mr_dma + pageoff;
fastreg_wr.wr.fast_reg.page_list = frmr->fr_pgl;
fastreg_wr.wr.fast_reg.page_list_len = page_no;
fastreg_wr.wr.fast_reg.page_shift = PAGE_SHIFT;
-   fastreg_wr.wr.fast_reg.length = page_no << PAGE_SHIFT;
-   if (fastreg_wr.wr.fast_reg.length < len) {
-   rc = -EIO;
-   goto out_err;
-   }
+   fastreg_wr.wr.fast_reg.length = len;
 
/* Bump the key */
key = (u8)(mr->rkey & 0x00FF);



[PATCH v3 08/15] xprtrdma: Add a "register_external" op for each memreg mode

2015-03-30 Thread Chuck Lever
There is very little common processing among the different external
memory registration functions. Have rpcrdma_create_chunks() call
the registration method directly. This removes a stack frame and a
switch statement from the external registration path.

Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
Tested-by: Meghana Cheripady 
Tested-by: Veeresh U. Kokatnur 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   51 +++
 net/sunrpc/xprtrdma/frwr_ops.c |   82 ++
 net/sunrpc/xprtrdma/physical_ops.c |   17 
 net/sunrpc/xprtrdma/rpc_rdma.c |5 +
 net/sunrpc/xprtrdma/verbs.c|  168 +---
 net/sunrpc/xprtrdma/xprt_rdma.h|6 +
 6 files changed, 160 insertions(+), 169 deletions(-)
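
The rpc_rdma.c side of this change is abbreviated below; the
registration path reduces to a single indirect call. A sketch of the
expected shape (ro_map returns the number of segments registered, or a
negative errno):

        n = r_xprt->rx_ia.ri_ops->ro_map(r_xprt, seg, nsegs, writing);
        if (n <= 0)
                goto out;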

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index eec2660..45fb646 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -29,7 +29,58 @@ fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * RPCRDMA_MAX_FMR_SGES);
 }
 
+/* Use the ib_map_phys_fmr() verb to register a memory region
+ * for remote access via RDMA READ or RDMA WRITE.
+ */
+static int
+fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
+  int nsegs, bool writing)
+{
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_mr_seg *seg1 = seg;
+   struct rpcrdma_mw *mw = seg1->rl_mw;
+   u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
+   int len, pageoff, i, rc;
+
+   pageoff = offset_in_page(seg1->mr_offset);
+   seg1->mr_offset -= pageoff; /* start of page */
+   seg1->mr_len += pageoff;
+   len = -pageoff;
+   if (nsegs > RPCRDMA_MAX_FMR_SGES)
+   nsegs = RPCRDMA_MAX_FMR_SGES;
+   for (i = 0; i < nsegs;) {
+   rpcrdma_map_one(ia, seg, writing);
+   physaddrs[i] = seg->mr_dma;
+   len += seg->mr_len;
+   ++seg;
+   ++i;
+   /* Check for holes */
+   if ((i < nsegs && offset_in_page(seg->mr_offset)) ||
+   offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
+   break;
+   }
+
+   rc = ib_map_phys_fmr(mw->r.fmr, physaddrs, i, seg1->mr_dma);
+   if (rc)
+   goto out_maperr;
+
+   seg1->mr_rkey = mw->r.fmr->rkey;
+   seg1->mr_base = seg1->mr_dma + pageoff;
+   seg1->mr_nsegs = i;
+   seg1->mr_len = len;
+   return i;
+
+out_maperr:
+   dprintk("RPC:   %s: ib_map_phys_fmr %u@0x%llx+%i (%d) status %i\n",
+   __func__, len, (unsigned long long)seg1->mr_dma,
+   pageoff, i, rc);
+   while (i--)
+   rpcrdma_unmap_one(ia, --seg);
+   return rc;
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
+   .ro_map = fmr_op_map,
.ro_maxpages= fmr_op_maxpages,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 73a5ac8..23e4d99 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -29,7 +29,89 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
 }
 
+/* Post a FAST_REG Work Request to register a memory region
+ * for remote access via RDMA READ or RDMA WRITE.
+ */
+static int
+frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
+   int nsegs, bool writing)
+{
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_mr_seg *seg1 = seg;
+   struct rpcrdma_mw *mw = seg1->rl_mw;
+   struct rpcrdma_frmr *frmr = &mw->r.frmr;
+   struct ib_mr *mr = frmr->fr_mr;
+   struct ib_send_wr fastreg_wr, *bad_wr;
+   u8 key;
+   int len, pageoff;
+   int i, rc;
+   int seg_len;
+   u64 pa;
+   int page_no;
+
+   pageoff = offset_in_page(seg1->mr_offset);
+   seg1->mr_offset -= pageoff; /* start of page */
+   seg1->mr_len += pageoff;
+   len = -pageoff;
+   if (nsegs > ia->ri_max_frmr_depth)
+   nsegs = ia->ri_max_frmr_depth;
+   for (page_no = i = 0; i < nsegs;) {
+   rpcrdma_map_one(ia, seg, writing);
+   pa = seg->mr_dma;
+   for (seg_len = seg->mr_len; seg_len > 0; seg_len -= PAGE_SIZE) {
+   frmr->fr_pgl->page_list[page_no++] = pa;
+   pa += PAGE_SIZE;
+   }
+   len += seg->mr_len;
+   ++seg;
+   ++i;
+   /* Check for holes */
+   if ((i < nsegs && offset_in_page(seg->mr_offset)) ||
+   offset_in_page((seg-1)->mr_offs

[PATCH v3 07/15] xprtrdma: Add a "max_payload" op for each memreg mode

2015-03-30 Thread Chuck Lever
The max_payload computation is generalized to ensure that the
payload maximum is the lesser of RPCRDMA_MAX_DATA_SEGS and the number of
data segments that can be transmitted in an inline buffer.

Signed-off-by: Chuck Lever 
Reviewed-by: Sagi Grimberg 
Tested-by: Devesh Sharma 
Tested-by: Meghana Cheripady 
Tested-by: Veeresh U. Kokatnur 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   13 ++
 net/sunrpc/xprtrdma/frwr_ops.c |   13 ++
 net/sunrpc/xprtrdma/physical_ops.c |   10 +++
 net/sunrpc/xprtrdma/transport.c|5 +++-
 net/sunrpc/xprtrdma/verbs.c|   49 +++-
 net/sunrpc/xprtrdma/xprt_rdma.h|5 +++-
 6 files changed, 59 insertions(+), 36 deletions(-)
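
A worked example of the generalized computation, using the
rpcrdma_max_segments() helper shown below and assuming 1024-byte
inline buffers, an RPCRDMA_HDRLEN_MIN of 28 bytes, and a 16-byte
chunk segment descriptor: (1024 - 28) / 16 = 62 segments fit in the
inline buffer. For FMR, where each segment conveys up to
RPCRDMA_MAX_FMR_SGES (64) pages, maxpages is
min(RPCRDMA_MAX_DATA_SEGS, 62 * 64) = RPCRDMA_MAX_DATA_SEGS, and
xprt_setup_rdma() then reports max_payload = maxpages << PAGE_SHIFT.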

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index ffb7d93..eec2660 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -17,6 +17,19 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+/* Maximum scatter/gather per FMR */
+#define RPCRDMA_MAX_FMR_SGES   (64)
+
+/* FMR mode conveys up to 64 pages of payload per chunk segment.
+ */
+static size_t
+fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
+{
+   return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
+rpcrdma_max_segments(r_xprt) * RPCRDMA_MAX_FMR_SGES);
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
+   .ro_maxpages= fmr_op_maxpages,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 79173f9..73a5ac8 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -17,6 +17,19 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+/* FRWR mode conveys a list of pages per chunk segment. The
+ * maximum length of that list is the FRWR page list depth.
+ */
+static size_t
+frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+
+   return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
+rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
+   .ro_maxpages= frwr_op_maxpages,
.ro_displayname = "frwr",
 };
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index b0922ac..28ade19 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -19,6 +19,16 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+/* PHYSICAL memory registration conveys one page per chunk segment.
+ */
+static size_t
+physical_op_maxpages(struct rpcrdma_xprt *r_xprt)
+{
+   return min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
+rpcrdma_max_segments(r_xprt));
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
+   .ro_maxpages= physical_op_maxpages,
.ro_displayname = "physical",
 };
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 97f6562..da71a24 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -406,7 +406,10 @@ xprt_setup_rdma(struct xprt_create *args)
  xprt_rdma_connect_worker);
 
xprt_rdma_format_addresses(xprt);
-   xprt->max_payload = rpcrdma_max_payload(new_xprt);
+   xprt->max_payload = new_xprt->rx_ia.ri_ops->ro_maxpages(new_xprt);
+   if (xprt->max_payload == 0)
+   goto out4;
+   xprt->max_payload <<= PAGE_SHIFT;
dprintk("RPC:   %s: transport data payload maximum: %zu bytes\n",
__func__, xprt->max_payload);
 
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index c3319e1..da55cda 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -2212,43 +2212,24 @@ rpcrdma_ep_post_recv(struct rpcrdma_ia *ia,
return rc;
 }
 
-/* Physical mapping means one Read/Write list entry per-page.
- * All list entries must fit within an inline buffer
- *
- * NB: The server must return a Write list for NFS READ,
- * which has the same constraint. Factor in the inline
- * rsize as well.
+/* How many chunk list items fit within our inline buffers?
  */
-static size_t
-rpcrdma_physical_max_payload(struct rpcrdma_xprt *r_xprt)
+unsigned int
+rpcrdma_max_segments(struct rpcrdma_xprt *r_xprt)
 {
struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
-   unsigned int inline_size, pages;
-
-   inline_size = min_t(unsigned int,
-   cdata->inline_wsize, cdata->inline_rsize);
-   inline_size -= RPCRDMA_HDRLEN_MIN;
-   pages = inline_size / sizeof(struct rpcrdma_segment);
-   return pages << PAGE_SHIFT;
-}
+   int bytes, segments;
 
-static size_t
-rpcrdma_m

[PATCH v3 09/15] xprtrdma: Add a "deregister_external" op for each memreg mode

2015-03-30 Thread Chuck Lever
There is very little common processing among the different external
memory deregistration functions.

Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
Tested-by: Meghana Cheripady 
Tested-by: Veeresh U. Kokatnur 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   27 
 net/sunrpc/xprtrdma/frwr_ops.c |   36 
 net/sunrpc/xprtrdma/physical_ops.c |   10 
 net/sunrpc/xprtrdma/rpc_rdma.c |   11 +++--
 net/sunrpc/xprtrdma/transport.c|4 +-
 net/sunrpc/xprtrdma/verbs.c|   81 
 net/sunrpc/xprtrdma/xprt_rdma.h|5 +-
 7 files changed, 84 insertions(+), 90 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 45fb646..888aa10 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -79,8 +79,35 @@ out_maperr:
return rc;
 }
 
+/* Use the ib_unmap_fmr() verb to prevent further remote
+ * access via RDMA READ or RDMA WRITE.
+ */
+static int
+fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_mr_seg *seg1 = seg;
+   int rc, nsegs = seg->mr_nsegs;
+   LIST_HEAD(l);
+
+   list_add(&seg1->rl_mw->r.fmr->list, &l);
+   rc = ib_unmap_fmr(&l);
+   read_lock(&ia->ri_qplock);
+   while (seg1->mr_nsegs--)
+   rpcrdma_unmap_one(ia, seg++);
+   read_unlock(&ia->ri_qplock);
+   if (rc)
+   goto out_err;
+   return nsegs;
+
+out_err:
+   dprintk("RPC:   %s: ib_unmap_fmr status %i\n", __func__, rc);
+   return nsegs;
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
+   .ro_unmap   = fmr_op_unmap,
.ro_maxpages= fmr_op_maxpages,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 23e4d99..35b725b 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -110,8 +110,44 @@ out_senderr:
return rc;
 }
 
+/* Post a LOCAL_INV Work Request to prevent further remote access
+ * via RDMA READ or RDMA WRITE.
+ */
+static int
+frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+   struct rpcrdma_mr_seg *seg1 = seg;
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct ib_send_wr invalidate_wr, *bad_wr;
+   int rc, nsegs = seg->mr_nsegs;
+
+   seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
+
+   memset(&invalidate_wr, 0, sizeof(invalidate_wr));
+   invalidate_wr.wr_id = (unsigned long)(void *)seg1->rl_mw;
+   invalidate_wr.opcode = IB_WR_LOCAL_INV;
+   invalidate_wr.ex.invalidate_rkey = seg1->rl_mw->r.frmr.fr_mr->rkey;
+   DECR_CQCOUNT(&r_xprt->rx_ep);
+
+   read_lock(&ia->ri_qplock);
+   while (seg1->mr_nsegs--)
+   rpcrdma_unmap_one(ia, seg++);
+   rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
+   read_unlock(&ia->ri_qplock);
+   if (rc)
+   goto out_err;
+   return nsegs;
+
+out_err:
+   /* Force rpcrdma_buffer_get() to retry */
+   seg1->rl_mw->r.frmr.fr_state = FRMR_IS_STALE;
+   dprintk("RPC:   %s: ib_post_send status %i\n", __func__, rc);
+   return nsegs;
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
+   .ro_unmap   = frwr_op_unmap,
.ro_maxpages= frwr_op_maxpages,
.ro_displayname = "frwr",
 };
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index 5a284ee..5b5a63a 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -44,8 +44,18 @@ physical_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
return 1;
 }
 
+/* Unmap a memory region, but leave it registered.
+ */
+static int
+physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+   rpcrdma_unmap_one(&r_xprt->rx_ia, seg);
+   return 1;
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
.ro_map = physical_op_map,
+   .ro_unmap   = physical_op_unmap,
.ro_maxpages= physical_op_maxpages,
.ro_displayname = "physical",
 };
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 6ab1d03..2c53ea9 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -284,11 +284,12 @@ rpcrdma_create_chunks(struct rpc_rqst *rqst, struct xdr_buf *target,
return (unsigned 

[PATCH v3 12/15] xprtrdma: Add "destroy MRs" memreg op

2015-03-30 Thread Chuck Lever
Memory Region objects associated with a transport instance are
destroyed before the instance itself is shut down and freed.

Signed-off-by: Chuck Lever 
Reviewed-by: Sagi Grimberg 
Tested-by: Devesh Sharma 
Tested-by: Meghana Cheripady 
Tested-by: Veeresh U. Kokatnur 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   18 
 net/sunrpc/xprtrdma/frwr_ops.c |   14 ++
 net/sunrpc/xprtrdma/physical_ops.c |6 
 net/sunrpc/xprtrdma/verbs.c|   52 +---
 net/sunrpc/xprtrdma/xprt_rdma.h|1 +
 5 files changed, 40 insertions(+), 51 deletions(-)
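
With ro_destroy in place, the teardown path in rpcrdma_buffer_destroy()
presumably collapses to a single indirect call (the replacement side of
the verbs.c hunk is truncated below, so this is a sketch):

        ia->ri_ops->ro_destroy(buf);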

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 93261b0..e9ca594 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -168,11 +168,29 @@ fmr_op_reset(struct rpcrdma_xprt *r_xprt)
__func__, rc);
 }
 
+static void
+fmr_op_destroy(struct rpcrdma_buffer *buf)
+{
+   struct rpcrdma_mw *r;
+   int rc;
+
+   while (!list_empty(&buf->rb_all)) {
+   r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
+   list_del(&r->mw_all);
+   rc = ib_dealloc_fmr(r->r.fmr);
+   if (rc)
+   dprintk("RPC:   %s: ib_dealloc_fmr failed %i\n",
+   __func__, rc);
+   kfree(r);
+   }
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
.ro_unmap   = fmr_op_unmap,
.ro_maxpages= fmr_op_maxpages,
.ro_init= fmr_op_init,
.ro_reset   = fmr_op_reset,
+   .ro_destroy = fmr_op_destroy,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index c2bb29d..121e400 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -260,11 +260,25 @@ frwr_op_reset(struct rpcrdma_xprt *r_xprt)
}
 }
 
+static void
+frwr_op_destroy(struct rpcrdma_buffer *buf)
+{
+   struct rpcrdma_mw *r;
+
+   while (!list_empty(&buf->rb_all)) {
+   r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
+   list_del(&r->mw_all);
+   __frwr_release(r);
+   kfree(r);
+   }
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
.ro_unmap   = frwr_op_unmap,
.ro_maxpages= frwr_op_maxpages,
.ro_init= frwr_op_init,
.ro_reset   = frwr_op_reset,
+   .ro_destroy = frwr_op_destroy,
.ro_displayname = "frwr",
 };
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index e060713..eb39011 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -64,11 +64,17 @@ physical_op_reset(struct rpcrdma_xprt *r_xprt)
 {
 }
 
+static void
+physical_op_destroy(struct rpcrdma_buffer *buf)
+{
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
.ro_map = physical_op_map,
.ro_unmap   = physical_op_unmap,
.ro_maxpages= physical_op_maxpages,
.ro_init= physical_op_init,
.ro_reset   = physical_op_reset,
+   .ro_destroy = physical_op_destroy,
.ro_displayname = "physical",
 };
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 1b2c1f4..a7fb314 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1199,47 +1199,6 @@ rpcrdma_destroy_req(struct rpcrdma_ia *ia, struct rpcrdma_req *req)
kfree(req);
 }
 
-static void
-rpcrdma_destroy_fmrs(struct rpcrdma_buffer *buf)
-{
-   struct rpcrdma_mw *r;
-   int rc;
-
-   while (!list_empty(&buf->rb_all)) {
-   r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
-   list_del(&r->mw_all);
-   list_del(&r->mw_list);
-
-   rc = ib_dealloc_fmr(r->r.fmr);
-   if (rc)
-   dprintk("RPC:   %s: ib_dealloc_fmr failed %i\n",
-   __func__, rc);
-
-   kfree(r);
-   }
-}
-
-static void
-rpcrdma_destroy_frmrs(struct rpcrdma_buffer *buf)
-{
-   struct rpcrdma_mw *r;
-   int rc;
-
-   while (!list_empty(&buf->rb_all)) {
-   r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
-   list_del(&r->mw_all);
-   list_

[PATCH v3 13/15] xprtrdma: Add "open" memreg op

2015-03-30 Thread Chuck Lever
The open op determines the size of various transport data structures
based on device capabilities and memory registration mode.

Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
Tested-by: Meghana Cheripady 
Tested-by: Veeresh U. Kokatnur 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |8 ++
 net/sunrpc/xprtrdma/frwr_ops.c |   48 +++
 net/sunrpc/xprtrdma/physical_ops.c |8 ++
 net/sunrpc/xprtrdma/verbs.c|   49 ++--
 net/sunrpc/xprtrdma/xprt_rdma.h|3 ++
 5 files changed, 70 insertions(+), 46 deletions(-)
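
A worked example of the frwr_op_open() arithmetic below: assume
RPCRDMA_MAX_DATA_SEGS is 64 and the device reports a
max_fast_reg_page_list_len of 16 (an assumed value). Then
ri_max_frmr_depth = 16, delta = 64 - 16 = 48, and the loop adds two
WRs per iteration (48 -> 32 -> 16 -> 0), giving depth = 7 + 6 = 13.
max_send_wr is multiplied by 13; if that exceeds devattr->max_qp_wr,
cdata->max_requests is scaled down so the send queue fits the device.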

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index e9ca594..e8a9837 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -20,6 +20,13 @@
 /* Maximum scatter/gather per FMR */
 #define RPCRDMA_MAX_FMR_SGES   (64)
 
+static int
+fmr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
+   struct rpcrdma_create_data_internal *cdata)
+{
+   return 0;
+}
+
 /* FMR mode conveys up to 64 pages of payload per chunk segment.
  */
 static size_t
@@ -188,6 +195,7 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
.ro_unmap   = fmr_op_unmap,
+   .ro_open= fmr_op_open,
.ro_maxpages= fmr_op_maxpages,
.ro_init= fmr_op_init,
.ro_reset   = fmr_op_reset,
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 121e400..e17d54d 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -58,6 +58,53 @@ __frwr_release(struct rpcrdma_mw *r)
ib_free_fast_reg_page_list(r->r.frmr.fr_pgl);
 }
 
+static int
+frwr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
+struct rpcrdma_create_data_internal *cdata)
+{
+   struct ib_device_attr *devattr = &ia->ri_devattr;
+   int depth, delta;
+
+   ia->ri_max_frmr_depth =
+   min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
+ devattr->max_fast_reg_page_list_len);
+   dprintk("RPC:   %s: device's max FR page list len = %u\n",
+   __func__, ia->ri_max_frmr_depth);
+
+   /* Add room for frmr register and invalidate WRs.
+* 1. FRMR reg WR for head
+* 2. FRMR invalidate WR for head
+* 3. N FRMR reg WRs for pagelist
+* 4. N FRMR invalidate WRs for pagelist
+* 5. FRMR reg WR for tail
+* 6. FRMR invalidate WR for tail
+* 7. The RDMA_SEND WR
+*/
+   depth = 7;
+
+   /* Calculate N if the device max FRMR depth is smaller than
+* RPCRDMA_MAX_DATA_SEGS.
+*/
+   if (ia->ri_max_frmr_depth < RPCRDMA_MAX_DATA_SEGS) {
+   delta = RPCRDMA_MAX_DATA_SEGS - ia->ri_max_frmr_depth;
+   do {
+   depth += 2; /* FRMR reg + invalidate */
+   delta -= ia->ri_max_frmr_depth;
+   } while (delta > 0);
+   }
+
+   ep->rep_attr.cap.max_send_wr *= depth;
+   if (ep->rep_attr.cap.max_send_wr > devattr->max_qp_wr) {
+   cdata->max_requests = devattr->max_qp_wr / depth;
+   if (!cdata->max_requests)
+   return -EINVAL;
+   ep->rep_attr.cap.max_send_wr = cdata->max_requests *
+  depth;
+   }
+
+   return 0;
+}
+
 /* FRWR mode conveys a list of pages per chunk segment. The
  * maximum length of that list is the FRWR page list depth.
  */
@@ -276,6 +323,7 @@ frwr_op_destroy(struct rpcrdma_buffer *buf)
 const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
.ro_unmap   = frwr_op_unmap,
+   .ro_open= frwr_op_open,
.ro_maxpages= frwr_op_maxpages,
.ro_init= frwr_op_init,
.ro_reset   = frwr_op_reset,
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index eb39011..0ba130b 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -19,6 +19,13 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+static int
+physical_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
+struct rpcrdma_create_data_internal *cdata)
+{
+   return 0;
+}
+
 /* PHYSICAL memory registration conveys one page per chunk segment.
  */
 static size_t
@@ -72,6 +79,7 @@ physical_op_destroy(struct rpcrdma_buffer *buf)
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
.ro_map = physical_op_map,
 

[PATCH v3 10/15] xprtrdma: Add "init MRs" memreg op

2015-03-30 Thread Chuck Lever
This method is used when setting up a new transport instance to
create a pool of Memory Region objects that will be used to register
memory during operation.

Memory Regions are not needed for "physical" registration, since
->prepare and ->release are no-ops for that mode.
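
For context, a minimal sketch of how buffer creation is expected to
dispatch the new method (the verbs.c hunk is truncated below, so the
placement is an assumption; "physical" presumably just returns 0):

	/* Sketch: rpcrdma_buffer_create() delegates MR pool setup
	 * to the memory registration mode in use.
	 */
	rc = ia->ri_ops->ro_init(r_xprt);
	if (rc)
		return rc;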

Signed-off-by: Chuck Lever 
Reviewed-by: Sagi Grimberg 
Tested-by: Devesh Sharma 
Tested-by: Meghana Cheripady 
Tested-by: Veeresh U. Kokatnur 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   42 +++
 net/sunrpc/xprtrdma/frwr_ops.c |   66 +++
 net/sunrpc/xprtrdma/physical_ops.c |7 ++
 net/sunrpc/xprtrdma/verbs.c|  104 +---
 net/sunrpc/xprtrdma/xprt_rdma.h|1 
 5 files changed, 119 insertions(+), 101 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 888aa10..825ce96 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -29,6 +29,47 @@ fmr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * RPCRDMA_MAX_FMR_SGES);
 }
 
+static int
+fmr_op_init(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   int mr_access_flags = IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ;
+   struct ib_fmr_attr fmr_attr = {
+   .max_pages  = RPCRDMA_MAX_FMR_SGES,
+   .max_maps   = 1,
+   .page_shift = PAGE_SHIFT
+   };
+   struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
+   struct rpcrdma_mw *r;
+   int i, rc;
+
+   INIT_LIST_HEAD(&buf->rb_mws);
+   INIT_LIST_HEAD(&buf->rb_all);
+
+   i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
+   dprintk("RPC:   %s: initalizing %d FMRs\n", __func__, i);
+
+   while (i--) {
+   r = kzalloc(sizeof(*r), GFP_KERNEL);
+   if (!r)
+   return -ENOMEM;
+
+   r->r.fmr = ib_alloc_fmr(pd, mr_access_flags, &fmr_attr);
+   if (IS_ERR(r->r.fmr))
+   goto out_fmr_err;
+
+   list_add(&r->mw_list, &buf->rb_mws);
+   list_add(&r->mw_all, &buf->rb_all);
+   }
+   return 0;
+
+out_fmr_err:
+   rc = PTR_ERR(r->r.fmr);
+   dprintk("RPC:   %s: ib_alloc_fmr status %i\n", __func__, rc);
+   kfree(r);
+   return rc;
+}
+
 /* Use the ib_map_phys_fmr() verb to register a memory region
  * for remote access via RDMA READ or RDMA WRITE.
  */
@@ -109,5 +150,6 @@ const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
.ro_unmap   = fmr_op_unmap,
.ro_maxpages= fmr_op_maxpages,
+   .ro_init= fmr_op_init,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 35b725b..9168c15 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -17,6 +17,35 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+static int
+__frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, struct ib_device *device,
+   unsigned int depth)
+{
+   struct rpcrdma_frmr *f = &r->r.frmr;
+   int rc;
+
+   f->fr_mr = ib_alloc_fast_reg_mr(pd, depth);
+   if (IS_ERR(f->fr_mr))
+   goto out_mr_err;
+   f->fr_pgl = ib_alloc_fast_reg_page_list(device, depth);
+   if (IS_ERR(f->fr_pgl))
+   goto out_list_err;
+   return 0;
+
+out_mr_err:
+   rc = PTR_ERR(f->fr_mr);
+   dprintk("RPC:   %s: ib_alloc_fast_reg_mr status %i\n",
+   __func__, rc);
+   return rc;
+
+out_list_err:
+   rc = PTR_ERR(f->fr_pgl);
+   dprintk("RPC:   %s: ib_alloc_fast_reg_page_list status %i\n",
+   __func__, rc);
+   ib_dereg_mr(f->fr_mr);
+   return rc;
+}
+
 /* FRWR mode conveys a list of pages per chunk segment. The
  * maximum length of that list is the FRWR page list depth.
  */
@@ -29,6 +58,42 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
 }
 
+static int
+frwr_op_init(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   struct ib_device *device = r_xprt->rx_ia.ri_id->device;
+   unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
+   struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
+   int i;
+
+   INIT_LIST_HEAD(&buf->rb_mws);
+   INIT_LIST_HEAD(&buf->rb_all);
+
+   i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
+   dprintk("RPC:   %s: initalizing %d FRMRs\n", __func__, i);
+
+   while (i--) {
+   struct rpcrdma_mw *r;
+   

[PATCH v3 11/15] xprtrdma: Add "reset MRs" memreg op

2015-03-30 Thread Chuck Lever
This method is invoked when a transport instance is about to be
reconnected. Each Memory Region object is reset to its initial
state.
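
For context, a minimal sketch of the expected call site, assuming the
transport connect worker invokes it just before reconnecting (consistent
with the comments in the diff below):

	/* Sketch: reset every MR so that no stale rkey survives
	 * across the reconnect.
	 */
	ia->ri_ops->ro_reset(r_xprt);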

Signed-off-by: Chuck Lever 
Reviewed-by: Sagi Grimberg 
Tested-by: Devesh Sharma 
Tested-by: Meghana Cheripady 
Tested-by: Veeresh U. Kokatnur 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   23 
 net/sunrpc/xprtrdma/frwr_ops.c |   51 ++
 net/sunrpc/xprtrdma/physical_ops.c |6 ++
 net/sunrpc/xprtrdma/verbs.c|  103 +---
 net/sunrpc/xprtrdma/xprt_rdma.h|1 
 5 files changed, 83 insertions(+), 101 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 825ce96..93261b0 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -146,10 +146,33 @@ out_err:
return nsegs;
 }
 
+/* After a disconnect, unmap all FMRs.
+ *
+ * This is invoked only in the transport connect worker in order
+ * to serialize with rpcrdma_register_fmr_external().
+ */
+static void
+fmr_op_reset(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   struct rpcrdma_mw *r;
+   LIST_HEAD(list);
+   int rc;
+
+   list_for_each_entry(r, &buf->rb_all, mw_all)
+   list_add(&r->r.fmr->list, &list);
+
+   rc = ib_unmap_fmr(&list);
+   if (rc)
+   dprintk("RPC:   %s: ib_unmap_fmr failed %i\n",
+   __func__, rc);
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
.ro_unmap   = fmr_op_unmap,
.ro_maxpages= fmr_op_maxpages,
.ro_init= fmr_op_init,
+   .ro_reset   = fmr_op_reset,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 9168c15..c2bb29d 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -46,6 +46,18 @@ out_list_err:
return rc;
 }
 
+static void
+__frwr_release(struct rpcrdma_mw *r)
+{
+   int rc;
+
+   rc = ib_dereg_mr(r->r.frmr.fr_mr);
+   if (rc)
+   dprintk("RPC:   %s: ib_dereg_mr status %i\n",
+   __func__, rc);
+   ib_free_fast_reg_page_list(r->r.frmr.fr_pgl);
+}
+
 /* FRWR mode conveys a list of pages per chunk segment. The
  * maximum length of that list is the FRWR page list depth.
  */
@@ -210,10 +222,49 @@ out_err:
return nsegs;
 }
 
+/* After a disconnect, a flushed FAST_REG_MR can leave an FRMR in
+ * an unusable state. Find FRMRs in this state and dereg / reg
+ * each.  FRMRs that are VALID and attached to an rpcrdma_req are
+ * also torn down.
+ *
+ * This gives all in-use FRMRs a fresh rkey and leaves them INVALID.
+ *
+ * This is invoked only in the transport connect worker in order
+ * to serialize with rpcrdma_register_frmr_external().
+ */
+static void
+frwr_op_reset(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   struct ib_device *device = r_xprt->rx_ia.ri_id->device;
+   unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
+   struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
+   struct rpcrdma_mw *r;
+   int rc;
+
+   list_for_each_entry(r, &buf->rb_all, mw_all) {
+   if (r->r.frmr.fr_state == FRMR_IS_INVALID)
+   continue;
+
+   __frwr_release(r);
+   rc = __frwr_init(r, pd, device, depth);
+   if (rc) {
+   dprintk("RPC:   %s: mw %p left %s\n",
+   __func__, r,
+   (r->r.frmr.fr_state == FRMR_IS_STALE ?
+   "stale" : "valid"));
+   continue;
+   }
+
+   r->r.frmr.fr_state = FRMR_IS_INVALID;
+   }
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
.ro_map = frwr_op_map,
.ro_unmap   = frwr_op_unmap,
.ro_maxpages= frwr_op_maxpages,
.ro_init= frwr_op_init,
+   .ro_reset   = frwr_op_reset,
.ro_displayname = "frwr",
 };
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index c372051..e060713 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -59,10 +59,16 @@ physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
return 1;
 }
 
+static void
+physical_op_reset(struct rpcrdma_xprt *r_xprt)
+{
+}
+
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops =

[PATCH v3 14/15] xprtrdma: Handle non-SEND completions via a callout

2015-03-30 Thread Chuck Lever
Allow each memory registration mode to plug in a callout that handles
the completion of a memory registration operation.

Signed-off-by: Chuck Lever 
Reviewed-by: Sagi Grimberg 
Tested-by: Devesh Sharma 
Tested-by: Meghana Cheripady 
Tested-by: Veeresh U. Kokatnur 
---
 net/sunrpc/xprtrdma/frwr_ops.c  |   17 +
 net/sunrpc/xprtrdma/verbs.c |   16 ++--
 net/sunrpc/xprtrdma/xprt_rdma.h |5 +
 3 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index e17d54d..ea59c1b 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -117,6 +117,22 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
 }
 
+/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs to be reset. */
+static void
+frwr_sendcompletion(struct ib_wc *wc)
+{
+   struct rpcrdma_mw *r;
+
+   if (likely(wc->status == IB_WC_SUCCESS))
+   return;
+
+   /* WARNING: Only wr_id and status are reliable at this point */
+   r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+   dprintk("RPC:   %s: frmr %p (stale), status %d\n",
+   __func__, r, wc->status);
+   r->r.frmr.fr_state = FRMR_IS_STALE;
+}
+
 static int
 frwr_op_init(struct rpcrdma_xprt *r_xprt)
 {
@@ -148,6 +164,7 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
 
list_add(&r->mw_list, &buf->rb_mws);
list_add(&r->mw_all, &buf->rb_all);
+   r->mw_sendcompletion = frwr_sendcompletion;
}
 
return 0;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index b697b3e..cac06f2 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -186,7 +186,7 @@ static const char * const wc_status[] = {
"remote access error",
"remote operation error",
"transport retry counter exceeded",
-   "RNR retrycounter exceeded",
+   "RNR retry counter exceeded",
"local RDD violation error",
"remove invalid RD request",
"operation aborted",
@@ -204,21 +204,17 @@ static const char * const wc_status[] = {
 static void
 rpcrdma_sendcq_process_wc(struct ib_wc *wc)
 {
-   if (likely(wc->status == IB_WC_SUCCESS))
-   return;
-
/* WARNING: Only wr_id and status are reliable at this point */
-   if (wc->wr_id == 0ULL) {
-   if (wc->status != IB_WC_WR_FLUSH_ERR)
+   if (wc->wr_id == RPCRDMA_IGNORE_COMPLETION) {
+   if (wc->status != IB_WC_SUCCESS &&
+   wc->status != IB_WC_WR_FLUSH_ERR)
pr_err("RPC:   %s: SEND: %s\n",
   __func__, COMPLETION_MSG(wc->status));
} else {
struct rpcrdma_mw *r;
 
r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
-   r->r.frmr.fr_state = FRMR_IS_STALE;
-   pr_err("RPC:   %s: frmr %p (stale): %s\n",
-  __func__, r, COMPLETION_MSG(wc->status));
+   r->mw_sendcompletion(wc);
}
 }
 
@@ -1622,7 +1618,7 @@ rpcrdma_ep_post(struct rpcrdma_ia *ia,
}
 
send_wr.next = NULL;
-   send_wr.wr_id = 0ULL;   /* no send cookie */
+   send_wr.wr_id = RPCRDMA_IGNORE_COMPLETION;
send_wr.sg_list = req->rl_send_iov;
send_wr.num_sge = req->rl_niovs;
send_wr.opcode = IB_WR_SEND;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 9036fb4..54bcbe4 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -106,6 +106,10 @@ struct rpcrdma_ep {
 #define INIT_CQCOUNT(ep) atomic_set(&(ep)->rep_cqcount, (ep)->rep_cqinit)
 #define DECR_CQCOUNT(ep) atomic_sub_return(1, &(ep)->rep_cqcount)
 
+/* Force completion handler to ignore the signal
+ */
+#define RPCRDMA_IGNORE_COMPLETION  (0ULL)
+
 /* Registered buffer -- registered kmalloc'd memory for RDMA SEND/RECV
  *
  * The below structure appears at the front of a large region of kmalloc'd
@@ -206,6 +210,7 @@ struct rpcrdma_mw {
struct ib_fmr   *fmr;
struct rpcrdma_frmr frmr;
} r;
+   void(*mw_sendcompletion)(struct ib_wc *);
struct list_headmw_list;
struct list_headmw_all;
 };



[PATCH v3 15/15] xprtrdma: Make rpcrdma_{un}map_one() into inline functions

2015-03-30 Thread Chuck Lever
These functions are called in a loop for each page transferred via
RDMA READ or WRITE. Extract loop invariants and inline them to
reduce CPU overhead.
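
The xprt_rdma.h hunk is truncated from this excerpt; based on the call
sites visible below, the inlined helpers look roughly like this (a
sketch, not a verbatim copy of the patch):

	static inline enum dma_data_direction
	rpcrdma_data_dir(bool writing)
	{
		return writing ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
	}

	/* Sketch: the ib_device pointer and DMA direction are now
	 * computed once per chunk rather than once per page.
	 */
	static inline void
	rpcrdma_map_one(struct ib_device *device, struct rpcrdma_mr_seg *seg,
			enum dma_data_direction direction)
	{
		seg->mr_dir = direction;
		seg->mr_dmalen = seg->mr_len;

		if (seg->mr_page)
			seg->mr_dma = ib_dma_map_page(device, seg->mr_page,
					offset_in_page(seg->mr_offset),
					seg->mr_dmalen, seg->mr_dir);
		else
			seg->mr_dma = ib_dma_map_single(device,
					seg->mr_offset,
					seg->mr_dmalen, seg->mr_dir);
	}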

Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
Tested-by: Meghana Cheripady 
Tested-by: Veeresh U. Kokatnur 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   10 ++--
 net/sunrpc/xprtrdma/frwr_ops.c |   10 ++--
 net/sunrpc/xprtrdma/physical_ops.c |   10 ++--
 net/sunrpc/xprtrdma/verbs.c|   44 ++-
 net/sunrpc/xprtrdma/xprt_rdma.h|   45 ++--
 5 files changed, 73 insertions(+), 46 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index e8a9837..a91ba2c 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -85,6 +85,8 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
   int nsegs, bool writing)
 {
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct ib_device *device = ia->ri_id->device;
+   enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
struct rpcrdma_mw *mw = seg1->rl_mw;
u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
@@ -97,7 +99,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
if (nsegs > RPCRDMA_MAX_FMR_SGES)
nsegs = RPCRDMA_MAX_FMR_SGES;
for (i = 0; i < nsegs;) {
-   rpcrdma_map_one(ia, seg, writing);
+   rpcrdma_map_one(device, seg, direction);
physaddrs[i] = seg->mr_dma;
len += seg->mr_len;
++seg;
@@ -123,7 +125,7 @@ out_maperr:
__func__, len, (unsigned long long)seg1->mr_dma,
pageoff, i, rc);
while (i--)
-   rpcrdma_unmap_one(ia, --seg);
+   rpcrdma_unmap_one(device, --seg);
return rc;
 }
 
@@ -135,14 +137,16 @@ fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
 {
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mr_seg *seg1 = seg;
+   struct ib_device *device;
int rc, nsegs = seg->mr_nsegs;
LIST_HEAD(l);
 
list_add(&seg1->rl_mw->r.fmr->list, &l);
rc = ib_unmap_fmr(&l);
read_lock(&ia->ri_qplock);
+   device = ia->ri_id->device;
while (seg1->mr_nsegs--)
-   rpcrdma_unmap_one(ia, seg++);
+   rpcrdma_unmap_one(device, seg++);
read_unlock(&ia->ri_qplock);
if (rc)
goto out_err;
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index ea59c1b..0a7b9df 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -178,6 +178,8 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
int nsegs, bool writing)
 {
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct ib_device *device = ia->ri_id->device;
+   enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
struct rpcrdma_mw *mw = seg1->rl_mw;
struct rpcrdma_frmr *frmr = &mw->r.frmr;
@@ -197,7 +199,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
if (nsegs > ia->ri_max_frmr_depth)
nsegs = ia->ri_max_frmr_depth;
for (page_no = i = 0; i < nsegs;) {
-   rpcrdma_map_one(ia, seg, writing);
+   rpcrdma_map_one(device, seg, direction);
pa = seg->mr_dma;
for (seg_len = seg->mr_len; seg_len > 0; seg_len -= PAGE_SIZE) {
frmr->fr_pgl->page_list[page_no++] = pa;
@@ -247,7 +249,7 @@ out_senderr:
ib_update_fast_reg_key(mr, --key);
frmr->fr_state = FRMR_IS_INVALID;
while (i--)
-   rpcrdma_unmap_one(ia, --seg);
+   rpcrdma_unmap_one(device, --seg);
return rc;
 }
 
@@ -261,6 +263,7 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct ib_send_wr invalidate_wr, *bad_wr;
int rc, nsegs = seg->mr_nsegs;
+   struct ib_device *device;
 
seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
 
@@ -271,8 +274,9 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
DECR_CQCOUNT(&r_xprt->rx_ep);
 
read_lock(&ia->ri_qplock);
+   device = ia->ri_id->device;
while (seg1->mr_nsegs--)
-   rpcrdma_unmap_one(ia, seg++);
+   rpcrdma_unmap_one(device, seg++);
rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
read_unlock(&ia->ri_qplock);
if (rc)
diff --git a/net/sunrpc/xprtrdma/physica

[PATCH v1 00/14] client NFS/RDMA patches for 4.2

2015-05-04 Thread Chuck Lever
I'd like these patches considered for merging upstream. This patch
series includes:

 - JIT allocation of rpcrdma_mw structures
 - Break-up of rb_lock
 - Reduction of how many rpcrdma_mw structs are needed per transport

These are pre-requisites for increasing the RPC slot count and
r/wsize on RPC/RDMA transports. And:

 - An RPC/RDMA transport fault injector

This is useful to discover regressions in logic for handling
transport disconnection and recovery.

You can find these in my git repo in the "nfs-rdma-for-4.2" topic
branch. See:

  git://git.linux-nfs.org/projects/cel/cel-2.6.git

Or

  http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=summary

Thanks in advance for patch review!

---

Chuck Lever (14):
  xprtrdma: Transport fault injection
  xprtrdma: Warn when there are orphaned IB objects
  xprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt
  xprtrdma: Use ib_device pointer safely
  xprtrdma: Introduce helpers for allocating MWs
  xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()
  xprtrdma: Introduce an FRMR recovery workqueue
  xprtrdma: Acquire MRs in rpcrdma_register_external()
  xprtrdma: Remove unused LOCAL_INV recovery logic
  xprtrdma: Remove ->ro_reset
  xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy
  xprtrdma: Split rb_lock
  xprtrdma: Stack relief in fmr_op_map()
  xprtrdma: Reduce per-transport MR allocation


 include/linux/sunrpc/xprtrdma.h|3 
 net/sunrpc/Kconfig |   12 ++
 net/sunrpc/xprtrdma/fmr_ops.c  |  120 +++---
 net/sunrpc/xprtrdma/frwr_ops.c |  224 -
 net/sunrpc/xprtrdma/physical_ops.c |   14 --
 net/sunrpc/xprtrdma/rpc_rdma.c |7 -
 net/sunrpc/xprtrdma/transport.c|   52 +++-
 net/sunrpc/xprtrdma/verbs.c|  241 +---
 net/sunrpc/xprtrdma/xprt_rdma.h|   38 --
 9 files changed, 387 insertions(+), 324 deletions(-)

--
Chuck Lever


[PATCH v1 14/14] xprtrdma: Reduce per-transport MR allocation

2015-05-04 Thread Chuck Lever
Reduce resource consumption per-transport to make way for increasing
the credit limit and maximum r/wsize. Pre-allocate fewer MRs.
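
A worked example of the reduction, assuming RPCRDMA_MAX_DATA_SEGS = 64,
RPCRDMA_MAX_FMR_SGES = 64, RPCRDMA_MAX_SEGS = 66, and the default 32
RPC slots (all four values are assumptions about the defaults of that
era):

	/* before: one maximal set of MRs per slot, plus a spare set */
	(32 + 1) * RPCRDMA_MAX_SEGS	/* = 33 * 66 = 2178 MRs */

	/* after: head + pagelist + tail per RPC slot */
	(max(64 / 64, 1) + 2) * 32	/* =  3 * 32 =   96 MRs */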

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |6 --
 net/sunrpc/xprtrdma/frwr_ops.c |6 --
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 4a53ad5..f1e8daf 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -69,8 +69,10 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
INIT_LIST_HEAD(&buf->rb_mws);
INIT_LIST_HEAD(&buf->rb_all);
 
-   i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
-   dprintk("RPC:   %s: initializing %d FMRs\n", __func__, i);
+   i = max_t(int, RPCRDMA_MAX_DATA_SEGS / RPCRDMA_MAX_FMR_SGES, 1);
+   i += 2; /* head + tail */
+   i *= buf->rb_max_requests;  /* one set for each RPC slot */
+   dprintk("RPC:   %s: initalizing %d FMRs\n", __func__, i);
 
rc = -ENOMEM;
while (i--) {
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index edc10ba..fc2d0c6 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -270,8 +270,10 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
INIT_LIST_HEAD(&buf->rb_mws);
INIT_LIST_HEAD(&buf->rb_all);
 
-   i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
-   dprintk("RPC:   %s: initializing %d FRMRs\n", __func__, i);
+   i = max_t(int, RPCRDMA_MAX_DATA_SEGS / depth, 1);
+   i += 2; /* head + tail */
+   i *= buf->rb_max_requests;  /* one set for each RPC slot */
+   dprintk("RPC:   %s: initalizing %d FRMRs\n", __func__, i);
 
while (i--) {
struct rpcrdma_mw *r;



[PATCH v2 07/16] xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()

2015-05-11 Thread Chuck Lever
Acquiring 64 FMRs in rpcrdma_buffer_get() while holding the buffer
pool lock is expensive, and unnecessary because FMR mode can
transfer up to a 1MB payload using just a single ib_fmr.

Instead, acquire ib_fmrs one-at-a-time as chunks are registered, and
return them to rb_mws immediately during deregistration.

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
---
 net/sunrpc/xprtrdma/fmr_ops.c |   52 ++---
 net/sunrpc/xprtrdma/verbs.c   |   26 -
 2 files changed, 48 insertions(+), 30 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 0a96155..53fb649 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -11,6 +11,21 @@
  * can take tens of usecs to complete.
  */
 
+/* Normal operation
+ *
+ * A Memory Region is prepared for RDMA READ or WRITE using the
+ * ib_map_phys_fmr verb (fmr_op_map). When the RDMA operation is
+ * finished, the Memory Region is unmapped using the ib_unmap_fmr
+ * verb (fmr_op_unmap).
+ */
+
+/* Transport recovery
+ *
+ * After a transport reconnect, fmr_op_map re-uses the MR already
+ * allocated for the RPC, but generates a fresh rkey then maps the
+ * MR again. This process is synchronous.
+ */
+
 #include "xprt_rdma.h"
 
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
@@ -77,6 +92,15 @@ out_fmr_err:
return rc;
 }
 
+static int
+__fmr_unmap(struct rpcrdma_mw *r)
+{
+   LIST_HEAD(l);
+
+   list_add(&r->r.fmr->list, &l);
+   return ib_unmap_fmr(&l);
+}
+
 /* Use the ib_map_phys_fmr() verb to register a memory region
  * for remote access via RDMA READ or RDMA WRITE.
  */
@@ -88,9 +112,22 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
struct ib_device *device = ia->ri_device;
enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
-   struct rpcrdma_mw *mw = seg1->rl_mw;
u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
int len, pageoff, i, rc;
+   struct rpcrdma_mw *mw;
+
+   mw = seg1->rl_mw;
+   seg1->rl_mw = NULL;
+   if (!mw) {
+   mw = rpcrdma_get_mw(r_xprt);
+   if (!mw)
+   return -ENOMEM;
+   } else {
+   /* this is a retransmit; generate a fresh rkey */
+   rc = __fmr_unmap(mw);
+   if (rc)
+   return rc;
+   }
 
pageoff = offset_in_page(seg1->mr_offset);
seg1->mr_offset -= pageoff; /* start of page */
@@ -114,6 +151,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
if (rc)
goto out_maperr;
 
+   seg1->rl_mw = mw;
seg1->mr_rkey = mw->r.fmr->rkey;
seg1->mr_base = seg1->mr_dma + pageoff;
seg1->mr_nsegs = i;
@@ -137,18 +175,24 @@ fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
 {
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mr_seg *seg1 = seg;
+   struct rpcrdma_mw *mw = seg1->rl_mw;
int rc, nsegs = seg->mr_nsegs;
-   LIST_HEAD(l);
 
-   list_add(&seg1->rl_mw->r.fmr->list, &l);
-   rc = ib_unmap_fmr(&l);
+   dprintk("RPC:   %s: FMR %p\n", __func__, mw);
+
+   seg1->rl_mw = NULL;
while (seg1->mr_nsegs--)
rpcrdma_unmap_one(ia->ri_device, seg++);
+   rc = __fmr_unmap(mw);
if (rc)
goto out_err;
+   rpcrdma_put_mw(r_xprt, mw);
return nsegs;
 
 out_err:
+   /* The FMR is abandoned, but remains in rb_all. fmr_op_destroy
+* will attempt to release it when the transport is destroyed.
+*/
dprintk("RPC:   %s: ib_unmap_fmr status %i\n", __func__, rc);
return nsegs;
 }
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index b7ca73e..3188e36 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1324,28 +1324,6 @@ rpcrdma_buffer_get_frmrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf,
return NULL;
 }
 
-static struct rpcrdma_req *
-rpcrdma_buffer_get_fmrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf)
-{
-   struct rpcrdma_mw *r;
-   int i;
-
-   i = RPCRDMA_MAX_SEGS - 1;
-   while (!list_empty(&buf->rb_mws)) {
-   r = list_entry(buf->rb_mws.next,
-  struct rpcrdma_mw, mw_list);
-   list_del(&r->mw_list);
-   req->rl_segments[i].rl_mw = r;
-   if (unlikely(i-- == 0))
-   return req; /* Success */
-   }
-
-   /* Not enough entries on rb_mws for this req */
-   rpcrdma_buffer_put_sendbuf(req, buf);
-   rpcrdma_buffer_put_mrs(req, buf);
-   return NULL;
-}
-
 /*
  * Get a set of request/reply buffers.
  *

[PATCH v2 03/16] xprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt

2015-05-11 Thread Chuck Lever
Clean up: Instead of carrying a pointer to the buffer pool and
the rpc_xprt, carry a pointer to the controlling rpcrdma_xprt.

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
Reviewed-by: Sagi Grimberg 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |4 ++--
 net/sunrpc/xprtrdma/transport.c |7 ++-
 net/sunrpc/xprtrdma/verbs.c |8 +---
 net/sunrpc/xprtrdma/xprt_rdma.h |3 +--
 4 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 2c53ea9..98a3b95 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -732,8 +732,8 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
struct rpcrdma_msg *headerp;
struct rpcrdma_req *req;
struct rpc_rqst *rqst;
-   struct rpc_xprt *xprt = rep->rr_xprt;
-   struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+   struct rpcrdma_xprt *r_xprt = rep->rr_rxprt;
+   struct rpc_xprt *xprt = &r_xprt->rx_xprt;
__be32 *iptr;
int rdmalen, status;
unsigned long cwnd;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index dfcd52e..25f7a6e 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -627,12 +627,9 @@ xprt_rdma_send_request(struct rpc_task *task)
 
if (req->rl_reply == NULL)  /* e.g. reconnection */
rpcrdma_recv_buffer_get(req);
-
-   if (req->rl_reply) {
+   /* rpcrdma_recv_buffer_get may have set rl_reply, so check again */
+   if (req->rl_reply)
req->rl_reply->rr_func = rpcrdma_reply_handler;
-   /* this need only be done once, but... */
-   req->rl_reply->rr_xprt = xprt;
-   }
 
/* Must suppress retransmit to maintain credits */
if (req->rl_connect_cookie == xprt->connect_cookie)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 51900e6..c55bfbc 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -278,6 +278,7 @@ rpcrdma_recvcq_process_wc(struct ib_wc *wc, struct list_head *sched_list)
 {
struct rpcrdma_rep *rep =
(struct rpcrdma_rep *)(unsigned long)wc->wr_id;
+   struct rpcrdma_ia *ia;
 
/* WARNING: Only wr_id and status are reliable at this point */
if (wc->status != IB_WC_SUCCESS)
@@ -290,8 +291,9 @@ rpcrdma_recvcq_process_wc(struct ib_wc *wc, struct list_head *sched_list)
dprintk("RPC:   %s: rep %p opcode 'recv', length %u: success\n",
__func__, rep, wc->byte_len);
 
+   ia = &rep->rr_rxprt->rx_ia;
rep->rr_len = wc->byte_len;
-   ib_dma_sync_single_for_cpu(rdmab_to_ia(rep->rr_buffer)->ri_id->device,
+   ib_dma_sync_single_for_cpu(ia->ri_id->device,
   rdmab_addr(rep->rr_rdmabuf),
   rep->rr_len, DMA_FROM_DEVICE);
prefetch(rdmab_to_msg(rep->rr_rdmabuf));
@@ -1053,7 +1055,7 @@ rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
goto out_free;
}
 
-   rep->rr_buffer = &r_xprt->rx_buf;
+   rep->rr_rxprt = r_xprt;
return rep;
 
 out_free:
@@ -1423,7 +1425,7 @@ rpcrdma_recv_buffer_get(struct rpcrdma_req *req)
 void
 rpcrdma_recv_buffer_put(struct rpcrdma_rep *rep)
 {
-   struct rpcrdma_buffer *buffers = rep->rr_buffer;
+   struct rpcrdma_buffer *buffers = &rep->rr_rxprt->rx_buf;
unsigned long flags;
 
rep->rr_func = NULL;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 78e0b8b..c3d57c0 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -173,8 +173,7 @@ struct rpcrdma_buffer;
 
 struct rpcrdma_rep {
unsigned intrr_len;
-   struct rpcrdma_buffer   *rr_buffer;
-   struct rpc_xprt *rr_xprt;
+   struct rpcrdma_xprt *rr_rxprt;
void(*rr_func)(struct rpcrdma_rep *);
struct list_headrr_list;
struct rpcrdma_regbuf   *rr_rdmabuf;



[PATCH v2 06/16] xprtrdma: Introduce helpers for allocating MWs

2015-05-11 Thread Chuck Lever
We eventually want to handle allocating MWs one at a time, as
needed, instead of grabbing 64 and throwing them at each RPC in the
pipeline.

Add a helper for grabbing an MW off rb_mws, and a helper for
returning an MW to rb_mws. These will be used in a subsequent patch.
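
As the diff below shows, the helpers themselves are trivial; the
intended usage pattern in the registration paths (see later patches in
this series) is roughly:

	mw = rpcrdma_get_mw(r_xprt);
	if (!mw)
		return -ENOMEM;
	/* ... register a chunk with mw ... */
	rpcrdma_put_mw(r_xprt, mw);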

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
Reviewed-by: Sagi Grimberg 
---
 net/sunrpc/xprtrdma/verbs.c |   31 +++
 net/sunrpc/xprtrdma/xprt_rdma.h |2 ++
 2 files changed, 33 insertions(+)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index ddd5b36..b7ca73e 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1173,6 +1173,37 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
kfree(buf->rb_pool);
 }
 
+struct rpcrdma_mw *
+rpcrdma_get_mw(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   struct rpcrdma_mw *mw = NULL;
+   unsigned long flags;
+
+   spin_lock_irqsave(&buf->rb_lock, flags);
+   if (!list_empty(&buf->rb_mws)) {
+   mw = list_first_entry(&buf->rb_mws,
+ struct rpcrdma_mw, mw_list);
+   list_del_init(&mw->mw_list);
+   }
+   spin_unlock_irqrestore(&buf->rb_lock, flags);
+
+   if (!mw)
+   pr_err("RPC:   %s: no MWs available\n", __func__);
+   return mw;
+}
+
+void
+rpcrdma_put_mw(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mw *mw)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   unsigned long flags;
+
+   spin_lock_irqsave(&buf->rb_lock, flags);
+   list_add_tail(&mw->mw_list, &buf->rb_mws);
+   spin_unlock_irqrestore(&buf->rb_lock, flags);
+}
+
 /* "*mw" can be NULL when rpcrdma_buffer_get_mrs() fails, leaving
  * some req segments uninitialized.
  */
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 300423d..5b801d5 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -413,6 +413,8 @@ int rpcrdma_ep_post_recv(struct rpcrdma_ia *, struct rpcrdma_ep *,
 int rpcrdma_buffer_create(struct rpcrdma_xprt *);
 void rpcrdma_buffer_destroy(struct rpcrdma_buffer *);
 
+struct rpcrdma_mw *rpcrdma_get_mw(struct rpcrdma_xprt *);
+void rpcrdma_put_mw(struct rpcrdma_xprt *, struct rpcrdma_mw *);
 struct rpcrdma_req *rpcrdma_buffer_get(struct rpcrdma_buffer *);
 void rpcrdma_buffer_put(struct rpcrdma_req *);
 void rpcrdma_recv_buffer_get(struct rpcrdma_req *);



[PATCH v2 16/16] SUNRPC: Clean up bc_send()

2015-05-11 Thread Chuck Lever
Clean up: Merge bc_send() into bc_svc_process().

Note: even though this touches svc.c, it is a client-side change.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/bc_xprt.h |1 -
 net/sunrpc/Makefile|2 +
 net/sunrpc/bc_svc.c|   63 
 net/sunrpc/svc.c   |   33 -
 4 files changed, 26 insertions(+), 73 deletions(-)
 delete mode 100644 net/sunrpc/bc_svc.c

diff --git a/include/linux/sunrpc/bc_xprt.h b/include/linux/sunrpc/bc_xprt.h
index 2ca67b5..8df43c9f 100644
--- a/include/linux/sunrpc/bc_xprt.h
+++ b/include/linux/sunrpc/bc_xprt.h
@@ -37,7 +37,6 @@ void xprt_complete_bc_request(struct rpc_rqst *req, uint32_t copied);
 void xprt_free_bc_request(struct rpc_rqst *req);
 int xprt_setup_backchannel(struct rpc_xprt *, unsigned int min_reqs);
 void xprt_destroy_backchannel(struct rpc_xprt *, unsigned int max_reqs);
-int bc_send(struct rpc_rqst *req);
 
 /*
  * Determine if a shared backchannel is in use
diff --git a/net/sunrpc/Makefile b/net/sunrpc/Makefile
index 15e6f6c..1b8e68d 100644
--- a/net/sunrpc/Makefile
+++ b/net/sunrpc/Makefile
@@ -15,6 +15,6 @@ sunrpc-y := clnt.o xprt.o socklib.o xprtsock.o sched.o \
sunrpc_syms.o cache.o rpc_pipe.o \
svc_xprt.o
 sunrpc-$(CONFIG_SUNRPC_DEBUG) += debugfs.o
-sunrpc-$(CONFIG_SUNRPC_BACKCHANNEL) += backchannel_rqst.o bc_svc.o
+sunrpc-$(CONFIG_SUNRPC_BACKCHANNEL) += backchannel_rqst.o
 sunrpc-$(CONFIG_PROC_FS) += stats.o
 sunrpc-$(CONFIG_SYSCTL) += sysctl.o
diff --git a/net/sunrpc/bc_svc.c b/net/sunrpc/bc_svc.c
deleted file mode 100644
index 15c7a8a..000
--- a/net/sunrpc/bc_svc.c
+++ /dev/null
@@ -1,63 +0,0 @@
-/**
-
-(c) 2007 Network Appliance, Inc.  All Rights Reserved.
-(c) 2009 NetApp.  All Rights Reserved.
-
-NetApp provides this source code under the GPL v2 License.
-The GPL v2 license is available at
-http://opensource.org/licenses/gpl-license.php.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
-"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
-LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
-A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
-CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-
-**/
-
-/*
- * The NFSv4.1 callback service helper routines.
- * They implement the transport level processing required to send the
- * reply over an existing open connection previously established by the client.
- */
-
-#include 
-
-#include 
-#include 
-#include 
-
-#define RPCDBG_FACILITYRPCDBG_SVCDSP
-
-/* Empty callback ops */
-static const struct rpc_call_ops nfs41_callback_ops = {
-};
-
-
-/*
- * Send the callback reply
- */
-int bc_send(struct rpc_rqst *req)
-{
-   struct rpc_task *task;
-   int ret;
-
-   dprintk("RPC:   bc_send req= %p\n", req);
-   task = rpc_run_bc_task(req, &nfs41_callback_ops);
-   if (IS_ERR(task))
-   ret = PTR_ERR(task);
-   else {
-   WARN_ON_ONCE(atomic_read(&task->tk_count) != 1);
-   ret = task->tk_status;
-   rpc_put_task(task);
-   }
-   dprintk("RPC:   bc_send ret= %d\n", ret);
-   return ret;
-}
-
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 78974e4..e144902 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1350,6 +1350,11 @@ bc_svc_process(struct svc_serv *serv, struct rpc_rqst *req,
 {
struct kvec *argv = &rqstp->rq_arg.head[0];
struct kvec *resv = &rqstp->rq_res.head[0];
+   static const struct rpc_call_ops reply_ops = { };
+   struct rpc_task *task;
+   int error;
+
+   dprintk("svc: %s(%p)\n", __func__, req);
 
/* Build the svc_rqst used by the common processing routine */
rqstp->rq_xprt = serv->sv_bc_xprt;
@@ -1372,21 +1377,33 @@ bc_svc_process(struct svc_serv *serv, struct rpc_rqst *req,
 
/*
 * Skip the next two words because they've already been
-* processed in the trasport
+* processed in the transport
 */
svc_getu32(argv);   /* XID */
svc_getnl(argv);/* CALLDIR */
 
-   /* Returns 1 for send, 0 for drop */
-   if (svc_process_common(rqstp, argv, resv)) {
-   memcpy(&req->rq_snd_buf, 

[PATCH v2 09/16] xprtrdma: Acquire MRs in rpcrdma_register_external()

2015-05-11 Thread Chuck Lever
Acquiring 64 MRs in rpcrdma_buffer_get() while holding the buffer
pool lock is expensive, and unnecessary because most modern adapters
can transfer 100s of KBs of payload using just a single MR.

Instead, acquire MRs one-at-a-time as chunks are registered, and
return them to rb_mws immediately during deregistration.

Note: commit 539431a437d2 ("xprtrdma: Don't invalidate FRMRs if
registration fails") is reverted: There is now a valid case where
registration can fail (with -ENOMEM) but the QP is still in RTS.

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
---
 net/sunrpc/xprtrdma/frwr_ops.c |  100 +++-
 net/sunrpc/xprtrdma/rpc_rdma.c |3 -
 net/sunrpc/xprtrdma/verbs.c|   21 
 3 files changed, 89 insertions(+), 35 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index a06d9a3..133edf6 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -11,6 +11,62 @@
  * but most complex memory registration mode.
  */
 
+/* Normal operation
+ *
+ * A Memory Region is prepared for RDMA READ or WRITE using a FAST_REG
+ * Work Request (frmr_op_map). When the RDMA operation is finished, this
+ * Memory Region is invalidated using a LOCAL_INV Work Request
+ * (frmr_op_unmap).
+ *
+ * Typically these Work Requests are not signaled, and neither are RDMA
+ * SEND Work Requests (with the exception of signaling occasionally to
+ * prevent provider work queue overflows). This greatly reduces HCA
+ * interrupt workload.
+ *
+ * As an optimization, frwr_op_unmap marks MRs INVALID before the
+ * LOCAL_INV WR is posted. If posting succeeds, the MR is placed on
+ * rb_mws immediately so that no work (like managing a linked list
+ * under a spinlock) is needed in the completion upcall.
+ *
+ * But this means that frwr_op_map() can occasionally encounter an MR
+ * that is INVALID but the LOCAL_INV WR has not completed. Work Queue
+ * ordering prevents a subsequent FAST_REG WR from executing against
+ * that MR while it is still being invalidated.
+ */
+
+/* Transport recovery
+ *
+ * ->op_map and the transport connect worker cannot run at the same
+ * time, but ->op_unmap can fire while the transport connect worker
+ * is running. Thus MR recovery is handled in ->op_map, to guarantee
+ * that recovered MRs are owned by a sending RPC, and not one where
+ * ->op_unmap could fire at the same time transport reconnect is
+ * being done.
+ *
+ * When the underlying transport disconnects, MRs are left in one of
+ * three states:
+ *
+ * INVALID:The MR was not in use before the QP entered ERROR state.
+ * (Or, the LOCAL_INV WR has not completed or flushed yet).
+ *
+ * STALE:  The MR was being registered or unregistered when the QP
+ * entered ERROR state, and the pending WR was flushed.
+ *
+ * VALID:  The MR was registered before the QP entered ERROR state.
+ *
+ * When frwr_op_map encounters STALE and VALID MRs, they are recovered
+ * with ib_dereg_mr and then are re-initialized. Beause MR recovery
+ * allocates fresh resources, it is deferred to a workqueue, and the
+ * recovered MRs are placed back on the rb_mws list when recovery is
+ * complete. frwr_op_map allocates another MR for the current RPC while
+ * the broken MR is reset.
+ *
+ * To ensure that frwr_op_map doesn't encounter an MR that is marked
+ * INVALID but that is about to be flushed due to a previous transport
+ * disconnect, the transport connect worker attempts to drain all
+ * pending send queue WRs before the transport is reconnected.
+ */
+
 #include "xprt_rdma.h"
 
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
@@ -250,9 +306,9 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
struct ib_device *device = ia->ri_device;
enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
-   struct rpcrdma_mw *mw = seg1->rl_mw;
-   struct rpcrdma_frmr *frmr = &mw->r.frmr;
-   struct ib_mr *mr = frmr->fr_mr;
+   struct rpcrdma_mw *mw;
+   struct rpcrdma_frmr *frmr;
+   struct ib_mr *mr;
struct ib_send_wr fastreg_wr, *bad_wr;
u8 key;
int len, pageoff;
@@ -261,12 +317,25 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
u64 pa;
int page_no;
 
+   mw = seg1->rl_mw;
+   seg1->rl_mw = NULL;
+   do {
+   if (mw)
+   __frwr_queue_recovery(mw);
+   mw = rpcrdma_get_mw(r_xprt);
+   if (!mw)
+   return -ENOMEM;
+   } while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
+   frmr = &mw->r.frmr;
+   frmr->fr_state = FRMR_IS_VALID;
+
pageoff = offset_in_page(seg1->mr_offset);
seg1->mr_offset -= pageoff; /* start of page */
seg1->mr_len += pageoff;
   

Re: [PATCH v2 00/16] NFS/RDMA patches proposed for 4.2

2015-05-26 Thread Chuck Lever

On May 26, 2015, at 11:28 AM, Doug Ledford  wrote:

> On Mon, 2015-05-11 at 14:02 -0400, Chuck Lever wrote:
>> I'd like these patches to be considered for merging upstream. This
>> patch series includes:
>> 
>>  - JIT allocation of rpcrdma_mw structures
>>  - Break-up of rb_lock
>>  - Reduction of how many rpcrdma_mw structs are needed per transport
>> 
>> These are pre-requisites for increasing the RPC slot count and
>> r/wsize on RPC/RDMA transports, and provide scalability benefits
>> even on their own. And:
>> 
>>  - A generic transport fault injector
>> 
>> This is useful to discover regressions in logic that handles
>> transport reconnection.
>> 
>> You can find these in my git repo in the "nfs-rdma-for-4.2" topic
>> branch. See:
>> 
>>  git://git.linux-nfs.org/projects/cel/cel-2.6.git
> 
> I assume you are planning on this going in through the nfs tree.  As
> such, I'm planning on removing this patchset from the linux-rdma
> patchworks site.

Yes, patches to net/sunrpc/xprtrdma/ will typically go through
Anna or Bruce. I post to linux-rdma for review of RDMA-related
changes.

> However, I'll add this for the series:
> 
> Reviewed-by: Doug Ledford 

Thanks, I will post a refresh today.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





[PATCH v3 00/17] NFS/RDMA client patches for 4.2

2015-05-26 Thread Chuck Lever
I'd like these patches to be considered for merging upstream. This
patch series includes:

 - Internal documentation of connection recovery
 - JIT allocation of rpcrdma_mw structures
 - Break-up of rb_lock
 - Reduction of how many rpcrdma_mw structs are needed per transport

These are pre-requisites for increasing the RPC slot count and
r/wsize on RPC/RDMA transports, and provide scalability benefits
even on their own. And:

 - A generic transport fault injector

This is useful to discover regressions in logic that handles
transport reconnection.

You can find these in my git repo in the "nfs-rdma-for-4.2" topic
branch. See:

 git://git.linux-nfs.org/projects/cel/cel-2.6.git


Changes since v2:
 - Rebased on 4.1-rc5
 - Updated Reviewed-by and Tested-by tags
 - Minor fix to XDR argument encoder for NFS SETACL


Changes since v1:

 - Rebased on 4.1-rc3
 - Transport fault injector controlled from debugfs rather than /proc
 - Transport fault injector works for all transport types
 - bc_send() clean up suggested by Christoph Hellwig
 - Added Reviewed-by: tags. Many thanks to reviewers!
 - Addressed all review comments but one: Sagi's comment about
ri_device remains unresolved.

---

Chuck Lever (17):
  SUNRPC: Transport fault injection
  xprtrdma: Warn when there are orphaned IB objects
  xprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt
  xprtrdma: Remove rr_func
  xprtrdma: Use ib_device pointer safely
  xprtrdma: Introduce helpers for allocating MWs
  xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()
  xprtrdma: Introduce an FRMR recovery workqueue
  xprtrdma: Acquire MRs in rpcrdma_register_external()
  xprtrdma: Remove unused LOCAL_INV recovery logic
  xprtrdma: Remove ->ro_reset
  xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy
  xprtrdma: Split rb_lock
  xprtrdma: Stack relief in fmr_op_map()
  xprtrdma: Reduce per-transport MR allocation
  SUNRPC: Clean up bc_send()
  NFS: Fix size of NFSACL SETACL operations


 fs/nfs/nfs3xdr.c   |2 
 include/linux/sunrpc/bc_xprt.h |1 
 include/linux/sunrpc/xprt.h|   19 +++
 include/linux/sunrpc/xprtrdma.h|3 
 net/sunrpc/Makefile|2 
 net/sunrpc/bc_svc.c|   63 -
 net/sunrpc/clnt.c  |1 
 net/sunrpc/debugfs.c   |   77 +++
 net/sunrpc/svc.c   |   33 -
 net/sunrpc/xprt.c  |2 
 net/sunrpc/xprtrdma/fmr_ops.c  |  120 +++--
 net/sunrpc/xprtrdma/frwr_ops.c |  227 +++-
 net/sunrpc/xprtrdma/physical_ops.c |   14 --
 net/sunrpc/xprtrdma/rpc_rdma.c |8 -
 net/sunrpc/xprtrdma/transport.c|   30 +++-
 net/sunrpc/xprtrdma/verbs.c|  257 +---
 net/sunrpc/xprtrdma/xprt_rdma.h|   38 -
 net/sunrpc/xprtsock.c  |   10 +
 18 files changed, 493 insertions(+), 414 deletions(-)
 delete mode 100644 net/sunrpc/bc_svc.c

--
Chuck Lever


[PATCH v3 02/17] xprtrdma: Warn when there are orphaned IB objects

2015-05-26 Thread Chuck Lever
WARN during transport destruction if ib_dealloc_pd() fails. This is
a sign that xprtrdma orphaned one or more RDMA API objects at some
point, which can pin lower layer kernel modules and cause shutdown
to hang.

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
Reviewed-by: Sagi Grimberg 
Reviewed-by: Devesh Sharma 
Tested-By: Devesh Sharma 
Reviewed-by: Doug Ledford 
---
 net/sunrpc/xprtrdma/verbs.c |   10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 4870d27..51900e6 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -702,17 +702,17 @@ rpcrdma_ia_close(struct rpcrdma_ia *ia)
dprintk("RPC:   %s: ib_dereg_mr returned %i\n",
__func__, rc);
}
+
if (ia->ri_id != NULL && !IS_ERR(ia->ri_id)) {
if (ia->ri_id->qp)
rdma_destroy_qp(ia->ri_id);
rdma_destroy_id(ia->ri_id);
ia->ri_id = NULL;
}
-   if (ia->ri_pd != NULL && !IS_ERR(ia->ri_pd)) {
-   rc = ib_dealloc_pd(ia->ri_pd);
-   dprintk("RPC:   %s: ib_dealloc_pd returned %i\n",
-   __func__, rc);
-   }
+
+   /* If the pd is still busy, xprtrdma missed freeing a resource */
+   if (ia->ri_pd && !IS_ERR(ia->ri_pd))
+   WARN_ON(ib_dealloc_pd(ia->ri_pd));
 }
 
 /*



[PATCH v3 01/17] SUNRPC: Transport fault injection

2015-05-26 Thread Chuck Lever
It has been exceptionally useful to exercise the logic that handles
local immediate errors and RDMA connection loss.  To enable
developers to test this regularly and repeatably, add logic to
simulate connection loss every so often.

Fault injection is disabled by default. It is enabled with

  $ echo xxx | sudo tee /sys/kernel/debug/sunrpc/inject_fault/disconnect

where "xxx" is a large positive number of transport method calls
before a disconnect. A value of several thousand is usually a good
number that allows reasonable forward progress while still causing a
lot of connection drops.

These hooks are disabled when SUNRPC_DEBUG is turned off.
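
The transport.c hunk is truncated from this excerpt; for RPC/RDMA, the
->inject_disconnect method plausibly reduces to a sketch like this (the
body is an assumption):

	static void
	xprt_rdma_inject_disconnect(struct rpc_xprt *xprt)
	{
		struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);

		/* Drop the RDMA connection; the normal reconnect and
		 * retransmit logic must then recover.
		 */
		rdma_disconnect(r_xprt->rx_ia.ri_id);
	}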

Signed-off-by: Chuck Lever 
Tested-By: Devesh Sharma 
Reviewed-by: Doug Ledford 
---
 include/linux/sunrpc/xprt.h |   19 ++
 net/sunrpc/clnt.c   |1 +
 net/sunrpc/debugfs.c|   77 +++
 net/sunrpc/xprt.c   |2 +
 net/sunrpc/xprtrdma/transport.c |   13 ++-
 net/sunrpc/xprtsock.c   |   10 +
 6 files changed, 121 insertions(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 8b93ef5..178190a 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -133,6 +133,7 @@ struct rpc_xprt_ops {
void(*close)(struct rpc_xprt *xprt);
void(*destroy)(struct rpc_xprt *xprt);
void(*print_stats)(struct rpc_xprt *xprt, struct seq_file *seq);
+   void(*inject_disconnect)(struct rpc_xprt *xprt);
 };
 
 /*
@@ -241,6 +242,7 @@ struct rpc_xprt {
const char  *address_strings[RPC_DISPLAY_MAX];
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
struct dentry   *debugfs;   /* debugfs directory */
+   atomic_tinject_disconnect;
 #endif
 };
 
@@ -431,6 +433,23 @@ static inline int xprt_test_and_set_binding(struct rpc_xprt *xprt)
return test_and_set_bit(XPRT_BINDING, &xprt->state);
 }
 
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+extern unsigned int rpc_inject_disconnect;
+static inline void xprt_inject_disconnect(struct rpc_xprt *xprt)
+{
+   if (!rpc_inject_disconnect)
+   return;
+   if (atomic_dec_return(&xprt->inject_disconnect))
+   return;
+   atomic_set(&xprt->inject_disconnect, rpc_inject_disconnect);
+   xprt->ops->inject_disconnect(xprt);
+}
+#else
+static inline void xprt_inject_disconnect(struct rpc_xprt *xprt)
+{
+}
+#endif
+
 #endif /* __KERNEL__*/
 
 #endif /* _LINUX_SUNRPC_XPRT_H */
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index e6ce151..db4efb6 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -1614,6 +1614,7 @@ call_allocate(struct rpc_task *task)
req->rq_callsize + req->rq_rcvsize);
if (req->rq_buffer != NULL)
return;
+   xprt_inject_disconnect(xprt);
 
dprintk("RPC: %5u rpc_buffer allocation failed\n", task->tk_pid);
 
diff --git a/net/sunrpc/debugfs.c b/net/sunrpc/debugfs.c
index 82962f7..7cc1b8a 100644
--- a/net/sunrpc/debugfs.c
+++ b/net/sunrpc/debugfs.c
@@ -10,9 +10,12 @@
 #include "netns.h"
 
 static struct dentry *topdir;
+static struct dentry *rpc_fault_dir;
 static struct dentry *rpc_clnt_dir;
 static struct dentry *rpc_xprt_dir;
 
+unsigned int rpc_inject_disconnect;
+
 struct rpc_clnt_iter {
struct rpc_clnt *clnt;
loff_t  pos;
@@ -257,6 +260,8 @@ rpc_xprt_debugfs_register(struct rpc_xprt *xprt)
debugfs_remove_recursive(xprt->debugfs);
xprt->debugfs = NULL;
}
+
+   atomic_set(&xprt->inject_disconnect, rpc_inject_disconnect);
 }
 
 void
@@ -266,11 +271,78 @@ rpc_xprt_debugfs_unregister(struct rpc_xprt *xprt)
xprt->debugfs = NULL;
 }
 
+static int
+fault_open(struct inode *inode, struct file *filp)
+{
+   filp->private_data = kmalloc(128, GFP_KERNEL);
+   if (!filp->private_data)
+   return -ENOMEM;
+   return 0;
+}
+
+static int
+fault_release(struct inode *inode, struct file *filp)
+{
+   kfree(filp->private_data);
+   return 0;
+}
+
+static ssize_t
+fault_disconnect_read(struct file *filp, char __user *user_buf,
+ size_t len, loff_t *offset)
+{
+   char *buffer = (char *)filp->private_data;
+   size_t size;
+
+   size = sprintf(buffer, "%u\n", rpc_inject_disconnect);
+   return simple_read_from_buffer(user_buf, len, offset, buffer, size);
+}
+
+static ssize_t
+fault_disconnect_write(struct file *filp, const char __user *user_buf,
+  size_t len, loff_t *offset)
+{
+   char buffer[16];
+
+   len = min(len, sizeof(buffer) - 1);
+   if (copy_from_user(buffer, user_buf, len))
+   return -EFAULT;
+   buffer[len] = '\0';
+   if (kstrt

[PATCH v3 03/17] xprtrdma: Replace rpcrdma_rep::rr_buffer with rr_rxprt

2015-05-26 Thread Chuck Lever
Clean up: Instead of carrying a pointer to the buffer pool and
the rpc_xprt, carry a pointer to the controlling rpcrdma_xprt.

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
Reviewed-by: Sagi Grimberg 
Tested-By: Devesh Sharma 
Reviewed-by: Doug Ledford 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |4 ++--
 net/sunrpc/xprtrdma/transport.c |7 ++-
 net/sunrpc/xprtrdma/verbs.c |8 +---
 net/sunrpc/xprtrdma/xprt_rdma.h |3 +--
 4 files changed, 10 insertions(+), 12 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 2c53ea9..98a3b95 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -732,8 +732,8 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
struct rpcrdma_msg *headerp;
struct rpcrdma_req *req;
struct rpc_rqst *rqst;
-   struct rpc_xprt *xprt = rep->rr_xprt;
-   struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+   struct rpcrdma_xprt *r_xprt = rep->rr_rxprt;
+   struct rpc_xprt *xprt = &r_xprt->rx_xprt;
__be32 *iptr;
int rdmalen, status;
unsigned long cwnd;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index dfcd52e..25f7a6e 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -627,12 +627,9 @@ xprt_rdma_send_request(struct rpc_task *task)
 
if (req->rl_reply == NULL)  /* e.g. reconnection */
rpcrdma_recv_buffer_get(req);
-
-   if (req->rl_reply) {
+   /* rpcrdma_recv_buffer_get may have set rl_reply, so check again */
+   if (req->rl_reply)
req->rl_reply->rr_func = rpcrdma_reply_handler;
-   /* this need only be done once, but... */
-   req->rl_reply->rr_xprt = xprt;
-   }
 
/* Must suppress retransmit to maintain credits */
if (req->rl_connect_cookie == xprt->connect_cookie)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 51900e6..c55bfbc 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -278,6 +278,7 @@ rpcrdma_recvcq_process_wc(struct ib_wc *wc, struct list_head *sched_list)
 {
struct rpcrdma_rep *rep =
(struct rpcrdma_rep *)(unsigned long)wc->wr_id;
+   struct rpcrdma_ia *ia;
 
/* WARNING: Only wr_id and status are reliable at this point */
if (wc->status != IB_WC_SUCCESS)
@@ -290,8 +291,9 @@ rpcrdma_recvcq_process_wc(struct ib_wc *wc, struct list_head *sched_list)
dprintk("RPC:   %s: rep %p opcode 'recv', length %u: success\n",
__func__, rep, wc->byte_len);
 
+   ia = &rep->rr_rxprt->rx_ia;
rep->rr_len = wc->byte_len;
-   ib_dma_sync_single_for_cpu(rdmab_to_ia(rep->rr_buffer)->ri_id->device,
+   ib_dma_sync_single_for_cpu(ia->ri_id->device,
   rdmab_addr(rep->rr_rdmabuf),
   rep->rr_len, DMA_FROM_DEVICE);
prefetch(rdmab_to_msg(rep->rr_rdmabuf));
@@ -1053,7 +1055,7 @@ rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
goto out_free;
}
 
-   rep->rr_buffer = &r_xprt->rx_buf;
+   rep->rr_rxprt = r_xprt;
return rep;
 
 out_free:
@@ -1423,7 +1425,7 @@ rpcrdma_recv_buffer_get(struct rpcrdma_req *req)
 void
 rpcrdma_recv_buffer_put(struct rpcrdma_rep *rep)
 {
-   struct rpcrdma_buffer *buffers = rep->rr_buffer;
+   struct rpcrdma_buffer *buffers = &rep->rr_rxprt->rx_buf;
unsigned long flags;
 
rep->rr_func = NULL;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 78e0b8b..c3d57c0 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -173,8 +173,7 @@ struct rpcrdma_buffer;
 
 struct rpcrdma_rep {
unsigned intrr_len;
-   struct rpcrdma_buffer   *rr_buffer;
-   struct rpc_xprt *rr_xprt;
+   struct rpcrdma_xprt *rr_rxprt;
void(*rr_func)(struct rpcrdma_rep *);
struct list_headrr_list;
struct rpcrdma_regbuf   *rr_rdmabuf;



[PATCH v3 04/17] xprtrdma: Remove rr_func

2015-05-26 Thread Chuck Lever
A posted rpcrdma_rep never has rr_func set to anything but
rpcrdma_reply_handler.

Signed-off-by: Chuck Lever 
Tested-By: Devesh Sharma 
Reviewed-by: Doug Ledford 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |1 -
 net/sunrpc/xprtrdma/transport.c |3 ---
 net/sunrpc/xprtrdma/verbs.c |   10 +-
 net/sunrpc/xprtrdma/xprt_rdma.h |1 -
 4 files changed, 1 insertion(+), 14 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 98a3b95..3f422ca 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -770,7 +770,6 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
rep->rr_len);
 repost:
r_xprt->rx_stats.bad_reply_count++;
-   rep->rr_func = rpcrdma_reply_handler;
if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, &r_xprt->rx_ep, rep))
rpcrdma_recv_buffer_put(rep);
 
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 25f7a6e..7c12556 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -627,9 +627,6 @@ xprt_rdma_send_request(struct rpc_task *task)
 
if (req->rl_reply == NULL)  /* e.g. reconnection */
rpcrdma_recv_buffer_get(req);
-   /* rpcrdma_recv_buffer_get may have set rl_reply, so check again */
-   if (req->rl_reply)
-   req->rl_reply->rr_func = rpcrdma_reply_handler;
 
/* Must suppress retransmit to maintain credits */
if (req->rl_connect_cookie == xprt->connect_cookie)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index c55bfbc..8e0bd84 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -80,7 +80,6 @@ static void
 rpcrdma_run_tasklet(unsigned long data)
 {
struct rpcrdma_rep *rep;
-   void (*func)(struct rpcrdma_rep *);
unsigned long flags;
 
data = data;
@@ -89,14 +88,9 @@ rpcrdma_run_tasklet(unsigned long data)
rep = list_entry(rpcrdma_tasklets_g.next,
 struct rpcrdma_rep, rr_list);
list_del(&rep->rr_list);
-   func = rep->rr_func;
-   rep->rr_func = NULL;
spin_unlock_irqrestore(&rpcrdma_tk_lock_g, flags);
 
-   if (func)
-   func(rep);
-   else
-   rpcrdma_recv_buffer_put(rep);
+   rpcrdma_reply_handler(rep);
 
spin_lock_irqsave(&rpcrdma_tk_lock_g, flags);
}
@@ -1213,7 +1207,6 @@ rpcrdma_buffer_put_sendbuf(struct rpcrdma_req *req, struct rpcrdma_buffer *buf)
req->rl_niovs = 0;
if (req->rl_reply) {
buf->rb_recv_bufs[--buf->rb_recv_index] = req->rl_reply;
-   req->rl_reply->rr_func = NULL;
req->rl_reply = NULL;
}
 }
@@ -1428,7 +1421,6 @@ rpcrdma_recv_buffer_put(struct rpcrdma_rep *rep)
struct rpcrdma_buffer *buffers = &rep->rr_rxprt->rx_buf;
unsigned long flags;
 
-   rep->rr_func = NULL;
spin_lock_irqsave(&buffers->rb_lock, flags);
buffers->rb_recv_bufs[--buffers->rb_recv_index] = rep;
spin_unlock_irqrestore(&buffers->rb_lock, flags);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index c3d57c0..230e7fe 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -174,7 +174,6 @@ struct rpcrdma_buffer;
 struct rpcrdma_rep {
	unsigned int		rr_len;
	struct rpcrdma_xprt	*rr_rxprt;
-	void			(*rr_func)(struct rpcrdma_rep *);
	struct list_head	rr_list;
struct rpcrdma_regbuf   *rr_rdmabuf;
 };



[PATCH v3 08/17] xprtrdma: Introduce an FRMR recovery workqueue

2015-05-26 Thread Chuck Lever
After a transport disconnect, FRMRs can be left in an undetermined
state. In particular, the MR's rkey is no good.

Currently, FRMRs are fixed up by the transport connect worker, but
that can race with ->ro_unmap if an RPC happens to exit while the
transport connect worker is running.

A better way of dealing with broken FRMRs is to detect them before
they are re-used by ->ro_map. Such FRMRs are either already invalid
or are owned by the sending RPC, and thus no race with ->ro_unmap
is possible.

Introduce a mechanism for handing broken FRMRs to a workqueue to be
reset in a context that is appropriate for allocating resources
(i.e., an ib_alloc_fast_reg_mr() API call).

This mechanism is not yet used, but will be in subsequent patches.

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
Reviewed-By: Devesh Sharma 
Tested-By: Devesh Sharma 
Reviewed-by: Doug Ledford 
---
 net/sunrpc/xprtrdma/frwr_ops.c  |   71 ++-
 net/sunrpc/xprtrdma/transport.c |   11 +-
 net/sunrpc/xprtrdma/xprt_rdma.h |5 +++
 3 files changed, 84 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 66a85fa..a06d9a3 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -17,6 +17,74 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+static struct workqueue_struct *frwr_recovery_wq;
+
+#define FRWR_RECOVERY_WQ_FLAGS (WQ_UNBOUND | WQ_MEM_RECLAIM)
+
+int
+frwr_alloc_recovery_wq(void)
+{
+   frwr_recovery_wq = alloc_workqueue("frwr_recovery",
+  FRWR_RECOVERY_WQ_FLAGS, 0);
+   return !frwr_recovery_wq ? -ENOMEM : 0;
+}
+
+void
+frwr_destroy_recovery_wq(void)
+{
+   struct workqueue_struct *wq;
+
+   if (!frwr_recovery_wq)
+   return;
+
+   wq = frwr_recovery_wq;
+   frwr_recovery_wq = NULL;
+   destroy_workqueue(wq);
+}
+
+/* Deferred reset of a single FRMR. Generate a fresh rkey by
+ * replacing the MR.
+ *
+ * There's no recovery if this fails. The FRMR is abandoned, but
+ * remains in rb_all. It will be cleaned up when the transport is
+ * destroyed.
+ */
+static void
+__frwr_recovery_worker(struct work_struct *work)
+{
+   struct rpcrdma_mw *r = container_of(work, struct rpcrdma_mw,
+   r.frmr.fr_work);
+   struct rpcrdma_xprt *r_xprt = r->r.frmr.fr_xprt;
+   unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
+   struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
+
+   if (ib_dereg_mr(r->r.frmr.fr_mr))
+   goto out_fail;
+
+   r->r.frmr.fr_mr = ib_alloc_fast_reg_mr(pd, depth);
+   if (IS_ERR(r->r.frmr.fr_mr))
+   goto out_fail;
+
+   dprintk("RPC:   %s: recovered FRMR %p\n", __func__, r);
+   r->r.frmr.fr_state = FRMR_IS_INVALID;
+   rpcrdma_put_mw(r_xprt, r);
+   return;
+
+out_fail:
+   pr_warn("RPC:   %s: FRMR %p unrecovered\n",
+   __func__, r);
+}
+
+/* A broken MR was discovered in a context that can't sleep.
+ * Defer recovery to the recovery worker.
+ */
+static void
+__frwr_queue_recovery(struct rpcrdma_mw *r)
+{
+   INIT_WORK(&r->r.frmr.fr_work, __frwr_recovery_worker);
+   queue_work(frwr_recovery_wq, &r->r.frmr.fr_work);
+}
+
 static int
 __frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, struct ib_device *device,
unsigned int depth)
@@ -128,7 +196,7 @@ frwr_sendcompletion(struct ib_wc *wc)
 
/* WARNING: Only wr_id and status are reliable at this point */
r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
-   dprintk("RPC:   %s: frmr %p (stale), status %d\n",
+   pr_warn("RPC:   %s: frmr %p flushed, status %d\n",
__func__, r, wc->status);
r->r.frmr.fr_state = FRMR_IS_STALE;
 }
@@ -165,6 +233,7 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
list_add(&r->mw_list, &buf->rb_mws);
list_add(&r->mw_all, &buf->rb_all);
r->mw_sendcompletion = frwr_sendcompletion;
+   r->r.frmr.fr_xprt = r_xprt;
}
 
return 0;
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 7c12556..6f8943c 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -731,17 +731,24 @@ static void __exit xprt_rdma_cleanup(void)
if (rc)
dprintk("RPC:   %s: xprt_unregister returned %i\n",
__func__, rc);
+
+   frwr_destroy_recovery_wq();
 }
 
 static int __init xprt_rdma_init(void)
 {
int rc;
 
-   rc = xprt_register_transport(&xprt_rdma);
-
+   rc = frwr_alloc_recovery_wq();
if (rc)
return rc;
 
+   rc = xprt_register_transport(&xprt_rdma);

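The hunk ends abruptly above. Given the frwr_destroy_recovery_wq() call
added to xprt_rdma_cleanup() by this same patch, the registration error
path presumably unwinds the new workqueue; a sketch of the likely
remainder (an inference, not the verbatim hunk):

	if (rc) {
		frwr_destroy_recovery_wq();
		return rc;
	}

	return 0;
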
[PATCH v3 09/17] xprtrdma: Acquire MRs in rpcrdma_register_external()

2015-05-26 Thread Chuck Lever
Acquiring 64 MRs in rpcrdma_buffer_get() while holding the buffer
pool lock is expensive, and unnecessary because most modern adapters
can transfer 100s of KBs of payload using just a single MR.

Instead, acquire MRs one-at-a-time as chunks are registered, and
return them to rb_mws immediately during deregistration.

Note: commit 539431a437d2 ("xprtrdma: Don't invalidate FRMRs if
registration fails") is reverted: There is now a valid case where
registration can fail (with -ENOMEM) but the QP is still in RTS.

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
Tested-By: Devesh Sharma 
Reviewed-by: Doug Ledford 
---
 net/sunrpc/xprtrdma/frwr_ops.c |  100 +++-
 net/sunrpc/xprtrdma/rpc_rdma.c |3 -
 net/sunrpc/xprtrdma/verbs.c|   21 
 3 files changed, 89 insertions(+), 35 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index a06d9a3..133edf6 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -11,6 +11,62 @@
  * but most complex memory registration mode.
  */
 
+/* Normal operation
+ *
+ * A Memory Region is prepared for RDMA READ or WRITE using a FAST_REG
+ * Work Request (frmr_op_map). When the RDMA operation is finished, this
+ * Memory Region is invalidated using a LOCAL_INV Work Request
+ * (frmr_op_unmap).
+ *
+ * Typically these Work Requests are not signaled, and neither are RDMA
+ * SEND Work Requests (with the exception of signaling occasionally to
+ * prevent provider work queue overflows). This greatly reduces HCA
+ * interrupt workload.
+ *
+ * As an optimization, frwr_op_unmap marks MRs INVALID before the
+ * LOCAL_INV WR is posted. If posting succeeds, the MR is placed on
+ * rb_mws immediately so that no work (like managing a linked list
+ * under a spinlock) is needed in the completion upcall.
+ *
+ * But this means that frwr_op_map() can occasionally encounter an MR
+ * that is INVALID but the LOCAL_INV WR has not completed. Work Queue
+ * ordering prevents a subsequent FAST_REG WR from executing against
+ * that MR while it is still being invalidated.
+ */
+
+/* Transport recovery
+ *
+ * ->op_map and the transport connect worker cannot run at the same
+ * time, but ->op_unmap can fire while the transport connect worker
+ * is running. Thus MR recovery is handled in ->op_map, to guarantee
+ * that recovered MRs are owned by a sending RPC, and not one where
+ * ->op_unmap could fire at the same time transport reconnect is
+ * being done.
+ *
+ * When the underlying transport disconnects, MRs are left in one of
+ * three states:
+ *
+ * INVALID:The MR was not in use before the QP entered ERROR state.
+ * (Or, the LOCAL_INV WR has not completed or flushed yet).
+ *
+ * STALE:  The MR was being registered or unregistered when the QP
+ * entered ERROR state, and the pending WR was flushed.
+ *
+ * VALID:  The MR was registered before the QP entered ERROR state.
+ *
+ * When frwr_op_map encounters STALE and VALID MRs, they are recovered
+ * with ib_dereg_mr and then are re-initialized. Because MR recovery
+ * allocates fresh resources, it is deferred to a workqueue, and the
+ * recovered MRs are placed back on the rb_mws list when recovery is
+ * complete. frwr_op_map allocates another MR for the current RPC while
+ * the broken MR is reset.
+ *
+ * To ensure that frwr_op_map doesn't encounter an MR that is marked
+ * INVALID but that is about to be flushed due to a previous transport
+ * disconnect, the transport connect worker attempts to drain all
+ * pending send queue WRs before the transport is reconnected.
+ */
+
 #include "xprt_rdma.h"
 
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
@@ -250,9 +306,9 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
struct ib_device *device = ia->ri_device;
enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
-   struct rpcrdma_mw *mw = seg1->rl_mw;
-   struct rpcrdma_frmr *frmr = &mw->r.frmr;
-   struct ib_mr *mr = frmr->fr_mr;
+   struct rpcrdma_mw *mw;
+   struct rpcrdma_frmr *frmr;
+   struct ib_mr *mr;
struct ib_send_wr fastreg_wr, *bad_wr;
u8 key;
int len, pageoff;
@@ -261,12 +317,25 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
u64 pa;
int page_no;
 
+   mw = seg1->rl_mw;
+   seg1->rl_mw = NULL;
+   do {
+   if (mw)
+   __frwr_queue_recovery(mw);
+   mw = rpcrdma_get_mw(r_xprt);
+   if (!mw)
+   return -ENOMEM;
+   } while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
+   frmr = &mw->r.frmr;
+   frmr->fr_state = FRMR_IS_VALID;
+
pageoff = offset_in_page(seg1->mr_offset);
	seg1->mr_offset -= pageoff;	/* start of page */

[PATCH v3 12/17] xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy

2015-05-26 Thread Chuck Lever
Clean up: This field is no longer used.

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
Reviewed-by: Sagi Grimberg 
Reviewed-by: Devesh Sharma 
Tested-By: Devesh Sharma 
Reviewed-by: Doug Ledford 
---
 include/linux/sunrpc/xprtrdma.h |3 ++-
 net/sunrpc/xprtrdma/verbs.c |3 ---
 net/sunrpc/xprtrdma/xprt_rdma.h |1 -
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/xprtrdma.h b/include/linux/sunrpc/xprtrdma.h
index c984c85..b176130 100644
--- a/include/linux/sunrpc/xprtrdma.h
+++ b/include/linux/sunrpc/xprtrdma.h
@@ -56,7 +56,8 @@
 
 #define RPCRDMA_INLINE_PAD_THRESH  (512)/* payload threshold to pad (bytes) */
 
-/* memory registration strategies */
+/* Memory registration strategies, by number.
+ * This is part of a kernel / user space API. Do not remove. */
 enum rpcrdma_memreg {
RPCRDMA_BOUNCEBUFFERS = 0,
RPCRDMA_REGISTER,
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index db9303a..cc1a526 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -665,9 +665,6 @@ rpcrdma_ia_open(struct rpcrdma_xprt *xprt, struct sockaddr *addr, int memreg)
dprintk("RPC:   %s: memory registration strategy is '%s'\n",
__func__, ia->ri_ops->ro_displayname);
 
-   /* Else will do memory reg/dereg for each chunk */
-   ia->ri_memreg_strategy = memreg;
-
rwlock_init(&ia->ri_qplock);
return 0;
 
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index f19376d..3ecee38 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -70,7 +70,6 @@ struct rpcrdma_ia {
int ri_have_dma_lkey;
struct completion   ri_done;
int ri_async_rc;
-   enum rpcrdma_memreg ri_memreg_strategy;
unsigned intri_max_frmr_depth;
struct ib_device_attr   ri_devattr;
struct ib_qp_attr   ri_qp_attr;



[PATCH v3 16/17] SUNRPC: Clean up bc_send()

2015-05-26 Thread Chuck Lever
Clean up: Merge bc_send() into bc_svc_process().

Note: even though this touches svc.c, it is a client-side change.

Signed-off-by: Chuck Lever 
Tested-By: Devesh Sharma 
Reviewed-by: Doug Ledford 
---
 include/linux/sunrpc/bc_xprt.h |1 -
 net/sunrpc/Makefile|2 +
 net/sunrpc/bc_svc.c|   63 
 net/sunrpc/svc.c   |   33 -
 4 files changed, 26 insertions(+), 73 deletions(-)
 delete mode 100644 net/sunrpc/bc_svc.c

diff --git a/include/linux/sunrpc/bc_xprt.h b/include/linux/sunrpc/bc_xprt.h
index 2ca67b5..8df43c9f 100644
--- a/include/linux/sunrpc/bc_xprt.h
+++ b/include/linux/sunrpc/bc_xprt.h
@@ -37,7 +37,6 @@ void xprt_complete_bc_request(struct rpc_rqst *req, uint32_t copied);
 void xprt_free_bc_request(struct rpc_rqst *req);
 int xprt_setup_backchannel(struct rpc_xprt *, unsigned int min_reqs);
 void xprt_destroy_backchannel(struct rpc_xprt *, unsigned int max_reqs);
-int bc_send(struct rpc_rqst *req);
 
 /*
  * Determine if a shared backchannel is in use
diff --git a/net/sunrpc/Makefile b/net/sunrpc/Makefile
index 15e6f6c..1b8e68d 100644
--- a/net/sunrpc/Makefile
+++ b/net/sunrpc/Makefile
@@ -15,6 +15,6 @@ sunrpc-y := clnt.o xprt.o socklib.o xprtsock.o sched.o \
sunrpc_syms.o cache.o rpc_pipe.o \
svc_xprt.o
 sunrpc-$(CONFIG_SUNRPC_DEBUG) += debugfs.o
-sunrpc-$(CONFIG_SUNRPC_BACKCHANNEL) += backchannel_rqst.o bc_svc.o
+sunrpc-$(CONFIG_SUNRPC_BACKCHANNEL) += backchannel_rqst.o
 sunrpc-$(CONFIG_PROC_FS) += stats.o
 sunrpc-$(CONFIG_SYSCTL) += sysctl.o
diff --git a/net/sunrpc/bc_svc.c b/net/sunrpc/bc_svc.c
deleted file mode 100644
index 15c7a8a..000
--- a/net/sunrpc/bc_svc.c
+++ /dev/null
@@ -1,63 +0,0 @@
-/**
-
-(c) 2007 Network Appliance, Inc.  All Rights Reserved.
-(c) 2009 NetApp.  All Rights Reserved.
-
-NetApp provides this source code under the GPL v2 License.
-The GPL v2 license is available at
-http://opensource.org/licenses/gpl-license.php.
-
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
-"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
-LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
-A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
-CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
-PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
-LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
-NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-
-**/
-
-/*
- * The NFSv4.1 callback service helper routines.
- * They implement the transport level processing required to send the
- * reply over an existing open connection previously established by the client.
- */
-
-#include 
-
-#include 
-#include 
-#include 
-
-#define RPCDBG_FACILITY	RPCDBG_SVCDSP
-
-/* Empty callback ops */
-static const struct rpc_call_ops nfs41_callback_ops = {
-};
-
-
-/*
- * Send the callback reply
- */
-int bc_send(struct rpc_rqst *req)
-{
-   struct rpc_task *task;
-   int ret;
-
-   dprintk("RPC:   bc_send req= %p\n", req);
-   task = rpc_run_bc_task(req, &nfs41_callback_ops);
-   if (IS_ERR(task))
-   ret = PTR_ERR(task);
-   else {
-   WARN_ON_ONCE(atomic_read(&task->tk_count) != 1);
-   ret = task->tk_status;
-   rpc_put_task(task);
-   }
-   dprintk("RPC:   bc_send ret= %d\n", ret);
-   return ret;
-}
-
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 78974e4..e144902 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1350,6 +1350,11 @@ bc_svc_process(struct svc_serv *serv, struct rpc_rqst *req,
 {
struct kvec *argv = &rqstp->rq_arg.head[0];
struct kvec *resv = &rqstp->rq_res.head[0];
+   static const struct rpc_call_ops reply_ops = { };
+   struct rpc_task *task;
+   int error;
+
+   dprintk("svc: %s(%p)\n", __func__, req);
 
/* Build the svc_rqst used by the common processing routine */
rqstp->rq_xprt = serv->sv_bc_xprt;
@@ -1372,21 +1377,33 @@ bc_svc_process(struct svc_serv *serv, struct rpc_rqst *req,
 
/*
 * Skip the next two words because they've already been
-* processed in the trasport
+* processed in the transport
 */
svc_getu32(argv);   /* XID */
	svc_getnl(argv);	/* CALLDIR */
 
-   /* Returns 1 for send, 0 for drop */
-   if (svc_process_commo

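The hunk is cut off above. Based on the bc_send() body deleted by this
same patch, the merged tail of bc_svc_process() presumably runs the
backchannel task and harvests its status roughly as follows (a sketch
reconstructed from the deleted code, not the verbatim hunk):

	/* Returns 1 for send, 0 for drop */
	if (svc_process_common(rqstp, argv, resv)) {
		memcpy(&req->rq_snd_buf, &rqstp->rq_res,
		       sizeof(req->rq_snd_buf));
		task = rpc_run_bc_task(req, &reply_ops);
		if (IS_ERR(task)) {
			error = PTR_ERR(task);
		} else {
			WARN_ON_ONCE(atomic_read(&task->tk_count) != 1);
			error = task->tk_status;
			rpc_put_task(task);
		}
	}
	dprintk("svc: %s(), error=%d\n", __func__, error);
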
[PATCH v3 06/17] xprtrdma: Introduce helpers for allocating MWs

2015-05-26 Thread Chuck Lever
We eventually want to handle allocating MWs one at a time, as
needed, instead of grabbing 64 and throwing them at each RPC in the
pipeline.

Add a helper for grabbing an MW off rb_mws, and a helper for
returning an MW to rb_mws. These will be used in a subsequent patch.

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
Reviewed-by: Sagi Grimberg 
Tested-By: Devesh Sharma 
Reviewed-by: Doug Ledford 
---
 net/sunrpc/xprtrdma/verbs.c |   31 +++
 net/sunrpc/xprtrdma/xprt_rdma.h |2 ++
 2 files changed, 33 insertions(+)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index ddd5b36..b7ca73e 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1173,6 +1173,37 @@ rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
kfree(buf->rb_pool);
 }
 
+struct rpcrdma_mw *
+rpcrdma_get_mw(struct rpcrdma_xprt *r_xprt)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   struct rpcrdma_mw *mw = NULL;
+   unsigned long flags;
+
+   spin_lock_irqsave(&buf->rb_lock, flags);
+   if (!list_empty(&buf->rb_mws)) {
+   mw = list_first_entry(&buf->rb_mws,
+ struct rpcrdma_mw, mw_list);
+   list_del_init(&mw->mw_list);
+   }
+   spin_unlock_irqrestore(&buf->rb_lock, flags);
+
+   if (!mw)
+   pr_err("RPC:   %s: no MWs available\n", __func__);
+   return mw;
+}
+
+void
+rpcrdma_put_mw(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mw *mw)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   unsigned long flags;
+
+   spin_lock_irqsave(&buf->rb_lock, flags);
+   list_add_tail(&mw->mw_list, &buf->rb_mws);
+   spin_unlock_irqrestore(&buf->rb_lock, flags);
+}
+
 /* "*mw" can be NULL when rpcrdma_buffer_get_mrs() fails, leaving
  * some req segments uninitialized.
  */
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 300423d..5b801d5 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -413,6 +413,8 @@ int rpcrdma_ep_post_recv(struct rpcrdma_ia *, struct rpcrdma_ep *,
 int rpcrdma_buffer_create(struct rpcrdma_xprt *);
 void rpcrdma_buffer_destroy(struct rpcrdma_buffer *);
 
+struct rpcrdma_mw *rpcrdma_get_mw(struct rpcrdma_xprt *);
+void rpcrdma_put_mw(struct rpcrdma_xprt *, struct rpcrdma_mw *);
 struct rpcrdma_req *rpcrdma_buffer_get(struct rpcrdma_buffer *);
 void rpcrdma_buffer_put(struct rpcrdma_req *);
 void rpcrdma_recv_buffer_get(struct rpcrdma_req *);

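A minimal sketch of the intended calling pattern, mirroring how the
registration paths converted later in this series use these helpers:

	struct rpcrdma_mw *mw;

	mw = rpcrdma_get_mw(r_xprt);	/* take one MW off rb_mws */
	if (!mw)
		return -ENOMEM;

	/* ... register one chunk with mw ... */

	rpcrdma_put_mw(r_xprt, mw);	/* return it after invalidation */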


[PATCH v3 07/17] xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external()

2015-05-26 Thread Chuck Lever
Acquiring 64 FMRs in rpcrdma_buffer_get() while holding the buffer
pool lock is expensive, and unnecessary because FMR mode can
transfer up to a 1MB payload using just a single ib_fmr.

Instead, acquire ib_fmrs one-at-a-time as chunks are registered, and
return them to rb_mws immediately during deregistration.

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
Tested-By: Devesh Sharma 
Reviewed-by: Doug Ledford 
---
 net/sunrpc/xprtrdma/fmr_ops.c |   52 ++---
 net/sunrpc/xprtrdma/verbs.c   |   26 -
 2 files changed, 48 insertions(+), 30 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 0a96155..53fb649 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -11,6 +11,21 @@
  * can take tens of usecs to complete.
  */
 
+/* Normal operation
+ *
+ * A Memory Region is prepared for RDMA READ or WRITE using the
+ * ib_map_phys_fmr verb (fmr_op_map). When the RDMA operation is
+ * finished, the Memory Region is unmapped using the ib_unmap_fmr
+ * verb (fmr_op_unmap).
+ */
+
+/* Transport recovery
+ *
+ * After a transport reconnect, fmr_op_map re-uses the MR already
+ * allocated for the RPC, but generates a fresh rkey then maps the
+ * MR again. This process is synchronous.
+ */
+
 #include "xprt_rdma.h"
 
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
@@ -77,6 +92,15 @@ out_fmr_err:
return rc;
 }
 
+static int
+__fmr_unmap(struct rpcrdma_mw *r)
+{
+   LIST_HEAD(l);
+
+   list_add(&r->r.fmr->list, &l);
+   return ib_unmap_fmr(&l);
+}
+
 /* Use the ib_map_phys_fmr() verb to register a memory region
  * for remote access via RDMA READ or RDMA WRITE.
  */
@@ -88,9 +112,22 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
struct ib_device *device = ia->ri_device;
enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
-   struct rpcrdma_mw *mw = seg1->rl_mw;
u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
int len, pageoff, i, rc;
+   struct rpcrdma_mw *mw;
+
+   mw = seg1->rl_mw;
+   seg1->rl_mw = NULL;
+   if (!mw) {
+   mw = rpcrdma_get_mw(r_xprt);
+   if (!mw)
+   return -ENOMEM;
+   } else {
+   /* this is a retransmit; generate a fresh rkey */
+   rc = __fmr_unmap(mw);
+   if (rc)
+   return rc;
+   }
 
pageoff = offset_in_page(seg1->mr_offset);
seg1->mr_offset -= pageoff; /* start of page */
@@ -114,6 +151,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
if (rc)
goto out_maperr;
 
+   seg1->rl_mw = mw;
seg1->mr_rkey = mw->r.fmr->rkey;
seg1->mr_base = seg1->mr_dma + pageoff;
seg1->mr_nsegs = i;
@@ -137,18 +175,24 @@ fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
 {
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mr_seg *seg1 = seg;
+   struct rpcrdma_mw *mw = seg1->rl_mw;
int rc, nsegs = seg->mr_nsegs;
-   LIST_HEAD(l);
 
-   list_add(&seg1->rl_mw->r.fmr->list, &l);
-   rc = ib_unmap_fmr(&l);
+   dprintk("RPC:   %s: FMR %p\n", __func__, mw);
+
+   seg1->rl_mw = NULL;
while (seg1->mr_nsegs--)
rpcrdma_unmap_one(ia->ri_device, seg++);
+   rc = __fmr_unmap(mw);
if (rc)
goto out_err;
+   rpcrdma_put_mw(r_xprt, mw);
return nsegs;
 
 out_err:
+   /* The FMR is abandoned, but remains in rb_all. fmr_op_destroy
+* will attempt to release it when the transport is destroyed.
+*/
dprintk("RPC:   %s: ib_unmap_fmr status %i\n", __func__, rc);
return nsegs;
 }
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index b7ca73e..3188e36 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1324,28 +1324,6 @@ rpcrdma_buffer_get_frmrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf,
return NULL;
 }
 
-static struct rpcrdma_req *
-rpcrdma_buffer_get_fmrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf)
-{
-   struct rpcrdma_mw *r;
-   int i;
-
-   i = RPCRDMA_MAX_SEGS - 1;
-   while (!list_empty(&buf->rb_mws)) {
-   r = list_entry(buf->rb_mws.next,
-  struct rpcrdma_mw, mw_list);
-   list_del(&r->mw_list);
-   req->rl_segments[i].rl_mw = r;
-   if (unlikely(i-- == 0))
-   return req; /* Success */
-   }
-
-   /* Not enough entries on rb_mws for this req */
-   rpcrdma_buffer_put_sendbuf(req, buf);
-   rpcrdma_buffer_put_mrs(req, buf);
-   return NULL;
-

[PATCH v3 13/17] xprtrdma: Split rb_lock

2015-05-26 Thread Chuck Lever
/proc/lock_stat showed contention between rpcrdma_buffer_get/put
and the MR allocation functions during I/O intensive workloads.

Now that MRs are no longer allocated in rpcrdma_buffer_get(),
there's no reason the rb_mws list has to be managed using the
same lock as the send/receive buffers. Split that lock. The
new lock does not need to disable interrupts because buffer
get/put is never called in an interrupt context.

struct rpcrdma_buffer is re-arranged to ensure rb_mwlock and rb_mws
are always in a different cacheline than rb_lock and the buffer
pointers.

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
Reviewed-by: Sagi Grimberg 
Tested-By: Devesh Sharma 
Reviewed-by: Doug Ledford 
---
 net/sunrpc/xprtrdma/fmr_ops.c   |1 +
 net/sunrpc/xprtrdma/frwr_ops.c  |1 +
 net/sunrpc/xprtrdma/verbs.c |   10 --
 net/sunrpc/xprtrdma/xprt_rdma.h |   16 +---
 4 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 5dd77da..52f9ad5 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -65,6 +65,7 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
struct rpcrdma_mw *r;
int i, rc;
 
+   spin_lock_init(&buf->rb_mwlock);
INIT_LIST_HEAD(&buf->rb_mws);
INIT_LIST_HEAD(&buf->rb_all);
 
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 8622792..18b7305 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -266,6 +266,7 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
int i;
 
+   spin_lock_init(&buf->rb_mwlock);
INIT_LIST_HEAD(&buf->rb_mws);
INIT_LIST_HEAD(&buf->rb_all);
 
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index cc1a526..2340835 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1173,15 +1173,14 @@ rpcrdma_get_mw(struct rpcrdma_xprt *r_xprt)
 {
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
struct rpcrdma_mw *mw = NULL;
-   unsigned long flags;
 
-   spin_lock_irqsave(&buf->rb_lock, flags);
+   spin_lock(&buf->rb_mwlock);
if (!list_empty(&buf->rb_mws)) {
mw = list_first_entry(&buf->rb_mws,
  struct rpcrdma_mw, mw_list);
list_del_init(&mw->mw_list);
}
-   spin_unlock_irqrestore(&buf->rb_lock, flags);
+   spin_unlock(&buf->rb_mwlock);
 
if (!mw)
pr_err("RPC:   %s: no MWs available\n", __func__);
@@ -1192,11 +1191,10 @@ void
 rpcrdma_put_mw(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mw *mw)
 {
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
-   unsigned long flags;
 
-   spin_lock_irqsave(&buf->rb_lock, flags);
+   spin_lock(&buf->rb_mwlock);
list_add_tail(&mw->mw_list, &buf->rb_mws);
-   spin_unlock_irqrestore(&buf->rb_lock, flags);
+   spin_unlock(&buf->rb_mwlock);
 }
 
 static void
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 3ecee38..df92884 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -282,15 +282,17 @@ rpcr_to_rdmar(struct rpc_rqst *rqst)
  * One of these is associated with a transport instance
  */
 struct rpcrdma_buffer {
-	spinlock_t	rb_lock;	/* protects indexes */
-	u32		rb_max_requests;/* client max requests */
-	struct list_head rb_mws;	/* optional memory windows/fmrs/frmrs */
-	struct list_head rb_all;
-	int		rb_send_index;
+	spinlock_t	rb_mwlock;	/* protect rb_mws list */
+	struct list_head	rb_mws;
+	struct list_head	rb_all;
+   char*rb_pool;
+
+   spinlock_t  rb_lock;/* protect buf arrays */
+   u32 rb_max_requests;
+   int rb_send_index;
+   int rb_recv_index;
struct rpcrdma_req  **rb_send_bufs;
-   int rb_recv_index;
struct rpcrdma_rep  **rb_recv_bufs;
-   char*rb_pool;
 };
 #define rdmab_to_ia(b) (&container_of((b), struct rpcrdma_xprt, rx_buf)->rx_ia)
 



[PATCH v3 15/17] xprtrdma: Reduce per-transport MR allocation

2015-05-26 Thread Chuck Lever
Reduce resource consumption per-transport to make way for increasing
the credit limit and maximum r/wsize. Pre-allocate fewer MRs.

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
Reviewed-by: Sagi Grimberg 
Reviewed-by: Devesh Sharma 
Tested-By: Devesh Sharma 
Reviewed-by: Doug Ledford 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |6 --
 net/sunrpc/xprtrdma/frwr_ops.c |6 --
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 4a53ad5..f1e8daf 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -69,8 +69,10 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
INIT_LIST_HEAD(&buf->rb_mws);
INIT_LIST_HEAD(&buf->rb_all);
 
-   i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
-   dprintk("RPC:   %s: initializing %d FMRs\n", __func__, i);
+   i = max_t(int, RPCRDMA_MAX_DATA_SEGS / RPCRDMA_MAX_FMR_SGES, 1);
+   i += 2; /* head + tail */
+   i *= buf->rb_max_requests;  /* one set for each RPC slot */
+   dprintk("RPC:   %s: initalizing %d FMRs\n", __func__, i);
 
rc = -ENOMEM;
while (i--) {
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 18b7305..661fbc1 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -270,8 +270,10 @@ frwr_op_init(struct rpcrdma_xprt *r_xprt)
INIT_LIST_HEAD(&buf->rb_mws);
INIT_LIST_HEAD(&buf->rb_all);
 
-   i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
-   dprintk("RPC:   %s: initializing %d FRMRs\n", __func__, i);
+   i = max_t(int, RPCRDMA_MAX_DATA_SEGS / depth, 1);
+   i += 2; /* head + tail */
+   i *= buf->rb_max_requests;  /* one set for each RPC slot */
+   dprintk("RPC:   %s: initalizing %d FRMRs\n", __func__, i);
 
while (i--) {
struct rpcrdma_mw *r;

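To make the new sizing concrete for FMR, assume the defaults of this
era (illustrative values, not stated in this patch):
RPCRDMA_MAX_DATA_SEGS = 64, RPCRDMA_MAX_FMR_SGES = 64, and
rb_max_requests = 32. Then:

	i  = max(64 / 64, 1);	/* = 1: one FMR covers a full payload */
	i += 2;			/* head + tail iovecs */
	i *= 32;		/* one set for each RPC slot */
	/* i = 96 MRs total, versus (32 + 1) * RPCRDMA_MAX_SEGS --
	 * on the order of two thousand -- with the old formula */
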


[PATCH v3 05/17] xprtrdma: Use ib_device pointer safely

2015-05-26 Thread Chuck Lever
The connect worker can replace ri_id, but prevents ri_id->device
from changing during the lifetime of a transport instance. The old
ID is kept around until a new ID is created and the ->device is
confirmed to be the same.

Cache a copy of ri_id->device in rpcrdma_ia and in rpcrdma_rep.
The cached copy can be used safely in code that does not serialize
with the connect worker.

Other code can use it to save an extra address generation (one
pointer dereference instead of two).

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
Tested-By: Devesh Sharma 
Reviewed-by: Doug Ledford 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |8 +
 net/sunrpc/xprtrdma/frwr_ops.c |   12 +++
 net/sunrpc/xprtrdma/physical_ops.c |8 +
 net/sunrpc/xprtrdma/verbs.c|   61 +++-
 net/sunrpc/xprtrdma/xprt_rdma.h|2 +
 5 files changed, 43 insertions(+), 48 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 302d4eb..0a96155 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -85,7 +85,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
   int nsegs, bool writing)
 {
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
-   struct ib_device *device = ia->ri_id->device;
+   struct ib_device *device = ia->ri_device;
enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
struct rpcrdma_mw *mw = seg1->rl_mw;
@@ -137,17 +137,13 @@ fmr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
 {
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mr_seg *seg1 = seg;
-   struct ib_device *device;
int rc, nsegs = seg->mr_nsegs;
LIST_HEAD(l);
 
list_add(&seg1->rl_mw->r.fmr->list, &l);
rc = ib_unmap_fmr(&l);
-   read_lock(&ia->ri_qplock);
-   device = ia->ri_id->device;
while (seg1->mr_nsegs--)
-   rpcrdma_unmap_one(device, seg++);
-   read_unlock(&ia->ri_qplock);
+   rpcrdma_unmap_one(ia->ri_device, seg++);
if (rc)
goto out_err;
return nsegs;
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index dff0481..66a85fa 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -137,7 +137,7 @@ static int
 frwr_op_init(struct rpcrdma_xprt *r_xprt)
 {
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
-   struct ib_device *device = r_xprt->rx_ia.ri_id->device;
+   struct ib_device *device = r_xprt->rx_ia.ri_device;
unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
int i;
@@ -178,7 +178,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
int nsegs, bool writing)
 {
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
-   struct ib_device *device = ia->ri_id->device;
+   struct ib_device *device = ia->ri_device;
enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
struct rpcrdma_mw *mw = seg1->rl_mw;
@@ -263,7 +263,6 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct ib_send_wr invalidate_wr, *bad_wr;
int rc, nsegs = seg->mr_nsegs;
-   struct ib_device *device;
 
seg1->rl_mw->r.frmr.fr_state = FRMR_IS_INVALID;
 
@@ -273,10 +272,9 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
invalidate_wr.ex.invalidate_rkey = seg1->rl_mw->r.frmr.fr_mr->rkey;
DECR_CQCOUNT(&r_xprt->rx_ep);
 
-   read_lock(&ia->ri_qplock);
-   device = ia->ri_id->device;
while (seg1->mr_nsegs--)
-   rpcrdma_unmap_one(device, seg++);
+   rpcrdma_unmap_one(ia->ri_device, seg++);
+   read_lock(&ia->ri_qplock);
rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
read_unlock(&ia->ri_qplock);
if (rc)
@@ -304,7 +302,7 @@ static void
 frwr_op_reset(struct rpcrdma_xprt *r_xprt)
 {
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
-   struct ib_device *device = r_xprt->rx_ia.ri_id->device;
+   struct ib_device *device = r_xprt->rx_ia.ri_device;
unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
struct rpcrdma_mw *r;
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index ba518af..da149e8 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -50,8 +50,7 @@ physical_op_map(struct rpcrdma_xprt *r_xpr

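The verbs.c hunk that establishes the cached pointer is truncated
above. Presumably it amounts to a one-time assignment once the
connection ID exists, along these lines (a sketch inferred from the
changelog, not the verbatim hunk):

	/* in rpcrdma_ia_open(), once ri_id has been created: */
	ia->ri_device = ia->ri_id->device;
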
[PATCH v3 11/17] xprtrdma: Remove ->ro_reset

2015-05-26 Thread Chuck Lever
An RPC can exit at any time. When it does so, xprt_rdma_free() is
called, and it calls ->op_unmap().

If ->ro_reset() is running due to a transport disconnect, the two
methods can race while processing the same rpcrdma_mw. The results
are unpredictable.

Because of this, in previous patches I've altered ->ro_map() to
handle MR reset. ->ro_reset() is no longer needed and can be
removed.

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
Reviewed-by: Sagi Grimberg 
Reviewed-by: Devesh Sharma 
Tested-By: Devesh Sharma 
Reviewed-by: Doug Ledford 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   23 -
 net/sunrpc/xprtrdma/frwr_ops.c |   39 
 net/sunrpc/xprtrdma/physical_ops.c |6 --
 net/sunrpc/xprtrdma/verbs.c|2 --
 net/sunrpc/xprtrdma/xprt_rdma.h|1 -
 5 files changed, 71 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 53fb649..5dd77da 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -197,28 +197,6 @@ out_err:
return nsegs;
 }
 
-/* After a disconnect, unmap all FMRs.
- *
- * This is invoked only in the transport connect worker in order
- * to serialize with rpcrdma_register_fmr_external().
- */
-static void
-fmr_op_reset(struct rpcrdma_xprt *r_xprt)
-{
-   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
-   struct rpcrdma_mw *r;
-   LIST_HEAD(list);
-   int rc;
-
-   list_for_each_entry(r, &buf->rb_all, mw_all)
-   list_add(&r->r.fmr->list, &list);
-
-   rc = ib_unmap_fmr(&list);
-   if (rc)
-   dprintk("RPC:   %s: ib_unmap_fmr failed %i\n",
-   __func__, rc);
-}
-
 static void
 fmr_op_destroy(struct rpcrdma_buffer *buf)
 {
@@ -242,7 +220,6 @@ const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
	.ro_open	= fmr_op_open,
	.ro_maxpages	= fmr_op_maxpages,
	.ro_init	= fmr_op_init,
-   .ro_reset   = fmr_op_reset,
.ro_destroy = fmr_op_destroy,
.ro_displayname = "fmr",
 };
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 133edf6..8622792 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -430,44 +430,6 @@ out_err:
return nsegs;
 }
 
-/* After a disconnect, a flushed FAST_REG_MR can leave an FRMR in
- * an unusable state. Find FRMRs in this state and dereg / reg
- * each.  FRMRs that are VALID and attached to an rpcrdma_req are
- * also torn down.
- *
- * This gives all in-use FRMRs a fresh rkey and leaves them INVALID.
- *
- * This is invoked only in the transport connect worker in order
- * to serialize with rpcrdma_register_frmr_external().
- */
-static void
-frwr_op_reset(struct rpcrdma_xprt *r_xprt)
-{
-   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
-   struct ib_device *device = r_xprt->rx_ia.ri_device;
-   unsigned int depth = r_xprt->rx_ia.ri_max_frmr_depth;
-   struct ib_pd *pd = r_xprt->rx_ia.ri_pd;
-   struct rpcrdma_mw *r;
-   int rc;
-
-   list_for_each_entry(r, &buf->rb_all, mw_all) {
-   if (r->r.frmr.fr_state == FRMR_IS_INVALID)
-   continue;
-
-   __frwr_release(r);
-   rc = __frwr_init(r, pd, device, depth);
-   if (rc) {
-   dprintk("RPC:   %s: mw %p left %s\n",
-   __func__, r,
-   (r->r.frmr.fr_state == FRMR_IS_STALE ?
-   "stale" : "valid"));
-   continue;
-   }
-
-   r->r.frmr.fr_state = FRMR_IS_INVALID;
-   }
-}
-
 static void
 frwr_op_destroy(struct rpcrdma_buffer *buf)
 {
@@ -490,7 +452,6 @@ const struct rpcrdma_memreg_ops rpcrdma_frwr_memreg_ops = {
	.ro_open	= frwr_op_open,
	.ro_maxpages	= frwr_op_maxpages,
	.ro_init	= frwr_op_init,
-   .ro_reset   = frwr_op_reset,
.ro_destroy = frwr_op_destroy,
.ro_displayname = "frwr",
 };
diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index da149e8..41985d0 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -69,11 +69,6 @@ physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
 }
 
 static void
-physical_op_reset(struct rpcrdma_xprt *r_xprt)
-{
-}
-
-static void
 physical_op_destroy(struct rpcrdma_buffer *buf)
 {
 }
@@ -84,7 +79,6 @@ const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {

[PATCH v3 14/17] xprtrdma: Stack relief in fmr_op_map()

2015-05-26 Thread Chuck Lever
fmr_op_map() declares a 64 element array of u64 in automatic
storage. This is 512 bytes (8 * 64) on the stack.

Instead, when FMR memory registration is in use, pre-allocate a
physaddr array for each rpcrdma_mw.

This is a pre-requisite for increasing the r/wsize maximum for
FMR on platforms with 4KB pages.

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
Reviewed-by: Sagi Grimberg 
Reviewed-by: Devesh Sharma 
Tested-By: Devesh Sharma 
Reviewed-by: Doug Ledford 
---
 net/sunrpc/xprtrdma/fmr_ops.c   |   32 ++--
 net/sunrpc/xprtrdma/xprt_rdma.h |7 ++-
 2 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index 52f9ad5..4a53ad5 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -72,13 +72,19 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
i = (buf->rb_max_requests + 1) * RPCRDMA_MAX_SEGS;
dprintk("RPC:   %s: initializing %d FMRs\n", __func__, i);
 
+   rc = -ENOMEM;
while (i--) {
r = kzalloc(sizeof(*r), GFP_KERNEL);
if (!r)
-   return -ENOMEM;
+   goto out;
+
+   r->r.fmr.physaddrs = kmalloc(RPCRDMA_MAX_FMR_SGES *
+sizeof(u64), GFP_KERNEL);
+   if (!r->r.fmr.physaddrs)
+   goto out_free;
 
-   r->r.fmr = ib_alloc_fmr(pd, mr_access_flags, &fmr_attr);
-   if (IS_ERR(r->r.fmr))
+   r->r.fmr.fmr = ib_alloc_fmr(pd, mr_access_flags, &fmr_attr);
+   if (IS_ERR(r->r.fmr.fmr))
goto out_fmr_err;
 
list_add(&r->mw_list, &buf->rb_mws);
@@ -87,9 +93,12 @@ fmr_op_init(struct rpcrdma_xprt *r_xprt)
return 0;
 
 out_fmr_err:
-   rc = PTR_ERR(r->r.fmr);
+   rc = PTR_ERR(r->r.fmr.fmr);
dprintk("RPC:   %s: ib_alloc_fmr status %i\n", __func__, rc);
+   kfree(r->r.fmr.physaddrs);
+out_free:
kfree(r);
+out:
return rc;
 }
 
@@ -98,7 +107,7 @@ __fmr_unmap(struct rpcrdma_mw *r)
 {
LIST_HEAD(l);
 
-   list_add(&r->r.fmr->list, &l);
+   list_add(&r->r.fmr.fmr->list, &l);
return ib_unmap_fmr(&l);
 }
 
@@ -113,7 +122,6 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
struct ib_device *device = ia->ri_device;
enum dma_data_direction direction = rpcrdma_data_dir(writing);
struct rpcrdma_mr_seg *seg1 = seg;
-   u64 physaddrs[RPCRDMA_MAX_DATA_SEGS];
int len, pageoff, i, rc;
struct rpcrdma_mw *mw;
 
@@ -138,7 +146,7 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
nsegs = RPCRDMA_MAX_FMR_SGES;
for (i = 0; i < nsegs;) {
rpcrdma_map_one(device, seg, direction);
-   physaddrs[i] = seg->mr_dma;
+   mw->r.fmr.physaddrs[i] = seg->mr_dma;
len += seg->mr_len;
++seg;
++i;
@@ -148,12 +156,13 @@ fmr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
break;
}
 
-   rc = ib_map_phys_fmr(mw->r.fmr, physaddrs, i, seg1->mr_dma);
+   rc = ib_map_phys_fmr(mw->r.fmr.fmr, mw->r.fmr.physaddrs,
+i, seg1->mr_dma);
if (rc)
goto out_maperr;
 
seg1->rl_mw = mw;
-   seg1->mr_rkey = mw->r.fmr->rkey;
+   seg1->mr_rkey = mw->r.fmr.fmr->rkey;
seg1->mr_base = seg1->mr_dma + pageoff;
seg1->mr_nsegs = i;
seg1->mr_len = len;
@@ -207,10 +216,13 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)
while (!list_empty(&buf->rb_all)) {
r = list_entry(buf->rb_all.next, struct rpcrdma_mw, mw_all);
list_del(&r->mw_all);
-   rc = ib_dealloc_fmr(r->r.fmr);
+   kfree(r->r.fmr.physaddrs);
+
+   rc = ib_dealloc_fmr(r->r.fmr.fmr);
if (rc)
dprintk("RPC:   %s: ib_dealloc_fmr failed %i\n",
__func__, rc);
+
kfree(r);
}
 }
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index df92884..110d685 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -206,9 +206,14 @@ struct rpcrdma_frmr {
struct rpcrdma_xprt *fr_xprt;
 };
 
+struct rpcrdma_fmr {
+   struct ib_fmr   *fmr;
+   u64 *physaddrs;
+};
+
 struct rpcrdma_mw {
union {
-   struct ib_fmr   *fmr;
+   struct rpcrdma_fmr  fmr;
struct rpcrdma_frmr frmr;
} r;
  

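For scale, assuming RPCRDMA_MAX_FMR_SGES = 64 as in the hunks above
(treat the constant as illustrative):

	8 bytes per u64 * 64 SGEs = 512 bytes per rpcrdma_mw

That buffer is now allocated once when the MR is created, in exchange
for removing the same 512 bytes from every fmr_op_map() stack frame.
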
[PATCH v3 10/17] xprtrdma: Remove unused LOCAL_INV recovery logic

2015-05-26 Thread Chuck Lever
Clean up: Remove functions no longer used to recover broken FRMRs.

Signed-off-by: Chuck Lever 
Reviewed-by: Steve Wise 
Reviewed-by: Sagi Grimberg 
Reviewed-by: Devesh Sharma 
Tested-By: Devesh Sharma 
Reviewed-by: Doug Ledford 
---
 net/sunrpc/xprtrdma/verbs.c |  109 ---
 1 file changed, 109 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 768bb77..a891cf7 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1204,33 +1204,6 @@ rpcrdma_put_mw(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mw *mw)
spin_unlock_irqrestore(&buf->rb_lock, flags);
 }
 
-/* "*mw" can be NULL when rpcrdma_buffer_get_mrs() fails, leaving
- * some req segments uninitialized.
- */
-static void
-rpcrdma_buffer_put_mr(struct rpcrdma_mw **mw, struct rpcrdma_buffer *buf)
-{
-   if (*mw) {
-   list_add_tail(&(*mw)->mw_list, &buf->rb_mws);
-   *mw = NULL;
-   }
-}
-
-/* Cycle mw's back in reverse order, and "spin" them.
- * This delays and scrambles reuse as much as possible.
- */
-static void
-rpcrdma_buffer_put_mrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf)
-{
-   struct rpcrdma_mr_seg *seg = req->rl_segments;
-   struct rpcrdma_mr_seg *seg1 = seg;
-   int i;
-
-   for (i = 1, seg++; i < RPCRDMA_MAX_SEGS; seg++, i++)
-   rpcrdma_buffer_put_mr(&seg->rl_mw, buf);
-   rpcrdma_buffer_put_mr(&seg1->rl_mw, buf);
-}
-
 static void
 rpcrdma_buffer_put_sendbuf(struct rpcrdma_req *req, struct rpcrdma_buffer *buf)
 {
@@ -1242,88 +1215,6 @@ rpcrdma_buffer_put_sendbuf(struct rpcrdma_req *req, struct rpcrdma_buffer *buf)
}
 }
 
-/* rpcrdma_unmap_one() was already done during deregistration.
- * Redo only the ib_post_send().
- */
-static void
-rpcrdma_retry_local_inv(struct rpcrdma_mw *r, struct rpcrdma_ia *ia)
-{
-   struct rpcrdma_xprt *r_xprt =
-   container_of(ia, struct rpcrdma_xprt, rx_ia);
-   struct ib_send_wr invalidate_wr, *bad_wr;
-   int rc;
-
-   dprintk("RPC:   %s: FRMR %p is stale\n", __func__, r);
-
-   /* When this FRMR is re-inserted into rb_mws, it is no longer stale */
-   r->r.frmr.fr_state = FRMR_IS_INVALID;
-
-   memset(&invalidate_wr, 0, sizeof(invalidate_wr));
-   invalidate_wr.wr_id = (unsigned long)(void *)r;
-   invalidate_wr.opcode = IB_WR_LOCAL_INV;
-   invalidate_wr.ex.invalidate_rkey = r->r.frmr.fr_mr->rkey;
-   DECR_CQCOUNT(&r_xprt->rx_ep);
-
-   dprintk("RPC:   %s: frmr %p invalidating rkey %08x\n",
-   __func__, r, r->r.frmr.fr_mr->rkey);
-
-   read_lock(&ia->ri_qplock);
-   rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
-   read_unlock(&ia->ri_qplock);
-   if (rc) {
-   /* Force rpcrdma_buffer_get() to retry */
-   r->r.frmr.fr_state = FRMR_IS_STALE;
-   dprintk("RPC:   %s: ib_post_send failed, %i\n",
-   __func__, rc);
-   }
-}
-
-static void
-rpcrdma_retry_flushed_linv(struct list_head *stale,
-  struct rpcrdma_buffer *buf)
-{
-   struct rpcrdma_ia *ia = rdmab_to_ia(buf);
-   struct list_head *pos;
-   struct rpcrdma_mw *r;
-   unsigned long flags;
-
-   list_for_each(pos, stale) {
-   r = list_entry(pos, struct rpcrdma_mw, mw_list);
-   rpcrdma_retry_local_inv(r, ia);
-   }
-
-   spin_lock_irqsave(&buf->rb_lock, flags);
-   list_splice_tail(stale, &buf->rb_mws);
-   spin_unlock_irqrestore(&buf->rb_lock, flags);
-}
-
-static struct rpcrdma_req *
-rpcrdma_buffer_get_frmrs(struct rpcrdma_req *req, struct rpcrdma_buffer *buf,
-struct list_head *stale)
-{
-   struct rpcrdma_mw *r;
-   int i;
-
-   i = RPCRDMA_MAX_SEGS - 1;
-   while (!list_empty(&buf->rb_mws)) {
-   r = list_entry(buf->rb_mws.next,
-  struct rpcrdma_mw, mw_list);
-   list_del(&r->mw_list);
-   if (r->r.frmr.fr_state == FRMR_IS_STALE) {
-   list_add(&r->mw_list, stale);
-   continue;
-   }
-   req->rl_segments[i].rl_mw = r;
-   if (unlikely(i-- == 0))
-   return req; /* Success */
-   }
-
-   /* Not enough entries on rb_mws for this req */
-   rpcrdma_buffer_put_sendbuf(req, buf);
-   rpcrdma_buffer_put_mrs(req, buf);
-   return NULL;
-}
-
 /*
  * Get a set of request/reply buffers.
  *



[PATCH v3 17/17] NFS: Fix size of NFSACL SETACL operations

2015-05-26 Thread Chuck Lever
When encoding the NFSACL SETACL operation, reserve just the estimated
size of the ACL rather than a fixed maximum. This eliminates needless
zero padding on the wire that the server ignores.

Fixes: ee5dc7732bd5 ('NFS: Fix "kernel BUG at fs/nfs/nfs3xdr.c:1338!"')
Signed-off-by: Chuck Lever 
---
 fs/nfs/nfs3xdr.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs/nfs3xdr.c b/fs/nfs/nfs3xdr.c
index 53852a4..9b04c2e 100644
--- a/fs/nfs/nfs3xdr.c
+++ b/fs/nfs/nfs3xdr.c
@@ -1342,7 +1342,7 @@ static void nfs3_xdr_enc_setacl3args(struct rpc_rqst *req,
if (args->npages != 0)
xdr_write_pages(xdr, args->pages, 0, args->len);
else
-   xdr_reserve_space(xdr, NFS_ACL_INLINE_BUFSIZE);
+   xdr_reserve_space(xdr, args->len);
 
error = nfsacl_encode(xdr->buf, base, args->inode,
(args->mask & NFS_ACL) ?



Re: [PATCH 0/5] Indirect memory registration feature

2015-06-09 Thread Chuck Lever
…hardware rkey, so it doesn’t work in this case.

What all this means is that significant extra complexity is required to
deal with transport disconnection, which is very rare compared to normal
data transfer operations.

FMR, for instance, has much less of an issue here because the map and
unmap verbs are synchronous, do not rely on having a QP, and do not
generate send completions. But FMR has other problems.
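
For reference, the synchronous FMR pattern being contrasted here, using
the verbs that appear in the fmr_ops patches earlier in this digest (a
sketch, not a complete function):

	struct ib_fmr *fmr;	/* from ib_alloc_fmr() */
	LIST_HEAD(l);

	/* map: a plain blocking verb -- no WR posted, no completion */
	rc = ib_map_phys_fmr(fmr, physaddrs, npages, iova);

	/* unmap: likewise synchronous, batched through a list */
	list_add(&fmr->list, &l);
	rc = ib_unmap_fmr(&l);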

> - There is the fmr_pool API which tries to tackle the disadvantages of
>  fmrs (very slow unmap) by delaying the fmr unmap until some dirty
>  watermark of remapping is met. I'm not sure how this can be done.

I wonder if FMR should be considered at all for a simplified API. Sure,
there are some older cards that do not support FRMR, but it seems like
FMR is going to be abandoned sooner or later.

> - How would the API choose the method to register memory?

Based on the capabilities of the HCA, I would think.

> - If there is an alignment issue, do we fail? do we bounce?
> 
> - There is the whole T10-DIF support…

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





Re: [PATCH V2 3/5] RDMA/core: transport-independent access flags

2015-06-30 Thread Chuck Lever

On Jun 30, 2015, at 10:29 AM, Steve Wise  wrote:

> 
> 
>> -Original Message-
>> From: Or Gerlitz [mailto:ogerl...@mellanox.com]
>> Sent: Tuesday, June 30, 2015 4:04 AM
>> To: Steve Wise; Chuck Lever
>> Cc: dledf...@redhat.com; r...@mellanox.com; sa...@mellanox.com; 
>> linux-rdma@vger.kernel.org; jguntho...@obsidianresearch.com;
>> infinip...@intel.com; e...@mellanox.com; sean.he...@intel.com
>> Subject: Re: [PATCH V2 3/5] RDMA/core: transport-independent access flags
>> 
>> On 6/30/2015 12:36 AM, Steve Wise wrote:
>>> The semantics for MR access flags are not consistent across RDMA
>>> protocols.  So rather than have applications try and glean what they
>>> need, have them pass in the intended roles and attributes for the MR to
>>> be allocated and let the RDMA core select the appropriate access flags
>>> given the roles, attributes, and device capabilities.
>> 
>> wait, we have NFSoRDMA (net/sunrpc/xprtrdma/*) in the kernel for years
>> and it works on top of both IB/RoCE and iWARP.I know they have there 5-6
>> memory registration methods.. but if we stick to their mode that uses
>> fast registration ala your upstream commit 0f7ec3 "RDMA/core: Add memory
>> management extensions support"  -- what's missing or how come it work
>> w/o the enhancement suggested here?Added Chuck.
> 
> NFSRDMA currently checks the transport type to decide how to set the access 
> flags for memory registration.  With the new services exported in this 
> series, we can change/simplify NFSRDMA to not have to know the transport 
> type.  

I was planning to look at this more closely soon, but if you
have patches I’d happily consider them, or you can just point
me to what needs to be changed and I can put it together for 4.3.

--
Chuck Lever





Re: [PATCH V2 2/5] ipath,qib: Expose max_sge_rd correctly

2015-06-30 Thread Chuck Lever

On Jun 30, 2015, at 12:54 PM, Jason Gunthorpe  
wrote:

> On Mon, Jun 29, 2015 at 04:36:13PM -0500, Steve Wise wrote:
>> Applications must not assume that max_sge and max_sge_rd
>> are the same, Hence expose max_sge_rd correctly as well.
> 
> Chuck,
> 
> Now that this works, can we change NFS RDMA and get rid of the
> rdma_cap_read_multi_sge stuff?

Steve can propose a patch, I’ll review. 

If this is NFS server only, it will have to go through Bruce’s
tree, or he can ACK it and it can go through linux-rdma.

--
Chuck Lever





Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-09 Thread Chuck Lever

On Jul 9, 2015, at 4:46 AM, Sagi Grimberg  wrote:

> On 7/8/2015 8:14 PM, Hefty, Sean wrote:
>>> I am still not clear if all of us agree that we need it.
>>> Sean and Steve had some disclaimers...
>> 
>> A single entry point doesn't help a whole lot if the app must deal with 
>> different behavior based on how the API is used.
> 
> It is true that different MRs will be used differently. However, not
> once we have found ourselves extending an API to add functionality. This
> means changing the API signature and changing all the call sites. Just
> recently we saw this in Steve's mr_roles and in Matan's timestamping
> support (changed ib_create_cq). When was the last time ib_create_qp API
> was modified?
> 
>> We have a single entry point for post_send, for example, and that makes 
>> things worse.
> 
> I don't necessarily agree. the fact that post_send have multiple entry
> points allows the consumer to concatenate a couple of those in a single
> post. That's beneficial to get maximum performance from your HW.
> 
>> IMO, we need fewer registration *methods* not fewer calls.
> 
> I do agree that we need fewer registration methods.

I also feel that would be a healthy direction.


> Let's review what we have today:
> 
> - FRWR: which is *the* standard way to register memory. It's fast,
>  non-blocking and has vast support.
> 
> - FMR: which is only supported in some Mellanox devices if I'm not
>  mistaken, it's fast, but has slow invalidation (unmap). It is also not
>  supported over VF.
>  * FMR_POOL API was designed to address the slow unmap but created
>    weak invalidation semantics (unmap is done only when some
>    remapping threshold is met).
> 
> - REG_PHYS_MR: which is supported by some devices. It actually
>  combines both MR allocation and registration in a single call (it is
>  the equivalent of user-space ibv_reg_mr)

There is also Memory Windows, but there may no longer be any
kernel consumers of that memory registration method.


> I don't consider the dma_mr a registration method. It provides physical
> memory access.
> 
> As for REG_PHYS_MR, this is the only synchronous registration method in
> the kernel. It is a simple interface, which is currently used by xprtrdma 
> when dma mr is not supported. We can consider removing it in
> the future if it turns out not to be useful.

Code review has shown the remaining ib_reg_phys_mr() call in xprtrdma
is never reached in the current code, and thus it will be removed
very soon.

There is one remaining kernel user of ib_reg_phys_mr() in 4.2: Lustre.


> As for FMR/FMR_POOL, I'd much rather wait until it becomes
> deprecated on its own than try to incorporate it in a
> modernized API.

> I think the stack can converge on FRWR as its primary registration
> method. Let's focus on making it better.


--
Chuck Lever





[PATCH v1 04/12] xprtrdma: Remove last ib_reg_phys_mr() call site

2015-07-09 Thread Chuck Lever
All HCA providers have an ib_get_dma_mr() verb. Thus
rpcrdma_ia_open() will either grab the device's local_dma_lkey if one
is available, or it will call ib_get_dma_mr() which is a 100%
guaranteed fallback. There is never any need to use the
ib_reg_phys_mr() code path in rpcrdma_register_internal(), so it can
be removed.

The remaining logic in rpcrdma_{de}register_internal() is folded
into rpcrdma_{alloc,free}_regbuf().

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |  102 ---
 net/sunrpc/xprtrdma/xprt_rdma.h |1 
 2 files changed, 21 insertions(+), 82 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 891c4ed..cdf5220 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1229,75 +1229,6 @@ rpcrdma_mapping_error(struct rpcrdma_mr_seg *seg)
(unsigned long long)seg->mr_dma, seg->mr_dmalen);
 }
 
-static int
-rpcrdma_register_internal(struct rpcrdma_ia *ia, void *va, int len,
-   struct ib_mr **mrp, struct ib_sge *iov)
-{
-   struct ib_phys_buf ipb;
-   struct ib_mr *mr;
-   int rc;
-
-   /*
-* All memory passed here was kmalloc'ed, therefore phys-contiguous.
-*/
-   iov->addr = ib_dma_map_single(ia->ri_device,
-   va, len, DMA_BIDIRECTIONAL);
-   if (ib_dma_mapping_error(ia->ri_device, iov->addr))
-   return -ENOMEM;
-
-   iov->length = len;
-
-   if (ia->ri_have_dma_lkey) {
-   *mrp = NULL;
-   iov->lkey = ia->ri_dma_lkey;
-   return 0;
-   } else if (ia->ri_bind_mem != NULL) {
-   *mrp = NULL;
-   iov->lkey = ia->ri_bind_mem->lkey;
-   return 0;
-   }
-
-   ipb.addr = iov->addr;
-   ipb.size = iov->length;
-   mr = ib_reg_phys_mr(ia->ri_pd, &ipb, 1,
-   IB_ACCESS_LOCAL_WRITE, &iov->addr);
-
-   dprintk("RPC:   %s: phys convert: 0x%llx "
-   "registered 0x%llx length %d\n",
-   __func__, (unsigned long long)ipb.addr,
-   (unsigned long long)iov->addr, len);
-
-   if (IS_ERR(mr)) {
-   *mrp = NULL;
-   rc = PTR_ERR(mr);
-   dprintk("RPC:   %s: failed with %i\n", __func__, rc);
-   } else {
-   *mrp = mr;
-   iov->lkey = mr->lkey;
-   rc = 0;
-   }
-
-   return rc;
-}
-
-static int
-rpcrdma_deregister_internal(struct rpcrdma_ia *ia,
-   struct ib_mr *mr, struct ib_sge *iov)
-{
-   int rc;
-
-   ib_dma_unmap_single(ia->ri_device,
-   iov->addr, iov->length, DMA_BIDIRECTIONAL);
-
-   if (NULL == mr)
-   return 0;
-
-   rc = ib_dereg_mr(mr);
-   if (rc)
-   dprintk("RPC:   %s: ib_dereg_mr failed %i\n", __func__, rc);
-   return rc;
-}
-
 /**
  * rpcrdma_alloc_regbuf - kmalloc and register memory for SEND/RECV buffers
  * @ia: controlling rpcrdma_ia
@@ -1317,26 +1248,30 @@ struct rpcrdma_regbuf *
 rpcrdma_alloc_regbuf(struct rpcrdma_ia *ia, size_t size, gfp_t flags)
 {
struct rpcrdma_regbuf *rb;
-   int rc;
+   struct ib_sge *iov;
 
-   rc = -ENOMEM;
rb = kmalloc(sizeof(*rb) + size, flags);
if (rb == NULL)
goto out;
 
-   rb->rg_size = size;
-   rb->rg_owner = NULL;
-   rc = rpcrdma_register_internal(ia, rb->rg_base, size,
-  &rb->rg_mr, &rb->rg_iov);
-   if (rc)
+   iov = &rb->rg_iov;
+   iov->addr = ib_dma_map_single(ia->ri_device,
+ (void *)rb->rg_base, size,
+ DMA_BIDIRECTIONAL);
+   if (ib_dma_mapping_error(ia->ri_device, iov->addr))
goto out_free;
 
+   iov->length = size;
+   iov->lkey = ia->ri_have_dma_lkey ?
+   ia->ri_dma_lkey : ia->ri_bind_mem->lkey;
+   rb->rg_size = size;
+   rb->rg_owner = NULL;
return rb;
 
 out_free:
kfree(rb);
 out:
-   return ERR_PTR(rc);
+   return ERR_PTR(-ENOMEM);
 }
 
 /**
@@ -1347,10 +1282,15 @@ out:
 void
 rpcrdma_free_regbuf(struct rpcrdma_ia *ia, struct rpcrdma_regbuf *rb)
 {
-   if (rb) {
-   rpcrdma_deregister_internal(ia, rb->rg_mr, &rb->rg_iov);
-   kfree(rb);
-   }
+   struct ib_sge *iov;
+
+   if (!rb)
+   return;
+
+   iov = &rb->rg_iov;
+   ib_dma_unmap_single(ia->ri_device,
+   iov->addr, iov->length, DMA_BIDIRECTIONAL);
+   kfree(rb);
 }
 
 /*
diff --git a/net/sunrp

[PATCH v1 03/12] xprtrdma: Increase default credit limit

2015-07-09 Thread Chuck Lever
In preparation for similar increases on NFS/RDMA servers, bump the
advertised credit limit for RPC/RDMA to 128. This allocates some
extra resources, but the client will continue to allow only the
number of RPCs in flight that the server requests via its advertised
credit limit.
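
(For a rough sense of the extra resources, not part of the patch: with
the 1024-byte default inline size visible in the context below, and
assuming one send and one receive buffer per credit, 128 credits
pre-allocates on the order of 128 x 2 x 1 KB = 256 KB of buffers per
connection, up from about 64 KB at 32 credits.)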

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/xprtrdma.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/xprtrdma.h b/include/linux/sunrpc/xprtrdma.h
index b176130..b7b279b 100644
--- a/include/linux/sunrpc/xprtrdma.h
+++ b/include/linux/sunrpc/xprtrdma.h
@@ -49,7 +49,7 @@
  * a single chunk type per message is supported currently.
  */
 #define RPCRDMA_MIN_SLOT_TABLE (2U)
-#define RPCRDMA_DEF_SLOT_TABLE (32U)
+#define RPCRDMA_DEF_SLOT_TABLE (128U)
 #define RPCRDMA_MAX_SLOT_TABLE (256U)
 
 #define RPCRDMA_DEF_INLINE  (1024) /* default inline max */


