Re: NULL pointer deref in k.o/for-4.5

2016-01-06 Thread Chuck Lever

> On Jan 6, 2016, at 5:25 PM, Or Gerlitz  wrote:
> 
> On Wed, Jan 6, 2016 at 9:20 PM, Chuck Lever  wrote:
>> And appears to be 100% reproducible. Any debugging
>> advice welcome!
> 
> This was reported here 2-3 times; this fixes it:
> https://patchwork.kernel.org/patch/7929551

Confirmed, that fixes it. Thanks, I never would have
guessed that was the fix.


--
Chuck Lever






Re: NULL pointer deref in k.o/for-4.5

2016-01-06 Thread Chuck Lever

> On Jan 6, 2016, at 1:16 PM, Chuck Lever  wrote:
> 
> Encountered the below just after booting my NFS/RDMA
> server with 4.4.0-rc6-00011-g6948cb2 (k.o/for-4.5 plus
> my NFS/RDMA for-4.5 patches). The system is up and
> ping-able via eth0, but high-level networking (like sshd
> and nfsd) does not work, and my ib0 i/f is missing.
> 
> This is an x86_64 system with one CX-3 Pro HCA.

And appears to be 100% reproducible. Any debugging
advice welcome!


> All seems well with a stock v4.4-rc4 kernel.
> 
> 
> Jan  6 12:44:13 klimt kernel:  mlx4_ib_add: mlx4_ib: Mellanox 
> ConnectX InfiniBand driver v2.2-1 (Feb 2014)
> Jan  6 12:44:13 klimt kernel:  mlx4_ib_add: counter index 0 for port 
> 1 allocated 0
> Jan  6 12:44:13 klimt kernel: BUG: unable to handle kernel NULL pointer 
> dereference at   (null)
> Jan  6 12:44:13 klimt kernel: IP: [] 
> __mutex_lock_slowpath+0x75/0x120
> Jan  6 12:44:13 klimt kernel: PGD 853947067 PUD 8546cb067 PMD 0 
> Jan  6 12:44:13 klimt kernel: Oops: 0002 [#1] SMP 
> Jan  6 12:44:13 klimt kernel: Modules linked in: mlx4_ib(+) mlx4_en ib_sa 
> ib_mad ib_core vxlan ip6_udp_tunnel udp_tunnel ib_addr sr_mod cdrom sd_mod 
> ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm 
> mlx4_core igb ahci libahci libata ptp pps_core dca i2c_algo_bit i2c_core 
> dm_mirror dm_region_hash dm_log dm_mod
> Jan  6 12:44:13 klimt kernel: CPU: 3 PID: 431 Comm: modprobe Not tainted 
> 4.4.0-rc6-00011-g6948cb2 #79
> Jan  6 12:44:13 klimt kernel: Hardware name: Supermicro Super 
> Server/X10SRL-F, BIOS 1.0c 09/09/2015
> Jan  6 12:44:13 klimt kernel: task: 88085571aa80 ti: 88084f414000 
> task.ti: 88084f414000
> Jan  6 12:44:13 klimt kernel: RIP: 0010:[]  
> [] __mutex_lock_slowpath+0x75/0x120
> Jan  6 12:44:13 klimt kernel: RSP: 0018:88084f417810  EFLAGS: 00010282
> Jan  6 12:44:13 klimt kernel: RAX:  RBX: 88084f633950 
> RCX: 88085571aa80
> Jan  6 12:44:13 klimt kernel: RDX: 0001 RSI: 88085571aae0 
> RDI: 88084f633954
> Jan  6 12:44:13 klimt kernel: RBP: 88084f417858 R08: 0101 
> R09: 880854f02f00
> Jan  6 12:44:13 klimt kernel: R10: a0150a85 R11: ea002156d400 
> R12: 88084f633954
> Jan  6 12:44:13 klimt kernel: R13: 88085571aa80 R14:  
> R15: 88084f633958
> Jan  6 12:44:13 klimt kernel: FS:  7f32227c0740() 
> GS:88087fcc() knlGS:
> Jan  6 12:44:13 klimt kernel: CS:  0010 DS:  ES:  CR0: 
> 80050033
> Jan  6 12:44:13 klimt kernel: CR2:  CR3: 000853cb6000 
> CR4: 001406e0
> Jan  6 12:44:13 klimt kernel: Stack:
> Jan  6 12:44:13 klimt kernel: 88084f633958  
> 81309502 3b473ac0
> Jan  6 12:44:13 klimt kernel: 88084f633950 88084f417888 
> 88084f633940 88084f633950
> Jan  6 12:44:13 klimt kernel: 88084f63 88084f417870 
> 8165271f 88084f63
> Jan  6 12:44:13 klimt kernel: Call Trace:
> Jan  6 12:44:13 klimt kernel: [] ? 
> get_from_free_list+0x42/0x50
> Jan  6 12:44:13 klimt kernel: [] mutex_lock+0x1f/0x2f
> Jan  6 12:44:13 klimt kernel: [] 
> iboe_process_mad.isra.13+0x77/0x190 [mlx4_ib]
> Jan  6 12:44:13 klimt kernel: [] 
> mlx4_ib_process_mad+0x4d4/0x550 [mlx4_ib]
> Jan  6 12:44:13 klimt kernel: [] ? 
> kernfs_next_descendant_post+0x1a/0x50
> Jan  6 12:44:13 klimt kernel: [] ? 
> kernfs_add_one+0x112/0x150
> Jan  6 12:44:13 klimt kernel: [] ? 
> kmem_cache_alloc_trace+0x3d/0x1d0
> Jan  6 12:44:13 klimt kernel: [] ? get_perf_mad+0x85/0x160 
> [ib_core]
> Jan  6 12:44:13 klimt kernel: [] get_perf_mad+0xee/0x160 
> [ib_core]
> Jan  6 12:44:13 klimt kernel: [] 
> get_counter_table+0x38/0x70 [ib_core]
> Jan  6 12:44:13 klimt kernel: [] ? 
> kmem_cache_alloc_trace+0xf8/0x1d0
> Jan  6 12:44:13 klimt kernel: [] ? add_port+0xc2/0x450 
> [ib_core]
> Jan  6 12:44:13 klimt kernel: [] add_port+0x10f/0x450 
> [ib_core]
> Jan  6 12:44:13 klimt kernel: [] 
> ib_device_register_sysfs+0xe8/0x160 [ib_core]
> Jan  6 12:44:13 klimt kernel: [] 
> ib_register_device+0x320/0x500 [ib_core]
> Jan  6 12:44:13 klimt kernel: [] ? vprintk_default+0x3b/0x40
> Jan  6 12:44:13 klimt kernel: [] ? printk+0x5d/0x74
> Jan  6 12:44:13 klimt kernel: [] mlx4_ib_add+0xbb9/0xfe0 
> [mlx4_ib]
> Jan  6 12:44:13 klimt kernel: [] ? 0xa023f000
> Jan  6 12:44:13 klimt kernel: [] mlx4_add_device+0x3f/0xb0 
> [mlx4_core]
> Jan  6 12:44:13 klimt kernel: [] ? 0xa023f000
> Jan  6 12:44:13 klimt kernel: [] 
> mlx4_register_interface+0xd2/0x100 [mlx4_core]
> Jan  6 12:44:13 klimt kernel: [] mlx4_ib_init+0x4c/0x1000 
> [mlx4_ib]

NULL pointer deref in k.o/for-4.5

2016-01-06 Thread Chuck Lever
Jan  6 12:44:13 klimt kernel: Code: 04 4c 89 e7 e8 9d 1a 00 00 8b 03 83 f8 01 
74 25 48 8b 43 10 4c 8d 7b 08 48 89 63 10 41 be ff ff ff ff 4c 89 3c 24 48 89 
44 24 08 <48> 89 20 4c 89 6c 24 10 eb 0b 31 c0 87 03 83 f8 01 75 d2 eb 57 
Jan  6 12:44:13 klimt kernel: RIP  [] 
__mutex_lock_slowpath+0x75/0x120
Jan  6 12:44:13 klimt kernel: RSP 
Jan  6 12:44:13 klimt kernel: CR2: 
Jan  6 12:44:13 klimt kernel: ---[ end trace cea4b2a7abe96d8c ]---


--
Chuck Lever






Re: [PATCH v4 00/11] NFS/RDMA server patches for v4.5

2015-12-24 Thread Chuck Lever
My functional test suite includes Cthon, iozone, dbench, fio,
multi-threaded builds of git and the Linux kernel, and xfstests.

This patch series passes with NFSv3, NFSv4.0, and now NFSv4.1.

--
Chuck Lever

> On Dec 23, 2015, at 21:00, J. Bruce Fields  wrote:
> 
>> On Wed, Dec 16, 2015 at 05:40:09PM +0530, Devesh Sharma wrote:
>> iozone passed on ocrdma device.
> 
> What other testing has there been of this patchset?
> 
> Connectathon, xfstests, and pynfs make more of an effort to test corner
> cases; iozone isn't much of a test of correctness.
> 
> --b.
> 
>> Link bounce fails to recover iozone
>> traffic; however, the failure is not related to this patch series. I am in
>> the process of finding the patch that broke it.
>> 
>> Tested-By: Devesh Sharma 
>> 
>>> On Tue, Dec 15, 2015 at 3:00 AM, Chuck Lever  wrote:
>>> Here are patches to support server-side bi-directional RPC/RDMA
>>> operation (to enable NFSv4.1 on RPC/RDMA transports). Thanks to
>>> all who reviewed v1, v2, and v3. This version has some significant
>>> changes since the previous one.
>>> 
>>> In preparation for Doug's final topic branch, Bruce, I've rebased
>>> these on Christoph's ib_device_attr branch. There were some merge
>>> conflicts which I've fixed and tested. These are ready for your
>>> review.
>>> 
>>> Also available in the "nfsd-rdma-for-4.5" topic branch of this git repo:
>>> 
>>> git://git.linux-nfs.org/projects/cel/cel-2.6.git
>>> 
>>> Or for browsing:
>>> 
>>> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-for-4.5
>>> 
>>> 
>>> Changes since v3:
>>> - Rebased on Christoph's ib_device_attr branch
>>> - Backchannel patches have been squashed together
>>> - Memory allocation overhaul to prevent blocking allocation
>>>  when sending backchannel calls
>>> 
>>> 
>>> Changes since v2:
>>> - Rebased on v4.4-rc4
>>> - Backchannel code in new source file to address dprintk issues
>>> - svc_rdma_get_context() now uses a pre-allocated cache
>>> - Dropped svc_rdma_send clean up
>>> 
>>> 
>>> Changes since v1:
>>> 
>>> - Rebased on v4.4-rc3
>>> - Removed the use of CONFIG_SUNRPC_BACKCHANNEL
>>> - Fixed computation of forward and backward max_requests
>>> - Updated some comments and patch descriptions
>>> - pr_err and pr_info converted to dprintk
>>> - Simplified svc_rdma_get_context()
>>> - Dropped patch removing access_flags field
>>> - NFSv4.1 callbacks tested with for-4.5 client
>>> 
>>> ---
>>> 
>>> Chuck Lever (11):
>>>  svcrdma: Do not send XDR roundup bytes for a write chunk
>>>  svcrdma: Clean up rdma_create_xprt()
>>>  svcrdma: Clean up process_context()
>>>  svcrdma: Improve allocation of struct svc_rdma_op_ctxt
>>>  svcrdma: Improve allocation of struct svc_rdma_req_map
>>>  svcrdma: Remove unused req_map and ctxt kmem_caches
>>>  svcrdma: Add gfp flags to svc_rdma_post_recv()
>>>  svcrdma: Remove last two __GFP_NOFAIL call sites
>>>  svcrdma: Make map_xdr non-static
>>>  svcrdma: Define maximum number of backchannel requests
>>>  svcrdma: Add class for RDMA backwards direction transport
>>> 
>>> 
>>> include/linux/sunrpc/svc_rdma.h|   37 ++-
>>> net/sunrpc/xprt.c  |1
>>> net/sunrpc/xprtrdma/Makefile   |2
>>> net/sunrpc/xprtrdma/svc_rdma.c |   41 ---
>>> net/sunrpc/xprtrdma/svc_rdma_backchannel.c |  371 
>>> 
>>> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c|   52 
>>> net/sunrpc/xprtrdma/svc_rdma_sendto.c  |   34 ++-
>>> net/sunrpc/xprtrdma/svc_rdma_transport.c   |  284 -
>>> net/sunrpc/xprtrdma/transport.c|   30 +-
>>> net/sunrpc/xprtrdma/xprt_rdma.h|   20 +-
>>> 10 files changed, 730 insertions(+), 142 deletions(-)
>>> create mode 100644 net/sunrpc/xprtrdma/svc_rdma_backchannel.c
>>> 
>>> --
>>> Signature


Re: [PATCH v4 01/11] svcrdma: Do not send XDR roundup bytes for a write chunk

2015-12-21 Thread Chuck Lever

> On Dec 21, 2015, at 4:29 PM, J. Bruce Fields  wrote:
> 
> On Mon, Dec 21, 2015 at 04:15:23PM -0500, Chuck Lever wrote:
>> 
>>> On Dec 21, 2015, at 4:07 PM, J. Bruce Fields  wrote:
>>> 
>>> On Mon, Dec 14, 2015 at 04:30:09PM -0500, Chuck Lever wrote:
>>>> Minor optimization: when dealing with write chunk XDR roundup, do
>>>> not post a Write WR for the zero bytes in the pad. Simply update
>>>> the write segment in the RPC-over-RDMA header to reflect the extra
>>>> pad bytes.
>>>> 
>>>> The Reply chunk is also a write chunk, but the server does not use
>>>> send_write_chunks() to send the Reply chunk. That's OK in this case:
>>>> the server Upper Layer typically marshals the Reply chunk contents
>>>> in a single contiguous buffer, without a separate tail for the XDR
>>>> pad.
>>>> 
>>>> The comments and the variable naming refer to "chunks" but what is
>>>> really meant is "segments." The existing code sends only one
>>>> xdr_write_chunk per RPC reply.
>>>> 
>>>> The fix assumes this as well. When the XDR pad in the first write
>>>> chunk is reached, the assumption is the Write list is complete and
>>>> send_write_chunks() returns.
>>>> 
>>>> That will remain a valid assumption until the server Upper Layer can
>>>> support multiple bulk payload results per RPC.
>>>> 
>>>> Signed-off-by: Chuck Lever 
>>>> ---
>>>> net/sunrpc/xprtrdma/svc_rdma_sendto.c |7 +++
>>>> 1 file changed, 7 insertions(+)
>>>> 
>>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
>>>> b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>>>> index 969a1ab..bad5eaa 100644
>>>> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>>>> @@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma 
>>>> *xprt,
>>>>arg_ch->rs_handle,
>>>>arg_ch->rs_offset,
>>>>write_len);
>>>> +
>>>> +  /* Do not send XDR pad bytes */
>>>> +  if (chunk_no && write_len < 4) {
>>>> +  chunk_no++;
>>>> +  break;
>>> 
>>> I'm pretty lost in this code.  Why does (chunk_no && write_len < 4) mean
>>> this is xdr padding?
>> 
>> Chunk zero is always data. Padding is always going to be
>> after the first chunk. Any chunk after chunk zero that is
>> shorter than XDR quad alignment is going to be a pad.
> 
> I don't really know what a chunk is.  Looking at the code:
> 
>   write_len = min(xfer_len, be32_to_cpu(arg_ch->rs_length));
> 
> so I guess the assumption is just that those rs_length's are always a
> multiple of four?

The example you recently gave was a two-byte NFS READ
that crosses a page boundary.

In that case, the NFSD would pass down an xdr_buf that
has one byte in a page, one byte in another page, and
a two-byte XDR pad. The logic introduced by this
optimization would be fooled, and neither the second
byte nor the XDR pad would be written to the client.
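
To make the failure concrete, here is a throwaway userspace sketch
(not part of the patch; the helper below simply models the
"chunk_no && write_len < 4" test) showing the second data segment
being mistaken for the pad:

#include <stdio.h>

/* Models the "chunk_no && write_len < 4" test from the patch above. */
static int treated_as_xdr_pad(unsigned int chunk_no, unsigned int write_len)
{
    return chunk_no && write_len < 4;
}

int main(void)
{
    /* Two-byte NFS READ crossing a page boundary: one data byte in
     * each of two pages, followed by a two-byte XDR pad. */
    unsigned int seg_len[] = { 1, 1, 2 };
    unsigned int i;

    for (i = 0; i < 3; i++)
        printf("segment %u (len %u): %s\n", i, seg_len[i],
               treated_as_xdr_pad(i, seg_len[i]) ?
               "skipped as pad" : "written to client");

    /* Segment 1 is real data but gets skipped; that is the false
     * positive described above. */
    return 0;
}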

Unless you can think of a way to recognize an XDR pad
in the xdr_buf 100% of the time, you should drop this
patch.

As far as I know, none of the other patches in this
series depend on this optimization, so please merge
them if you can.


> --b.
> 
>> 
>> Probably too clever. Is there a better way to detect
>> the XDR pad?
>> 
>> 
>>>> +  }
>>>> +
>>>>chunk_off = 0;
>>>>while (write_len) {
>>>>ret = send_write(xprt, rqstp,
>> 
>> --
>> Chuck Lever
>> 
>> 
>> 

--
Chuck Lever






Re: [PATCH v4 01/11] svcrdma: Do not send XDR roundup bytes for a write chunk

2015-12-21 Thread Chuck Lever

> On Dec 21, 2015, at 4:07 PM, J. Bruce Fields  wrote:
> 
> On Mon, Dec 14, 2015 at 04:30:09PM -0500, Chuck Lever wrote:
>> Minor optimization: when dealing with write chunk XDR roundup, do
>> not post a Write WR for the zero bytes in the pad. Simply update
>> the write segment in the RPC-over-RDMA header to reflect the extra
>> pad bytes.
>> 
>> The Reply chunk is also a write chunk, but the server does not use
>> send_write_chunks() to send the Reply chunk. That's OK in this case:
>> the server Upper Layer typically marshals the Reply chunk contents
>> in a single contiguous buffer, without a separate tail for the XDR
>> pad.
>> 
>> The comments and the variable naming refer to "chunks" but what is
>> really meant is "segments." The existing code sends only one
>> xdr_write_chunk per RPC reply.
>> 
>> The fix assumes this as well. When the XDR pad in the first write
>> chunk is reached, the assumption is the Write list is complete and
>> send_write_chunks() returns.
>> 
>> That will remain a valid assumption until the server Upper Layer can
>> support multiple bulk payload results per RPC.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>> net/sunrpc/xprtrdma/svc_rdma_sendto.c |7 +++
>> 1 file changed, 7 insertions(+)
>> 
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
>> b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> index 969a1ab..bad5eaa 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> @@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
>>  arg_ch->rs_handle,
>>  arg_ch->rs_offset,
>>  write_len);
>> +
>> +/* Do not send XDR pad bytes */
>> +if (chunk_no && write_len < 4) {
>> +chunk_no++;
>> +break;
> 
> I'm pretty lost in this code.  Why does (chunk_no && write_len < 4) mean
> this is xdr padding?

Chunk zero is always data. Padding is always going to be
after the first chunk. Any chunk after chunk zero that is
shorter than XDR quad alignment is going to be a pad.

Probably too clever. Is there a better way to detect
the XDR pad?
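
For reference, the roundup rule itself is trivial; the hard part is
telling, from the xdr_buf segments alone, which trailing bytes are the
pad. A quick userspace sketch of the rule (standard XDR quad
alignment, not code from this patch):

#include <stdio.h>

/* XDR rounds every opaque item up to a 4-byte (quad) boundary. */
static unsigned int xdr_pad_len(unsigned int len)
{
    return (4 - (len & 3)) & 3;
}

int main(void)
{
    unsigned int len;

    for (len = 1; len <= 5; len++)
        printf("payload %u bytes -> %u pad bytes\n", len, xdr_pad_len(len));
    return 0;
}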


>> +}
>> +
>>  chunk_off = 0;
>>  while (write_len) {
>>  ret = send_write(xprt, rqstp,

--
Chuck Lever






[PATCH v4 08/10] xprtrdma: Add ro_unmap_sync method for all-physical registration

2015-12-16 Thread Chuck Lever
physical's ro_unmap is synchronous already. The new ro_unmap_sync
method just has to DMA unmap all MRs associated with the RPC
request.

Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/physical_ops.c |   13 +
 1 file changed, 13 insertions(+)

diff --git a/net/sunrpc/xprtrdma/physical_ops.c 
b/net/sunrpc/xprtrdma/physical_ops.c
index 617b76f..dbb302e 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -83,6 +83,18 @@ physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg)
return 1;
 }
 
+/* DMA unmap all memory regions that were mapped for "req".
+ */
+static void
+physical_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+   struct ib_device *device = r_xprt->rx_ia.ri_device;
+   unsigned int i;
+
+   for (i = 0; req->rl_nchunks; --req->rl_nchunks)
+   rpcrdma_unmap_one(device, &req->rl_segments[i++]);
+}
+
 static void
 physical_op_destroy(struct rpcrdma_buffer *buf)
 {
@@ -90,6 +102,7 @@ physical_op_destroy(struct rpcrdma_buffer *buf)
 
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
.ro_map = physical_op_map,
+   .ro_unmap_sync  = physical_op_unmap_sync,
.ro_unmap   = physical_op_unmap,
.ro_open= physical_op_open,
.ro_maxpages= physical_op_maxpages,



[PATCH v4 09/10] xprtrdma: Invalidate in the RPC reply handler

2015-12-16 Thread Chuck Lever
There is a window between the time the RPC reply handler wakes the
waiting RPC task and when xprt_release() invokes ops->buf_free.
During this time, memory regions containing the data payload may
still be accessed by a broken or malicious server, but the RPC
application has already been allowed access to the memory containing
the RPC request's data payloads.

The server should be fenced from client memory containing RPC data
payloads _before_ the RPC application is allowed to continue.

This change also more strongly enforces send queue accounting. There
is a maximum number of RPC calls allowed to be outstanding. When an
RPC/RDMA transport is set up, just enough send queue resources are
allocated to handle registration, Send, and invalidation WRs for
each of those RPCs at the same time.

Before, additional RPC calls could be dispatched while invalidation
WRs were still consuming send WQEs. When invalidation WRs backed
up, dispatching additional RPCs resulted in a send queue overrun.

Now, the reply handler prevents RPC dispatch until invalidation is
complete. This prevents RPC call dispatch until there are enough
send queue resources to proceed.

Still to do: If an RPC exits early (say, ^C), the reply handler has
no opportunity to perform invalidation. Currently, xprt_rdma_free()
still frees remaining RDMA resources, which could deadlock.
Additional changes are needed to handle invalidation properly in this
case.

Reported-by: Jason Gunthorpe 
Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/rpc_rdma.c |   16 
 1 file changed, 16 insertions(+)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c10d969..0f28f2d 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -804,6 +804,11 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
if (req->rl_reply)
goto out_duplicate;
 
+   /* Sanity checking has passed. We are now committed
+* to complete this transaction.
+*/
+   list_del_init(&rqst->rq_list);
+   spin_unlock_bh(&xprt->transport_lock);
dprintk("RPC:   %s: reply 0x%p completes request 0x%p\n"
"   RPC request 0x%p xid 0x%08x\n",
__func__, rep, req, rqst,
@@ -888,12 +893,23 @@ badheader:
break;
}
 
+   /* Invalidate and flush the data payloads before waking the
+* waiting application. This guarantees the memory region is
+* properly fenced from the server before the application
+* accesses the data. It also ensures proper send flow
+* control: waking the next RPC waits until this RPC has
+* relinquished all its Send Queue entries.
+*/
+   if (req->rl_nchunks)
+   r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);
+
credits = be32_to_cpu(headerp->rm_credit);
if (credits == 0)
credits = 1;/* don't deadlock */
else if (credits > r_xprt->rx_buf.rb_max_requests)
credits = r_xprt->rx_buf.rb_max_requests;
 
+   spin_lock_bh(&xprt->transport_lock);
cwnd = xprt->cwnd;
xprt->cwnd = credits << RPC_CWNDSHIFT;
if (xprt->cwnd > cwnd)



[PATCH v4 10/10] xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').

2015-12-16 Thread Chuck Lever
The root of the problem was that sends (especially unsignalled
FASTREG and LOCAL_INV Work Requests) were not properly flow-
controlled, which allowed a send queue overrun.

Now that the RPC/RDMA reply handler waits for invalidation to
complete, the send queue is properly flow-controlled. Thus this
limit is no longer necessary.

Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/verbs.c |6 ++
 net/sunrpc/xprtrdma/xprt_rdma.h |6 --
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index f23f3d6..1867e3a 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -608,10 +608,8 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia 
*ia,
 
/* set trigger for requesting send completion */
ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
-   if (ep->rep_cqinit > RPCRDMA_MAX_UNSIGNALED_SENDS)
-   ep->rep_cqinit = RPCRDMA_MAX_UNSIGNALED_SENDS;
-   else if (ep->rep_cqinit <= 2)
-   ep->rep_cqinit = 0;
+   if (ep->rep_cqinit <= 2)
+   ep->rep_cqinit = 0; /* always signal? */
INIT_CQCOUNT(ep);
init_waitqueue_head(&ep->rep_connect_wait);
INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index b8bac41..a563ffc 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -87,12 +87,6 @@ struct rpcrdma_ep {
struct delayed_work rep_connect_worker;
 };
 
-/*
- * Force a signaled SEND Work Request every so often,
- * in case the provider needs to do some housekeeping.
- */
-#define RPCRDMA_MAX_UNSIGNALED_SENDS   (32)
-
 #define INIT_CQCOUNT(ep) atomic_set(&(ep)->rep_cqcount, (ep)->rep_cqinit)
 #define DECR_CQCOUNT(ep) atomic_sub_return(1, &(ep)->rep_cqcount)
 



[PATCH v4 06/10] xprtrdma: Add ro_unmap_sync method for FRWR

2015-12-16 Thread Chuck Lever
FRWR's ro_unmap is asynchronous. The new ro_unmap_sync posts
LOCAL_INV Work Requests and waits for them to complete before
returning.

Note also, DMA unmapping is now done _after_ invalidation.

Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/frwr_ops.c  |  136 ++-
 net/sunrpc/xprtrdma/xprt_rdma.h |2 +
 2 files changed, 134 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 660d0b6..aa078a0 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -244,12 +244,14 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
 }
 
-/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs to be reset. */
+/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs
+ * to be reset.
+ *
+ * WARNING: Only wr_id and status are reliable at this point
+ */
 static void
-frwr_sendcompletion(struct ib_wc *wc)
+__frwr_sendcompletion_flush(struct ib_wc *wc, struct rpcrdma_mw *r)
 {
-   struct rpcrdma_mw *r;
-
if (likely(wc->status == IB_WC_SUCCESS))
return;
 
@@ -260,9 +262,23 @@ frwr_sendcompletion(struct ib_wc *wc)
else
pr_warn("RPC:   %s: frmr %p error, status %s (%d)\n",
__func__, r, ib_wc_status_msg(wc->status), wc->status);
+
r->r.frmr.fr_state = FRMR_IS_STALE;
 }
 
+static void
+frwr_sendcompletion(struct ib_wc *wc)
+{
+   struct rpcrdma_mw *r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+   struct rpcrdma_frmr *f = &r->r.frmr;
+
+   if (unlikely(wc->status != IB_WC_SUCCESS))
+   __frwr_sendcompletion_flush(wc, r);
+
+   if (f->fr_waiter)
+   complete(&f->fr_linv_done);
+}
+
 static int
 frwr_op_init(struct rpcrdma_xprt *r_xprt)
 {
@@ -334,6 +350,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
} while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
frmr = &mw->r.frmr;
frmr->fr_state = FRMR_IS_VALID;
+   frmr->fr_waiter = false;
mr = frmr->fr_mr;
reg_wr = &frmr->fr_regwr;
 
@@ -413,6 +430,116 @@ out_senderr:
return rc;
 }
 
+static struct ib_send_wr *
+__frwr_prepare_linv_wr(struct rpcrdma_mr_seg *seg)
+{
+   struct rpcrdma_mw *mw = seg->rl_mw;
+   struct rpcrdma_frmr *f = &mw->r.frmr;
+   struct ib_send_wr *invalidate_wr;
+
+   f->fr_waiter = false;
+   f->fr_state = FRMR_IS_INVALID;
+   invalidate_wr = &f->fr_invwr;
+
+   memset(invalidate_wr, 0, sizeof(*invalidate_wr));
+   invalidate_wr->wr_id = (unsigned long)(void *)mw;
+   invalidate_wr->opcode = IB_WR_LOCAL_INV;
+   invalidate_wr->ex.invalidate_rkey = f->fr_mr->rkey;
+
+   return invalidate_wr;
+}
+
+static void
+__frwr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
+int rc)
+{
+   struct ib_device *device = r_xprt->rx_ia.ri_device;
+   struct rpcrdma_mw *mw = seg->rl_mw;
+   struct rpcrdma_frmr *f = &mw->r.frmr;
+
+   seg->rl_mw = NULL;
+
+   ib_dma_unmap_sg(device, f->sg, f->sg_nents, seg->mr_dir);
+
+   if (!rc)
+   rpcrdma_put_mw(r_xprt, mw);
+   else
+   __frwr_queue_recovery(mw);
+}
+
+/* Invalidate all memory regions that were registered for "req".
+ *
+ * Sleeps until it is safe for the host CPU to access the
+ * previously mapped memory regions.
+ */
+static void
+frwr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+   struct ib_send_wr *invalidate_wrs, *pos, *prev, *bad_wr;
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_mr_seg *seg;
+   unsigned int i, nchunks;
+   struct rpcrdma_frmr *f;
+   int rc;
+
+   dprintk("RPC:   %s: req %p\n", __func__, req);
+
+   /* ORDER: Invalidate all of the req's MRs first
+*
+* Chain the LOCAL_INV Work Requests and post them with
+* a single ib_post_send() call.
+*/
+   invalidate_wrs = pos = prev = NULL;
+   seg = NULL;
+   for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+   seg = &req->rl_segments[i];
+
+   pos = __frwr_prepare_linv_wr(seg);
+
+   if (!invalidate_wrs)
+   invalidate_wrs = pos;
+   else
+   prev->next = pos;
+   prev = pos;
+
+   i += seg->mr_nsegs;
+   }
+   f = &seg->rl_mw->r.frmr;
+
+   /* Strong send queue ordering guarantees that when the
+* last WR in the chain completes, all WRs in the chain
+* are complete.
+*/
+   f->fr_invwr.send_flags = IB_SEND_SIGNAL

[PATCH v4 07/10] xprtrdma: Add ro_unmap_sync method for FMR

2015-12-16 Thread Chuck Lever
FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
is a synchronous verb. However, some improvements can be made here.

1. Gather all the MRs for the RPC request onto a list, and invoke
   ib_unmap_fmr() once with that list. This reduces the number of
   doorbells when there is more than one MR to invalidate

2. Perform the DMA unmap _after_ the MRs are unmapped, not before.
   This is critical after invalidating a Write chunk.

Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/fmr_ops.c |   64 +
 1 file changed, 64 insertions(+)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index f1e8daf..c14f3a4 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -179,6 +179,69 @@ out_maperr:
return rc;
 }
 
+static void
+__fmr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+   struct ib_device *device = r_xprt->rx_ia.ri_device;
+   struct rpcrdma_mw *mw = seg->rl_mw;
+   int nsegs = seg->mr_nsegs;
+
+   seg->rl_mw = NULL;
+
+   while (nsegs--)
+   rpcrdma_unmap_one(device, seg++);
+
+   rpcrdma_put_mw(r_xprt, mw);
+}
+
+/* Invalidate all memory regions that were registered for "req".
+ *
+ * Sleeps until it is safe for the host CPU to access the
+ * previously mapped memory regions.
+ */
+static void
+fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+   struct rpcrdma_mr_seg *seg;
+   unsigned int i, nchunks;
+   struct rpcrdma_mw *mw;
+   LIST_HEAD(unmap_list);
+   int rc;
+
+   dprintk("RPC:   %s: req %p\n", __func__, req);
+
+   /* ORDER: Invalidate all of the req's MRs first
+*
+* ib_unmap_fmr() is slow, so use a single call instead
+* of one call per mapped MR.
+*/
+   for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+   seg = &req->rl_segments[i];
+   mw = seg->rl_mw;
+
+   list_add(&mw->r.fmr.fmr->list, &unmap_list);
+
+   i += seg->mr_nsegs;
+   }
+   rc = ib_unmap_fmr(&unmap_list);
+   if (rc)
+   pr_warn("%s: ib_unmap_fmr failed (%i)\n", __func__, rc);
+
+   /* ORDER: Now DMA unmap all of the req's MRs, and return
+* them to the free MW list.
+*/
+   for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+   seg = &req->rl_segments[i];
+
+   __fmr_dma_unmap(r_xprt, seg);
+
+   i += seg->mr_nsegs;
+   seg->mr_nsegs = 0;
+   }
+
+   req->rl_nchunks = 0;
+}
+
 /* Use the ib_unmap_fmr() verb to prevent further remote
  * access via RDMA READ or RDMA WRITE.
  */
@@ -231,6 +294,7 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)
 
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
+   .ro_unmap_sync  = fmr_op_unmap_sync,
.ro_unmap   = fmr_op_unmap,
.ro_open= fmr_op_open,
.ro_maxpages= fmr_op_maxpages,



[PATCH v4 04/10] xprtrdma: Move struct ib_send_wr off the stack

2015-12-16 Thread Chuck Lever
For FRWR FASTREG and LOCAL_INV, move the ib_*_wr structure off
the stack. This allows frwr_op_map and frwr_op_unmap to chain
WRs together without limit to register or invalidate a set of MRs
with a single ib_post_send().

(This will be for chaining LOCAL_INV requests).

Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/frwr_ops.c  |   38 --
 net/sunrpc/xprtrdma/xprt_rdma.h |4 
 2 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index ae2a241..660d0b6 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -318,7 +318,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
struct rpcrdma_mw *mw;
struct rpcrdma_frmr *frmr;
struct ib_mr *mr;
-   struct ib_reg_wr reg_wr;
+   struct ib_reg_wr *reg_wr;
struct ib_send_wr *bad_wr;
int rc, i, n, dma_nents;
u8 key;
@@ -335,6 +335,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
frmr = &mw->r.frmr;
frmr->fr_state = FRMR_IS_VALID;
mr = frmr->fr_mr;
+   reg_wr = &frmr->fr_regwr;
 
if (nsegs > ia->ri_max_frmr_depth)
nsegs = ia->ri_max_frmr_depth;
@@ -380,19 +381,19 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
key = (u8)(mr->rkey & 0x00FF);
ib_update_fast_reg_key(mr, ++key);
 
-   reg_wr.wr.next = NULL;
-   reg_wr.wr.opcode = IB_WR_REG_MR;
-   reg_wr.wr.wr_id = (uintptr_t)mw;
-   reg_wr.wr.num_sge = 0;
-   reg_wr.wr.send_flags = 0;
-   reg_wr.mr = mr;
-   reg_wr.key = mr->rkey;
-   reg_wr.access = writing ?
-   IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
-   IB_ACCESS_REMOTE_READ;
+   reg_wr->wr.next = NULL;
+   reg_wr->wr.opcode = IB_WR_REG_MR;
+   reg_wr->wr.wr_id = (uintptr_t)mw;
+   reg_wr->wr.num_sge = 0;
+   reg_wr->wr.send_flags = 0;
+   reg_wr->mr = mr;
+   reg_wr->key = mr->rkey;
+   reg_wr->access = writing ?
+IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+IB_ACCESS_REMOTE_READ;
 
DECR_CQCOUNT(&r_xprt->rx_ep);
-   rc = ib_post_send(ia->ri_id->qp, ®_wr.wr, &bad_wr);
+   rc = ib_post_send(ia->ri_id->qp, ®_wr->wr, &bad_wr);
if (rc)
goto out_senderr;
 
@@ -422,23 +423,24 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mw *mw = seg1->rl_mw;
struct rpcrdma_frmr *frmr = &mw->r.frmr;
-   struct ib_send_wr invalidate_wr, *bad_wr;
+   struct ib_send_wr *invalidate_wr, *bad_wr;
int rc, nsegs = seg->mr_nsegs;
 
dprintk("RPC:   %s: FRMR %p\n", __func__, mw);
 
seg1->rl_mw = NULL;
frmr->fr_state = FRMR_IS_INVALID;
+   invalidate_wr = &mw->r.frmr.fr_invwr;
 
-   memset(&invalidate_wr, 0, sizeof(invalidate_wr));
-   invalidate_wr.wr_id = (unsigned long)(void *)mw;
-   invalidate_wr.opcode = IB_WR_LOCAL_INV;
-   invalidate_wr.ex.invalidate_rkey = frmr->fr_mr->rkey;
+   memset(invalidate_wr, 0, sizeof(*invalidate_wr));
+   invalidate_wr->wr_id = (uintptr_t)mw;
+   invalidate_wr->opcode = IB_WR_LOCAL_INV;
+   invalidate_wr->ex.invalidate_rkey = frmr->fr_mr->rkey;
DECR_CQCOUNT(&r_xprt->rx_ep);
 
ib_dma_unmap_sg(ia->ri_device, frmr->sg, frmr->sg_nents, seg1->mr_dir);
read_lock(&ia->ri_qplock);
-   rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
+   rc = ib_post_send(ia->ri_id->qp, invalidate_wr, &bad_wr);
read_unlock(&ia->ri_qplock);
if (rc)
goto out_err;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 4197191..b1065ca 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -206,6 +206,10 @@ struct rpcrdma_frmr {
enum rpcrdma_frmr_state fr_state;
struct work_struct  fr_work;
struct rpcrdma_xprt *fr_xprt;
+   union {
+   struct ib_reg_wrfr_regwr;
+   struct ib_send_wr   fr_invwr;
+   };
 };
 
 struct rpcrdma_fmr {



[PATCH v4 02/10] xprtrdma: xprt_rdma_free() must not release backchannel reqs

2015-12-16 Thread Chuck Lever
Preserve any rpcrdma_req that is attached to rpc_rqst's allocated
for the backchannel. Otherwise, after all the pre-allocated
backchannel req's are consumed, incoming backward calls start
writing on freed memory.

Somehow this hunk got lost.

Fixes: f531a5dbc451 ('xprtrdma: Pre-allocate backward rpc_rqst')
Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/transport.c |3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 8c545f7..740bddc 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -576,6 +576,9 @@ xprt_rdma_free(void *buffer)
 
rb = container_of(buffer, struct rpcrdma_regbuf, rg_base[0]);
req = rb->rg_owner;
+   if (req->rl_backchannel)
+   return;
+
r_xprt = container_of(req->rl_buffer, struct rpcrdma_xprt, rx_buf);
 
dprintk("RPC:   %s: called on 0x%p\n", __func__, req->rl_reply);



[PATCH v4 03/10] xprtrdma: Disable RPC/RDMA backchannel debugging messages

2015-12-16 Thread Chuck Lever
Clean up.

Fixes: 63cae47005af ('xprtrdma: Handle incoming backward direction')
Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/backchannel.c |   16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/sunrpc/xprtrdma/backchannel.c 
b/net/sunrpc/xprtrdma/backchannel.c
index 11d2cfb..cd31181 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -15,7 +15,7 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
-#define RPCRDMA_BACKCHANNEL_DEBUG
+#undef RPCRDMA_BACKCHANNEL_DEBUG
 
 static void rpcrdma_bc_free_rqst(struct rpcrdma_xprt *r_xprt,
 struct rpc_rqst *rqst)
@@ -136,6 +136,7 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int 
reqs)
   __func__);
goto out_free;
}
+   dprintk("RPC:   %s: new rqst %p\n", __func__, rqst);
 
rqst->rq_xprt = &r_xprt->rx_xprt;
INIT_LIST_HEAD(&rqst->rq_list);
@@ -216,12 +217,14 @@ int rpcrdma_bc_marshal_reply(struct rpc_rqst *rqst)
 
rpclen = rqst->rq_svec[0].iov_len;
 
+#ifdef RPCRDMA_BACKCHANNEL_DEBUG
pr_info("RPC:   %s: rpclen %zd headerp 0x%p lkey 0x%x\n",
__func__, rpclen, headerp, rdmab_lkey(req->rl_rdmabuf));
pr_info("RPC:   %s: RPC/RDMA: %*ph\n",
__func__, (int)RPCRDMA_HDRLEN_MIN, headerp);
pr_info("RPC:   %s:  RPC: %*ph\n",
__func__, (int)rpclen, rqst->rq_svec[0].iov_base);
+#endif
 
req->rl_send_iov[0].addr = rdmab_addr(req->rl_rdmabuf);
req->rl_send_iov[0].length = RPCRDMA_HDRLEN_MIN;
@@ -265,6 +268,9 @@ void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
 {
struct rpc_xprt *xprt = rqst->rq_xprt;
 
+   dprintk("RPC:   %s: freeing rqst %p (req %p)\n",
+   __func__, rqst, rpcr_to_rdmar(rqst));
+
smp_mb__before_atomic();
WARN_ON_ONCE(!test_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state));
clear_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state);
@@ -329,9 +335,7 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt,
struct rpc_rqst, rq_bc_pa_list);
list_del(&rqst->rq_bc_pa_list);
spin_unlock(&xprt->bc_pa_lock);
-#ifdef RPCRDMA_BACKCHANNEL_DEBUG
-   pr_info("RPC:   %s: using rqst %p\n", __func__, rqst);
-#endif
+   dprintk("RPC:   %s: using rqst %p\n", __func__, rqst);
 
/* Prepare rqst */
rqst->rq_reply_bytes_recvd = 0;
@@ -351,10 +355,8 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt,
 * direction reply.
 */
req = rpcr_to_rdmar(rqst);
-#ifdef RPCRDMA_BACKCHANNEL_DEBUG
-   pr_info("RPC:   %s: attaching rep %p to req %p\n",
+   dprintk("RPC:   %s: attaching rep %p to req %p\n",
__func__, rep, req);
-#endif
req->rl_reply = rep;
 
/* Defeat the retransmit detection logic in send_request */



[PATCH v4 05/10] xprtrdma: Introduce ro_unmap_sync method

2015-12-16 Thread Chuck Lever
In the current xprtrdma implementation, some memreg strategies
implement ro_unmap synchronously (the MR is knocked down before the
method returns) and some asynchronously (the MR will be knocked down
and returned to the pool in the background).

To guarantee the MR is truly invalid before the RPC consumer is
allowed to resume execution, we need an unmap method that is
always synchronous, invoked from the RPC/RDMA reply handler.

The new method unmaps all MRs for an RPC. The existing ro_unmap
method unmaps only one MR at a time.

Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/xprt_rdma.h |2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index b1065ca..512184d 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -367,6 +367,8 @@ struct rpcrdma_xprt;
 struct rpcrdma_memreg_ops {
int (*ro_map)(struct rpcrdma_xprt *,
  struct rpcrdma_mr_seg *, int, bool);
+   void(*ro_unmap_sync)(struct rpcrdma_xprt *,
+struct rpcrdma_req *);
int (*ro_unmap)(struct rpcrdma_xprt *,
struct rpcrdma_mr_seg *);
int (*ro_open)(struct rpcrdma_ia *,



[PATCH v4 01/10] xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)

2015-12-16 Thread Chuck Lever
Clean up.

rb_lock critical sections added in rpcrdma_ep_post_extra_recv()
should have first been converted to use normal spin_lock now that
the reply handler is a work queue.

The backchannel set up code should use the appropriate helper
instead of open-coding a rb_recv_bufs list add.

Problem introduced by glib patch re-ordering on my part.

Fixes: f531a5dbc451 ('xprtrdma: Pre-allocate backward rpc_rqst')
Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/backchannel.c |6 +-
 net/sunrpc/xprtrdma/verbs.c   |7 +++
 2 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/net/sunrpc/xprtrdma/backchannel.c 
b/net/sunrpc/xprtrdma/backchannel.c
index 2dcb44f..11d2cfb 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -84,9 +84,7 @@ out_fail:
 static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
 unsigned int count)
 {
-   struct rpcrdma_buffer *buffers = &r_xprt->rx_buf;
struct rpcrdma_rep *rep;
-   unsigned long flags;
int rc = 0;
 
while (count--) {
@@ -98,9 +96,7 @@ static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
break;
}
 
-   spin_lock_irqsave(&buffers->rb_lock, flags);
-   list_add(&rep->rr_list, &buffers->rb_recv_bufs);
-   spin_unlock_irqrestore(&buffers->rb_lock, flags);
+   rpcrdma_recv_buffer_put(rep);
}
 
return rc;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 650034b..f23f3d6 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1329,15 +1329,14 @@ rpcrdma_ep_post_extra_recv(struct rpcrdma_xprt *r_xprt, 
unsigned int count)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_ep *ep = &r_xprt->rx_ep;
struct rpcrdma_rep *rep;
-   unsigned long flags;
int rc;
 
while (count--) {
-   spin_lock_irqsave(&buffers->rb_lock, flags);
+   spin_lock(&buffers->rb_lock);
if (list_empty(&buffers->rb_recv_bufs))
goto out_reqbuf;
rep = rpcrdma_buffer_get_rep_locked(buffers);
-   spin_unlock_irqrestore(&buffers->rb_lock, flags);
+   spin_unlock(&buffers->rb_lock);
 
rc = rpcrdma_ep_post_recv(ia, ep, rep);
if (rc)
@@ -1347,7 +1346,7 @@ rpcrdma_ep_post_extra_recv(struct rpcrdma_xprt *r_xprt, 
unsigned int count)
return 0;
 
 out_reqbuf:
-   spin_unlock_irqrestore(&buffers->rb_lock, flags);
+   spin_unlock(&buffers->rb_lock);
pr_warn("%s: no extra receive buffers\n", __func__);
return -ENOMEM;
 



[PATCH v4 00/10] NFS/RDMA client patches for 4.5

2015-12-16 Thread Chuck Lever
For 4.5, I'd like to address the send queue accounting and
invalidation/unmap ordering issues Jason brought up a couple of
months ago.

Also available in the "nfs-rdma-for-4.5" topic branch of this git repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git

Or for browsing:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.5


Changes since v3:
- Dropped xprt_commit_rqst()
- __frmr_dma_unmap now uses ib_dma_unmap_sg()
- Use transparent union in struct rpcrdma_frmr


Changes since v2:
- Rebased on Christoph's ib_device_attr branch


Changes since v1:

- Rebased on v4.4-rc3
- Receive buffer safety margin patch dropped
- Backchannel pr_err and pr_info converted to dprintk
- Backchannel spin locks converted to work queue-safe locks
- Fixed premature release of backchannel request buffer
- NFSv4.1 callbacks tested with for-4.5 server

---

Chuck Lever (10):
  xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)
  xprtrdma: xprt_rdma_free() must not release backchannel reqs
  xprtrdma: Disable RPC/RDMA backchannel debugging messages
  xprtrdma: Move struct ib_send_wr off the stack
  xprtrdma: Introduce ro_unmap_sync method
  xprtrdma: Add ro_unmap_sync method for FRWR
  xprtrdma: Add ro_unmap_sync method for FMR
  xprtrdma: Add ro_unmap_sync method for all-physical registration
  xprtrdma: Invalidate in the RPC reply handler
  xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').


 net/sunrpc/xprtrdma/backchannel.c  |   22 ++---
 net/sunrpc/xprtrdma/fmr_ops.c  |   64 +
 net/sunrpc/xprtrdma/frwr_ops.c |  174 +++-
 net/sunrpc/xprtrdma/physical_ops.c |   13 +++
 net/sunrpc/xprtrdma/rpc_rdma.c |   16 +++
 net/sunrpc/xprtrdma/transport.c|3 +
 net/sunrpc/xprtrdma/verbs.c|   13 +--
 net/sunrpc/xprtrdma/xprt_rdma.h|   14 ++-
 8 files changed, 271 insertions(+), 48 deletions(-)

--
Chuck Lever


Re: [PATCH] svc_rdma: use local_dma_lkey

2015-12-16 Thread Chuck Lever

> On Dec 16, 2015, at 10:11 AM, Christoph Hellwig  wrote:
> 
> We now always have a per-PD local_dma_lkey available.  Make use of that
> fact in svc_rdma and stop registering our own MR.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Chuck Lever 

> ---
> include/linux/sunrpc/svc_rdma.h|  2 --
> net/sunrpc/xprtrdma/svc_rdma_backchannel.c |  2 +-
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c|  4 ++--
> net/sunrpc/xprtrdma/svc_rdma_sendto.c  |  6 ++---
> net/sunrpc/xprtrdma/svc_rdma_transport.c   | 36 --
> 5 files changed, 10 insertions(+), 40 deletions(-)
> 
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index b13513a..5322fea 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -156,13 +156,11 @@ struct svcxprt_rdma {
>   struct ib_qp *sc_qp;
>   struct ib_cq *sc_rq_cq;
>   struct ib_cq *sc_sq_cq;
> - struct ib_mr *sc_phys_mr;   /* MR for server memory */
>   int  (*sc_reader)(struct svcxprt_rdma *,
> struct svc_rqst *,
> struct svc_rdma_op_ctxt *,
> int *, u32 *, u32, u32, u64, bool);
>   u32  sc_dev_caps;   /* distilled device caps */
> - u32  sc_dma_lkey;   /* local dma key */
>   unsigned int sc_frmr_pg_list_len;
>   struct list_head sc_frmr_q;
>   spinlock_t   sc_frmr_q_lock;
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c 
> b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> index 417cec1..c428734 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> @@ -128,7 +128,7 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
> 
>   ctxt->wr_op = IB_WR_SEND;
>   ctxt->direction = DMA_TO_DEVICE;
> - ctxt->sge[0].lkey = rdma->sc_dma_lkey;
> + ctxt->sge[0].lkey = rdma->sc_pd->local_dma_lkey;
>   ctxt->sge[0].length = sndbuf->len;
>   ctxt->sge[0].addr =
>   ib_dma_map_page(rdma->sc_cm_id->device, ctxt->pages[0], 0,
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
> b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index 3dfe464..c8b8a8b 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -144,6 +144,7 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
> 
>   head->arg.pages[pg_no] = rqstp->rq_arg.pages[pg_no];
>   head->arg.page_len += len;
> +
>   head->arg.len += len;
>   if (!pg_off)
>   head->count++;
> @@ -160,8 +161,7 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
>   goto err;
>   atomic_inc(&xprt->sc_dma_used);
> 
> - /* The lkey here is either a local dma lkey or a dma_mr lkey */
> - ctxt->sge[pno].lkey = xprt->sc_dma_lkey;
> + ctxt->sge[pno].lkey = xprt->sc_pd->local_dma_lkey;
>   ctxt->sge[pno].length = len;
>   ctxt->count++;
> 
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
> b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index ced3151..20bd5d4 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -265,7 +265,7 @@ static int send_write(struct svcxprt_rdma *xprt, struct 
> svc_rqst *rqstp,
>sge[sge_no].addr))
>   goto err;
>   atomic_inc(&xprt->sc_dma_used);
> - sge[sge_no].lkey = xprt->sc_dma_lkey;
> + sge[sge_no].lkey = xprt->sc_pd->local_dma_lkey;
>   ctxt->count++;
>   sge_off = 0;
>   sge_no++;
> @@ -487,7 +487,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
>   ctxt->count = 1;
> 
>   /* Prepare the SGE for the RPCRDMA Header */
> - ctxt->sge[0].lkey = rdma->sc_dma_lkey;
> + ctxt->sge[0].lkey = rdma->sc_pd->local_dma_lkey;
>   ctxt->sge[0].length = svc_rdma_xdr_get_reply_hdr_len(rdma_resp);
>   ctxt->sge[0].addr =
>   ib_dma_map_page(rdma->sc_cm_id->device, page, 0,
> @@ -511,7 +511,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
>ctxt->sge[sge_no].addr))
>   goto err;
>   atomic_inc(&rdma->sc_dma_used);
> - ctxt->sge[sge_no].lkey = rdma-&

Re: [PATCH v3 04/11] xprtrdma: Move struct ib_send_wr off the stack

2015-12-16 Thread Chuck Lever

> On Dec 16, 2015, at 10:11 AM, Christoph Hellwig  wrote:
> 
> On Wed, Dec 16, 2015 at 10:06:33AM -0500, Chuck Lever wrote:
>>> Would it make sense to unionize these as they are guaranteed not to
>>> execute together? Some people don't like this sort of savings.
>> 
>> I dislike unions because they make the code that uses
>> them less readable. I can define macros to help that,
>> but sigh! OK.
> 
> Shouldn't be an issue with transparent unions these days:
> 
>   union {
>   struct ib_reg_wrfr_regwr;
>   struct ib_send_wr   fr_invwr;
>   };

Right, but isn't that a gcc-ism that Al hates? If
everyone is OK with that construction, I will use it.
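
(For what it's worth, anonymous struct/union members started life as a
GNU extension but have been standard C since C11.) A minimal userspace
sketch of the construct, using toy stand-ins rather than the real
ib_reg_wr/ib_send_wr definitions:

#include <stdio.h>

/* Toy stand-ins for the kernel structures; layout is illustrative only. */
struct ib_reg_wr  { unsigned long wr_id; unsigned int key; };
struct ib_send_wr { unsigned long wr_id; int opcode; };

struct rpcrdma_frmr_demo {
    int fr_state;
    union {                     /* anonymous: no member name needed */
        struct ib_reg_wr  fr_regwr;
        struct ib_send_wr fr_invwr;
    };
};

int main(void)
{
    struct rpcrdma_frmr_demo f = { 0 };

    /* Only one WR is in flight at a time, so the two share storage
     * and are accessed directly through the containing struct. */
    f.fr_regwr.key = 42;
    printf("fr_regwr.key = %u\n", f.fr_regwr.key);

    f.fr_invwr.opcode = 7;
    printf("fr_invwr.opcode = %d\n", f.fr_invwr.opcode);
    return 0;
}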

--
Chuck Lever






Re: [PATCH v3 09/11] SUNRPC: Introduce xprt_commit_rqst()

2015-12-16 Thread Chuck Lever
Hi Anna-


> On Dec 16, 2015, at 8:48 AM, Anna Schumaker  wrote:
> 
> Hi Chuck,
> 
> Sorry for the last minute comment.
> 
> On 12/14/2015 04:19 PM, Chuck Lever wrote:
>> I'm about to add code in the RPC/RDMA reply handler between the
>> xprt_lookup_rqst() and xprt_complete_rqst() call site that needs
>> to execute outside of spinlock critical sections.
>> 
>> Add a hook to remove an rpc_rqst from the pending list once
>> the transport knows it's going to invoke xprt_complete_rqst().
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>> include/linux/sunrpc/xprt.h|1 +
>> net/sunrpc/xprt.c  |   14 ++
>> net/sunrpc/xprtrdma/rpc_rdma.c |4 
>> 3 files changed, 19 insertions(+)
>> 
>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>> index 69ef5b3..ab6c3a5 100644
>> --- a/include/linux/sunrpc/xprt.h
>> +++ b/include/linux/sunrpc/xprt.h
>> @@ -366,6 +366,7 @@ void 
>> xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action);
>> void xprt_write_space(struct rpc_xprt *xprt);
>> void xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task 
>> *task, int result);
>> struct rpc_rqst *xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
>> +voidxprt_commit_rqst(struct rpc_task *task);
>> void xprt_complete_rqst(struct rpc_task *task, int copied);
>> void xprt_release_rqst_cong(struct rpc_task *task);
>> void xprt_disconnect_done(struct rpc_xprt *xprt);
>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>> index 2e98f4a..a5be4ab 100644
>> --- a/net/sunrpc/xprt.c
>> +++ b/net/sunrpc/xprt.c
>> @@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task)
>> }
>> 
>> /**
>> + * xprt_commit_rqst - remove rqst from pending list early
>> + * @task: RPC request to remove
>> + *
>> + * Caller holds transport lock.
>> + */
>> +void xprt_commit_rqst(struct rpc_task *task)
>> +{
>> +struct rpc_rqst *req = task->tk_rqstp;
>> +
>> +list_del_init(&req->rq_list);
>> +}
>> +EXPORT_SYMBOL_GPL(xprt_commit_rqst);
> 
> Can you move this function into the xprtrdma code, since it's not called 
> outside of there?

I think that's a layering violation, and the idea is
to allow other transports to use this API eventually.

But I'll include this change in the next version of
the series.


> Thanks,
> Anna
> 
>> +
>> +/**
>>  * xprt_complete_rqst - called when reply processing is complete
>>  * @task: RPC request that recently completed
>>  * @copied: actual number of bytes received from the transport
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index c10d969..0bc8c39 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -804,6 +804,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>  if (req->rl_reply)
>>  goto out_duplicate;
>> 
>> +xprt_commit_rqst(rqst->rq_task);
>> +spin_unlock_bh(&xprt->transport_lock);
>> +
>>  dprintk("RPC:   %s: reply 0x%p completes request 0x%p\n"
>>  "   RPC request 0x%p xid 0x%08x\n",
>>  __func__, rep, req, rqst,
>> @@ -894,6 +897,7 @@ badheader:
>>  else if (credits > r_xprt->rx_buf.rb_max_requests)
>>  credits = r_xprt->rx_buf.rb_max_requests;
>> 
>> +spin_lock_bh(&xprt->transport_lock);
>>  cwnd = xprt->cwnd;
>>  xprt->cwnd = credits << RPC_CWNDSHIFT;
>>  if (xprt->cwnd > cwnd)
>> 
> 

--
Chuck Lever






Re: [PATCH v3 06/11] xprtrdma: Add ro_unmap_sync method for FRWR

2015-12-16 Thread Chuck Lever

> On Dec 16, 2015, at 8:57 AM, Sagi Grimberg  wrote:
> 
> 
>> +static void
>> +__frwr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>> + int rc)
>> +{
>> +struct ib_device *device = r_xprt->rx_ia.ri_device;
>> +struct rpcrdma_mw *mw = seg->rl_mw;
>> +int nsegs = seg->mr_nsegs;
>> +
>> +seg->rl_mw = NULL;
>> +
>> +while (nsegs--)
>> +rpcrdma_unmap_one(device, seg++);
> 
> Chuck, shouldn't this be replaced with ib_dma_unmap_sg?

Looks like this was left over from before the conversion
to use ib_dma_unmap_sg. I'll have a look.

> Sorry for the late comment (Didn't find enough time to properly
> review this...)

--
Chuck Lever






Re: [PATCH v3 04/11] xprtrdma: Move struct ib_send_wr off the stack

2015-12-16 Thread Chuck Lever

> On Dec 16, 2015, at 9:00 AM, Sagi Grimberg  wrote:
> 
> 
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h 
>> b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index 4197191..e60d817 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -206,6 +206,8 @@ struct rpcrdma_frmr {
>>  enum rpcrdma_frmr_state fr_state;
>>  struct work_struct  fr_work;
>>  struct rpcrdma_xprt *fr_xprt;
>> +struct ib_reg_wrfr_regwr;
>> +struct ib_send_wr   fr_invwr;
> 
> Would it make sense to unionize these as they are guaranteed not to
> execute together? Some people don't like this sort of savings.

I dislike unions because they make the code that uses
them less readable. I can define macros to help that,
but sigh! OK.


--
Chuck Lever






Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-14 Thread Chuck Lever

> On Dec 9, 2015, at 8:45 PM, ira.weiny  wrote:
> 
> On Wed, Dec 09, 2015 at 10:42:35AM -0800, Christoph Hellwig wrote:
>> On Tue, Dec 08, 2015 at 07:52:03PM -0500, ira.weiny wrote:
>>> Searching patchworks...
>>> 
>>> I'm a bit worried about the size of the patch and I would like to see it 
>>> split
>>> up for review.  But I agree Christoph's method is better long term.
>> 
>> I'd be happy to split it up if I could see a way to split it.  So if
>> anyone has an idea you're welcome!
> 
> Well this is a ~3300 line patch which is pretty hard to review in total.
> 
>> 
>>> Christoph, do you have this on github somewhere?  Perhaps it is split but I'm
>>> not finding it on patchworks?
>> 
>> No need for github, we have much better (and older) git hosting sites :)
>> 
>> http://git.infradead.org/users/hch/rdma.git/shortlog/refs/heads/ib_device_attr

Tested-by: Chuck Lever 

With NFS/RDMA client and server.


> Another nice side effect of this patch is to get rid of all the struct
> ib_device_attr allocations which are littered all over the ULPs.
> 
> For the core, srp, ipoib, qib, hfi1 bits.  Generally the rest looks fine I 
> just
> did not have time to really go through it line by line.
> 
> Reviewed-by: Ira Weiny 
> 
> Doug this is going to conflict with the rdmavt work.  So if you take this 
> could
> you respond on the list.


--
Chuck Lever
chuckle...@gmail.com





[PATCH v4 08/11] svcrdma: Remove last two __GFP_NOFAIL call sites

2015-12-14 Thread Chuck Lever
Clean up.

These functions can otherwise fail, so check for page allocation
failures too.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c|5 -
 net/sunrpc/xprtrdma/svc_rdma_transport.c |4 +++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 2d3d7a4..9221086 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -605,7 +605,10 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
inline_bytes = rqstp->rq_res.len;
 
/* Create the RDMA response header */
-   res_page = alloc_page(GFP_KERNEL | __GFP_NOFAIL);
+   ret = -ENOMEM;
+   res_page = alloc_page(GFP_KERNEL);
+   if (!res_page)
+   goto err0;
rdma_resp = page_address(res_page);
reply_ary = svc_rdma_get_reply_array(rdma_argp);
if (reply_ary)
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 14b692d..694ade4 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1445,7 +1445,9 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, 
struct rpcrdma_msg *rmsgp,
int length;
int ret;
 
-   p = alloc_page(GFP_KERNEL | __GFP_NOFAIL);
+   p = alloc_page(GFP_KERNEL);
+   if (!p)
+   return;
va = page_address(p);
 
/* XDR encode error */



[PATCH v4 11/11] svcrdma: Add class for RDMA backwards direction transport

2015-12-14 Thread Chuck Lever
To support the server-side of an NFSv4.1 backchannel on RDMA
connections, add a transport class that enables backward
direction messages on an existing forward channel connection.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h|5 
 net/sunrpc/xprt.c  |1 
 net/sunrpc/xprtrdma/Makefile   |2 
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |  371 
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c|   52 
 net/sunrpc/xprtrdma/svc_rdma_transport.c   |   14 +
 net/sunrpc/xprtrdma/transport.c|   30 +-
 net/sunrpc/xprtrdma/xprt_rdma.h|   15 +
 8 files changed, 475 insertions(+), 15 deletions(-)
 create mode 100644 net/sunrpc/xprtrdma/svc_rdma_backchannel.c

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 9a2c418..b13513a 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -195,6 +195,11 @@ struct svcxprt_rdma {
 
 #define RPCSVC_MAXPAYLOAD_RDMA RPCSVC_MAXPAYLOAD
 
+/* svc_rdma_backchannel.c */
+extern int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt,
+   struct rpcrdma_msg *rmsgp,
+   struct xdr_buf *rcvbuf);
+
 /* svc_rdma_marshal.c */
 extern int svc_rdma_xdr_decode_req(struct rpcrdma_msg **, struct svc_rqst *);
 extern int svc_rdma_xdr_encode_error(struct svcxprt_rdma *,
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 2e98f4a..37edea6 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1425,3 +1425,4 @@ void xprt_put(struct rpc_xprt *xprt)
if (atomic_dec_and_test(&xprt->count))
xprt_destroy(xprt);
 }
+EXPORT_SYMBOL_GPL(xprt_put);
diff --git a/net/sunrpc/xprtrdma/Makefile b/net/sunrpc/xprtrdma/Makefile
index 33f99d3..dc9f3b5 100644
--- a/net/sunrpc/xprtrdma/Makefile
+++ b/net/sunrpc/xprtrdma/Makefile
@@ -2,7 +2,7 @@ obj-$(CONFIG_SUNRPC_XPRT_RDMA) += rpcrdma.o
 
 rpcrdma-y := transport.o rpc_rdma.o verbs.o \
fmr_ops.o frwr_ops.o physical_ops.o \
-   svc_rdma.o svc_rdma_transport.o \
+   svc_rdma.o svc_rdma_backchannel.o svc_rdma_transport.o \
svc_rdma_marshal.o svc_rdma_sendto.o svc_rdma_recvfrom.o \
module.o
 rpcrdma-$(CONFIG_SUNRPC_BACKCHANNEL) += backchannel.o
diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c 
b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
new file mode 100644
index 000..417cec1
--- /dev/null
+++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
@@ -0,0 +1,371 @@
+/*
+ * Copyright (c) 2015 Oracle.  All rights reserved.
+ *
+ * Support for backward direction RPCs on RPC/RDMA (server-side).
+ */
+
+#include 
+#include "xprt_rdma.h"
+
+#define RPCDBG_FACILITY    RPCDBG_SVCXPRT
+
+#undef SVCRDMA_BACKCHANNEL_DEBUG
+
+int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
+struct xdr_buf *rcvbuf)
+{
+   struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+   struct kvec *dst, *src = &rcvbuf->head[0];
+   struct rpc_rqst *req;
+   unsigned long cwnd;
+   u32 credits;
+   size_t len;
+   __be32 xid;
+   __be32 *p;
+   int ret;
+
+   p = (__be32 *)src->iov_base;
+   len = src->iov_len;
+   xid = rmsgp->rm_xid;
+
+#ifdef SVCRDMA_BACKCHANNEL_DEBUG
+   pr_info("%s: xid=%08x, length=%zu\n",
+   __func__, be32_to_cpu(xid), len);
+   pr_info("%s: RPC/RDMA: %*ph\n",
+   __func__, (int)RPCRDMA_HDRLEN_MIN, rmsgp);
+   pr_info("%s:  RPC: %*ph\n",
+   __func__, (int)len, p);
+#endif
+
+   ret = -EAGAIN;
+   if (src->iov_len < 24)
+   goto out_shortreply;
+
+   spin_lock_bh(&xprt->transport_lock);
+   req = xprt_lookup_rqst(xprt, xid);
+   if (!req)
+   goto out_notfound;
+
+   dst = &req->rq_private_buf.head[0];
+   memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf));
+   if (dst->iov_len < len)
+   goto out_unlock;
+   memcpy(dst->iov_base, p, len);
+
+   credits = be32_to_cpu(rmsgp->rm_credit);
+   if (credits == 0)
+   credits = 1;/* don't deadlock */
+   else if (credits > r_xprt->rx_buf.rb_bc_max_requests)
+   credits = r_xprt->rx_buf.rb_bc_max_requests;
+
+   cwnd = xprt->cwnd;
+   xprt->cwnd = credits << RPC_CWNDSHIFT;
+   if (xprt->cwnd > cwnd)
+   xprt_release_rqst_cong(req->rq_task);
+
+   ret = 0;
+   xprt_complete_rqst(req->rq_task, rcvbuf->len);
+   rcvbuf->len = 0;
+
+out_unlock:
+   spin_unlock_bh(&xprt->transport_lock);
+out:
+   return ret;
+
+out_shortreply:
+   dprintk("svcrdma: short bc reply: xprt=%p, len=%zu\n",
+   xprt, src->iov_len);
+   goto o

[PATCH v4 10/11] svcrdma: Define maximum number of backchannel requests

2015-12-14 Thread Chuck Lever
Extra resources for handling backchannel requests have to be
pre-allocated when a transport instance is created. Set up
additional fields in svcxprt_rdma to track these resources.

The max_requests fields are elements of the RPC-over-RDMA
protocol, so they should be u32. To ensure that unsigned
arithmetic is used everywhere, some other fields in the
svcxprt_rdma struct are updated.
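
To illustrate how these fields are meant to combine (example arithmetic only,
using the defaults defined in this patch; not necessarily the exact
accept-path code):

        /* Receive resources must cover both directions, so the RQ depth
         * is the sum of forward and backward credits.
         */
        newxprt->sc_max_requests = svcrdma_max_requests;        /* 32 */
        newxprt->sc_max_bc_requests = svcrdma_max_bc_requests;  /*  2 */
        newxprt->sc_rq_depth = newxprt->sc_max_requests +
                               newxprt->sc_max_bc_requests;     /* 34 recv WRs */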

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h  |   13 ++---
 net/sunrpc/xprtrdma/svc_rdma.c   |6 --
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   24 ++--
 3 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index aeffa30..9a2c418 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -51,6 +51,7 @@
 /* RPC/RDMA parameters and stats */
 extern unsigned int svcrdma_ord;
 extern unsigned int svcrdma_max_requests;
+extern unsigned int svcrdma_max_bc_requests;
 extern unsigned int svcrdma_max_req_size;
 
 extern atomic_t rdma_stat_recv;
@@ -134,10 +135,11 @@ struct svcxprt_rdma {
int  sc_max_sge;
int  sc_max_sge_rd; /* max sge for read target */
 
-   int  sc_sq_depth;   /* Depth of SQ */
atomic_t sc_sq_count;   /* Number of SQ WR on queue */
-
-   int  sc_max_requests;   /* Depth of RQ */
+   unsigned int sc_sq_depth;   /* Depth of SQ */
+   unsigned int sc_rq_depth;   /* Depth of RQ */
+   u32  sc_max_requests;   /* Forward credits */
+   u32  sc_max_bc_requests;/* Backward credits */
int  sc_max_req_size;   /* Size of each RQ WR buf */
 
struct ib_pd *sc_pd;
@@ -186,6 +188,11 @@ struct svcxprt_rdma {
 #define RPCRDMA_MAX_REQUESTS    32
 #define RPCRDMA_MAX_REQ_SIZE    4096
 
+/* Typical ULP usage of BC requests is NFSv4.1 backchannel. Our
+ * current NFSv4.1 implementation supports one backchannel slot.
+ */
+#define RPCRDMA_MAX_BC_REQUESTS    2
+
 #define RPCSVC_MAXPAYLOAD_RDMA RPCSVC_MAXPAYLOAD
 
 /* svc_rdma_marshal.c */
diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
index e894e06..c846ca9 100644
--- a/net/sunrpc/xprtrdma/svc_rdma.c
+++ b/net/sunrpc/xprtrdma/svc_rdma.c
@@ -55,6 +55,7 @@ unsigned int svcrdma_ord = RPCRDMA_ORD;
 static unsigned int min_ord = 1;
 static unsigned int max_ord = 4096;
 unsigned int svcrdma_max_requests = RPCRDMA_MAX_REQUESTS;
+unsigned int svcrdma_max_bc_requests = RPCRDMA_MAX_BC_REQUESTS;
 static unsigned int min_max_requests = 4;
 static unsigned int max_max_requests = 16384;
 unsigned int svcrdma_max_req_size = RPCRDMA_MAX_REQ_SIZE;
@@ -245,9 +246,10 @@ int svc_rdma_init(void)
 {
dprintk("SVCRDMA Module Init, register RPC RDMA transport\n");
dprintk("\tsvcrdma_ord  : %d\n", svcrdma_ord);
-   dprintk("\tmax_requests : %d\n", svcrdma_max_requests);
-   dprintk("\tsq_depth : %d\n",
+   dprintk("\tmax_requests : %u\n", svcrdma_max_requests);
+   dprintk("\tsq_depth : %u\n",
svcrdma_max_requests * RPCRDMA_SQ_DEPTH_MULT);
+   dprintk("\tmax_bc_requests  : %u\n", svcrdma_max_bc_requests);
dprintk("\tmax_inline   : %d\n", svcrdma_max_req_size);
 
svc_rdma_wq = alloc_workqueue("svc_rdma", 0, 0);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 694ade4..35326a3 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -169,12 +169,12 @@ static struct svc_rdma_op_ctxt *alloc_ctxt(struct 
svcxprt_rdma *xprt,
 
 static bool svc_rdma_prealloc_ctxts(struct svcxprt_rdma *xprt)
 {
-   int i;
+   unsigned int i;
 
/* Each RPC/RDMA credit can consume a number of send
 * and receive WQEs. One ctxt is allocated for each.
 */
-   i = xprt->sc_sq_depth + xprt->sc_max_requests;
+   i = xprt->sc_sq_depth + xprt->sc_rq_depth;
 
while (i--) {
struct svc_rdma_op_ctxt *ctxt;
@@ -285,7 +285,7 @@ static struct svc_rdma_req_map *alloc_req_map(gfp_t flags)
 
 static bool svc_rdma_prealloc_maps(struct svcxprt_rdma *xprt)
 {
-   int i;
+   unsigned int i;
 
/* One for each receive buffer on this connection. */
i = xprt->sc_max_requests;
@@ -1016,8 +1016,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt 
*xprt)
struct ib_device *dev;
int uninitialized_var(dma_mr_acc);
int need_dma_mr = 0;
+   unsigned int i;
int ret = 0;
-   int i;
 
listen_rdma = container_of(xprt, struct svcxprt_rdma, sc_xprt);
clear_bit(XPT_CONN, &xprt->xp

[PATCH v4 07/11] svcrdma: Add gfp flags to svc_rdma_post_recv()

2015-12-14 Thread Chuck Lever
svc_rdma_post_recv() allocates pages for receive buffers on-demand.
It uses GFP_KERNEL so the allocator tries hard, and may sleep. But
I'm about to add a call to svc_rdma_post_recv() from a function
that may not sleep.

Since all svc_rdma_post_recv() call sites can tolerate its failure,
allow it to fail if the page allocator returns nothing. Longer term,
receive buffers, being a finite resource per-connection, should be
pre-allocated and re-used.
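
For instance, a hypothetical caller in a context that must not sleep could
pass a more restrictive mask and simply cope with failure:

        ret = svc_rdma_post_recv(rdma, GFP_NOIO);
        if (ret)
                return ret;     /* no receive posted; caller handles it */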

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h  |2 +-
 net/sunrpc/xprtrdma/svc_rdma_sendto.c|2 +-
 net/sunrpc/xprtrdma/svc_rdma_transport.c |8 +---
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 141edbb..729ff35 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -221,7 +221,7 @@ extern struct rpcrdma_read_chunk *
 extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *);
 extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
enum rpcrdma_errcode);
-extern int svc_rdma_post_recv(struct svcxprt_rdma *);
+extern int svc_rdma_post_recv(struct svcxprt_rdma *, gfp_t);
 extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *);
 extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
 extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index b75566c..2d3d7a4 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -472,7 +472,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
int ret;
 
/* Post a recv buffer to handle another request. */
-   ret = svc_rdma_post_recv(rdma);
+   ret = svc_rdma_post_recv(rdma, GFP_KERNEL);
if (ret) {
printk(KERN_INFO
   "svcrdma: could not post a receive buffer, err=%d."
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index ec10ae3..14b692d 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -668,7 +668,7 @@ static struct svcxprt_rdma *rdma_create_xprt(struct 
svc_serv *serv,
return cma_xprt;
 }
 
-int svc_rdma_post_recv(struct svcxprt_rdma *xprt)
+int svc_rdma_post_recv(struct svcxprt_rdma *xprt, gfp_t flags)
 {
struct ib_recv_wr recv_wr, *bad_recv_wr;
struct svc_rdma_op_ctxt *ctxt;
@@ -686,7 +686,9 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt)
pr_err("svcrdma: Too many sges (%d)\n", sge_no);
goto err_put_ctxt;
}
-   page = alloc_page(GFP_KERNEL | __GFP_NOFAIL);
+   page = alloc_page(flags);
+   if (!page)
+   goto err_put_ctxt;
ctxt->pages[sge_no] = page;
pa = ib_dma_map_page(xprt->sc_cm_id->device,
 page, 0, PAGE_SIZE,
@@ -1182,7 +1184,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt 
*xprt)
 
/* Post receive buffers */
for (i = 0; i < newxprt->sc_max_requests; i++) {
-   ret = svc_rdma_post_recv(newxprt);
+   ret = svc_rdma_post_recv(newxprt, GFP_KERNEL);
if (ret) {
dprintk("svcrdma: failure posting receive buffers\n");
goto errout;



[PATCH v4 09/11] svcrdma: Make map_xdr non-static

2015-12-14 Thread Chuck Lever
Pre-requisite to use map_xdr in the backchannel code.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h   |2 ++
 net/sunrpc/xprtrdma/svc_rdma_sendto.c |   14 +++---
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 729ff35..aeffa30 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -213,6 +213,8 @@ extern int rdma_read_chunk_frmr(struct svcxprt_rdma *, 
struct svc_rqst *,
u32, u32, u64, bool);
 
 /* svc_rdma_sendto.c */
+extern int svc_rdma_map_xdr(struct svcxprt_rdma *, struct xdr_buf *,
+   struct svc_rdma_req_map *);
 extern int svc_rdma_sendto(struct svc_rqst *);
 extern struct rpcrdma_read_chunk *
svc_rdma_get_read_chunk(struct rpcrdma_msg *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 9221086..ced3151 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -50,9 +50,9 @@
 
 #define RPCDBG_FACILITY    RPCDBG_SVCXPRT
 
-static int map_xdr(struct svcxprt_rdma *xprt,
-  struct xdr_buf *xdr,
-  struct svc_rdma_req_map *vec)
+int svc_rdma_map_xdr(struct svcxprt_rdma *xprt,
+struct xdr_buf *xdr,
+struct svc_rdma_req_map *vec)
 {
int sge_no;
u32 sge_bytes;
@@ -62,7 +62,7 @@ static int map_xdr(struct svcxprt_rdma *xprt,
 
if (xdr->len !=
(xdr->head[0].iov_len + xdr->page_len + xdr->tail[0].iov_len)) {
-   pr_err("svcrdma: map_xdr: XDR buffer length error\n");
+   pr_err("svcrdma: %s: XDR buffer length error\n", __func__);
return -EIO;
}
 
@@ -97,9 +97,9 @@ static int map_xdr(struct svcxprt_rdma *xprt,
sge_no++;
}
 
-   dprintk("svcrdma: map_xdr: sge_no %d page_no %d "
+   dprintk("svcrdma: %s: sge_no %d page_no %d "
"page_base %u page_len %u head_len %zu tail_len %zu\n",
-   sge_no, page_no, xdr->page_base, xdr->page_len,
+   __func__, sge_no, page_no, xdr->page_base, xdr->page_len,
xdr->head[0].iov_len, xdr->tail[0].iov_len);
 
vec->count = sge_no;
@@ -599,7 +599,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
ctxt = svc_rdma_get_context(rdma);
ctxt->direction = DMA_TO_DEVICE;
vec = svc_rdma_get_req_map(rdma);
-   ret = map_xdr(rdma, &rqstp->rq_res, vec);
+   ret = svc_rdma_map_xdr(rdma, &rqstp->rq_res, vec);
if (ret)
goto err0;
inline_bytes = rqstp->rq_res.len;



[PATCH v4 04/11] svcrdma: Improve allocation of struct svc_rdma_op_ctxt

2015-12-14 Thread Chuck Lever
When the maximum payload size of NFS READ and WRITE was increased
by commit cc9a903d915c ("svcrdma: Change maximum server payload back
to RPCSVC_MAXPAYLOAD"), the size of struct svc_rdma_op_ctxt
increased to over 6KB (on x86_64). That makes allocating one of
these from a kmem_cache more likely to fail in situations when
system memory is exhausted.

Since I'm about to add a caller where this allocation must always
work _and_ it cannot sleep, pre-allocate ctxts for each connection.

Another motivation for this change is that NFSv4.x servers are
required by specification not to drop NFS requests. Pre-allocating
memory resources reduces the likelihood of a drop.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h  |6 +-
 net/sunrpc/xprtrdma/svc_rdma_transport.c |  102 ++
 2 files changed, 94 insertions(+), 14 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index f869807..be2804b 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -69,6 +69,7 @@ extern atomic_t rdma_stat_sq_prod;
  * completes.
  */
 struct svc_rdma_op_ctxt {
+   struct list_head free;
struct svc_rdma_op_ctxt *read_hdr;
struct svc_rdma_fastreg_mr *frmr;
int hdr_count;
@@ -141,7 +142,10 @@ struct svcxprt_rdma {
struct ib_pd *sc_pd;
 
atomic_t sc_dma_used;
-   atomic_t sc_ctxt_used;
+   spinlock_t   sc_ctxt_lock;
+   struct list_head sc_ctxts;
+   int  sc_ctxt_used;
+
struct list_head sc_rq_dto_q;
spinlock_t   sc_rq_dto_lock;
struct ib_qp *sc_qp;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 0783f6e..58ed9f2 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -153,18 +153,76 @@ static void svc_rdma_bc_free(struct svc_xprt *xprt)
 }
 #endif /* CONFIG_SUNRPC_BACKCHANNEL */
 
-struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
+static struct svc_rdma_op_ctxt *alloc_ctxt(struct svcxprt_rdma *xprt,
+  gfp_t flags)
 {
struct svc_rdma_op_ctxt *ctxt;
 
-   ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep,
-   GFP_KERNEL | __GFP_NOFAIL);
-   ctxt->xprt = xprt;
-   INIT_LIST_HEAD(&ctxt->dto_q);
+   ctxt = kmalloc(sizeof(*ctxt), flags);
+   if (ctxt) {
+   ctxt->xprt = xprt;
+   INIT_LIST_HEAD(&ctxt->free);
+   INIT_LIST_HEAD(&ctxt->dto_q);
+   }
+   return ctxt;
+}
+
+static bool svc_rdma_prealloc_ctxts(struct svcxprt_rdma *xprt)
+{
+   int i;
+
+   /* Each RPC/RDMA credit can consume a number of send
+* and receive WQEs. One ctxt is allocated for each.
+*/
+   i = xprt->sc_sq_depth + xprt->sc_max_requests;
+
+   while (i--) {
+   struct svc_rdma_op_ctxt *ctxt;
+
+   ctxt = alloc_ctxt(xprt, GFP_KERNEL);
+   if (!ctxt) {
+   dprintk("svcrdma: No memory for RDMA ctxt\n");
+   return false;
+   }
+   list_add(&ctxt->free, &xprt->sc_ctxts);
+   }
+   return true;
+}
+
+struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
+{
+   struct svc_rdma_op_ctxt *ctxt = NULL;
+
+   spin_lock_bh(&xprt->sc_ctxt_lock);
+   xprt->sc_ctxt_used++;
+   if (list_empty(&xprt->sc_ctxts))
+   goto out_empty;
+
+   ctxt = list_first_entry(&xprt->sc_ctxts,
+   struct svc_rdma_op_ctxt, free);
+   list_del_init(&ctxt->free);
+   spin_unlock_bh(&xprt->sc_ctxt_lock);
+
+out:
ctxt->count = 0;
ctxt->frmr = NULL;
-   atomic_inc(&xprt->sc_ctxt_used);
return ctxt;
+
+out_empty:
+   /* Either pre-allocation missed the mark, or send
+* queue accounting is broken.
+*/
+   spin_unlock_bh(&xprt->sc_ctxt_lock);
+
+   ctxt = alloc_ctxt(xprt, GFP_NOIO);
+   if (ctxt)
+   goto out;
+
+   spin_lock_bh(&xprt->sc_ctxt_lock);
+   xprt->sc_ctxt_used--;
+   spin_unlock_bh(&xprt->sc_ctxt_lock);
+   WARN_ON_ONCE("svcrdma: empty RDMA ctxt list?\n");
+   return NULL;
 }
 
 void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt)
@@ -190,16 +248,29 @@ void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt)
 
 void svc_rdma_put_context(struct svc_rdma_op_ctxt *ctxt, int free_pages)
 {
-   struct svcxprt_rdma *xprt;
+   struct svcxprt_rdma *xprt = ctxt->xprt;
int i;
 
-   xprt = ctxt->xprt;
if (free_pages)
for (i = 0; i < ctxt->count; i+

[PATCH v4 06/11] svcrdma: Remove unused req_map and ctxt kmem_caches

2015-12-14 Thread Chuck Lever
Clean up.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h |1 +
 net/sunrpc/xprtrdma/svc_rdma.c  |   35 ---
 net/sunrpc/xprtrdma/xprt_rdma.h |7 ---
 3 files changed, 1 insertion(+), 42 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 05bf4fe..141edbb 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -242,6 +242,7 @@ extern struct svc_xprt_class svc_rdma_bc_class;
 #endif
 
 /* svc_rdma.c */
+extern struct workqueue_struct *svc_rdma_wq;
 extern int svc_rdma_init(void);
 extern void svc_rdma_cleanup(void);
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
index 1b7051b..e894e06 100644
--- a/net/sunrpc/xprtrdma/svc_rdma.c
+++ b/net/sunrpc/xprtrdma/svc_rdma.c
@@ -71,10 +71,6 @@ atomic_t rdma_stat_rq_prod;
 atomic_t rdma_stat_sq_poll;
 atomic_t rdma_stat_sq_prod;
 
-/* Temporary NFS request map and context caches */
-struct kmem_cache *svc_rdma_map_cachep;
-struct kmem_cache *svc_rdma_ctxt_cachep;
-
 struct workqueue_struct *svc_rdma_wq;
 
 /*
@@ -243,8 +239,6 @@ void svc_rdma_cleanup(void)
svc_unreg_xprt_class(&svc_rdma_bc_class);
 #endif
svc_unreg_xprt_class(&svc_rdma_class);
-   kmem_cache_destroy(svc_rdma_map_cachep);
-   kmem_cache_destroy(svc_rdma_ctxt_cachep);
 }
 
 int svc_rdma_init(void)
@@ -264,39 +258,10 @@ int svc_rdma_init(void)
svcrdma_table_header =
register_sysctl_table(svcrdma_root_table);
 
-   /* Create the temporary map cache */
-   svc_rdma_map_cachep = kmem_cache_create("svc_rdma_map_cache",
-   sizeof(struct svc_rdma_req_map),
-   0,
-   SLAB_HWCACHE_ALIGN,
-   NULL);
-   if (!svc_rdma_map_cachep) {
-   printk(KERN_INFO "Could not allocate map cache.\n");
-   goto err0;
-   }
-
-   /* Create the temporary context cache */
-   svc_rdma_ctxt_cachep =
-   kmem_cache_create("svc_rdma_ctxt_cache",
- sizeof(struct svc_rdma_op_ctxt),
- 0,
- SLAB_HWCACHE_ALIGN,
- NULL);
-   if (!svc_rdma_ctxt_cachep) {
-   printk(KERN_INFO "Could not allocate WR ctxt cache.\n");
-   goto err1;
-   }
-
/* Register RDMA with the SVC transport switch */
svc_reg_xprt_class(&svc_rdma_class);
 #if defined(CONFIG_SUNRPC_BACKCHANNEL)
svc_reg_xprt_class(&svc_rdma_bc_class);
 #endif
return 0;
- err1:
-   kmem_cache_destroy(svc_rdma_map_cachep);
- err0:
-   unregister_sysctl_table(svcrdma_table_header);
-   destroy_workqueue(svc_rdma_wq);
-   return -ENOMEM;
 }
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 4197191..72276c7 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -528,11 +528,4 @@ void xprt_rdma_bc_free_rqst(struct rpc_rqst *);
 void xprt_rdma_bc_destroy(struct rpc_xprt *, unsigned int);
 #endif /* CONFIG_SUNRPC_BACKCHANNEL */
 
-/* Temporary NFS request map cache. Created in svc_rdma.c  */
-extern struct kmem_cache *svc_rdma_map_cachep;
-/* WR context cache. Created in svc_rdma.c  */
-extern struct kmem_cache *svc_rdma_ctxt_cachep;
-/* Workqueue created in svc_rdma.c */
-extern struct workqueue_struct *svc_rdma_wq;
-
 #endif /* _LINUX_SUNRPC_XPRT_RDMA_H */



[PATCH v4 05/11] svcrdma: Improve allocation of struct svc_rdma_req_map

2015-12-14 Thread Chuck Lever
To ensure this allocation cannot fail and will not sleep,
pre-allocate the req_map structures per-connection.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h  |8 ++-
 net/sunrpc/xprtrdma/svc_rdma_sendto.c|6 +-
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   85 ++
 3 files changed, 84 insertions(+), 15 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index be2804b..05bf4fe 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -113,6 +113,7 @@ struct svc_rdma_fastreg_mr {
struct list_head frmr_list;
 };
 struct svc_rdma_req_map {
+   struct list_head free;
unsigned long count;
union {
struct kvec sge[RPCSVC_MAXPAGES];
@@ -145,6 +146,8 @@ struct svcxprt_rdma {
spinlock_t   sc_ctxt_lock;
struct list_head sc_ctxts;
int  sc_ctxt_used;
+   spinlock_t   sc_map_lock;
+   struct list_head sc_maps;
 
struct list_head sc_rq_dto_q;
spinlock_t   sc_rq_dto_lock;
@@ -223,8 +226,9 @@ extern int svc_rdma_create_listen(struct svc_serv *, int, 
struct sockaddr *);
 extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
 extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
 extern void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt);
-extern struct svc_rdma_req_map *svc_rdma_get_req_map(void);
-extern void svc_rdma_put_req_map(struct svc_rdma_req_map *);
+extern struct svc_rdma_req_map *svc_rdma_get_req_map(struct svcxprt_rdma *);
+extern void svc_rdma_put_req_map(struct svcxprt_rdma *,
+struct svc_rdma_req_map *);
 extern struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *);
 extern void svc_rdma_put_frmr(struct svcxprt_rdma *,
  struct svc_rdma_fastreg_mr *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index bad5eaa..b75566c 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -598,7 +598,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
/* Build an req vec for the XDR */
ctxt = svc_rdma_get_context(rdma);
ctxt->direction = DMA_TO_DEVICE;
-   vec = svc_rdma_get_req_map();
+   vec = svc_rdma_get_req_map(rdma);
ret = map_xdr(rdma, &rqstp->rq_res, vec);
if (ret)
goto err0;
@@ -637,14 +637,14 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
 
ret = send_reply(rdma, rqstp, res_page, rdma_resp, ctxt, vec,
 inline_bytes);
-   svc_rdma_put_req_map(vec);
+   svc_rdma_put_req_map(rdma, vec);
dprintk("svcrdma: send_reply returns %d\n", ret);
return ret;
 
  err1:
put_page(res_page);
  err0:
-   svc_rdma_put_req_map(vec);
+   svc_rdma_put_req_map(rdma, vec);
svc_rdma_put_context(ctxt, 0);
return ret;
 }
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 58ed9f2..ec10ae3 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -273,23 +273,83 @@ static void svc_rdma_destroy_ctxts(struct svcxprt_rdma 
*xprt)
}
 }
 
-/*
- * Temporary NFS req mappings are shared across all transport
- * instances. These are short lived and should be bounded by the number
- * of concurrent server threads * depth of the SQ.
- */
-struct svc_rdma_req_map *svc_rdma_get_req_map(void)
+static struct svc_rdma_req_map *alloc_req_map(gfp_t flags)
 {
struct svc_rdma_req_map *map;
-   map = kmem_cache_alloc(svc_rdma_map_cachep,
-  GFP_KERNEL | __GFP_NOFAIL);
+
+   map = kmalloc(sizeof(*map), flags);
+   if (map)
+   INIT_LIST_HEAD(&map->free);
+   return map;
+}
+
+static bool svc_rdma_prealloc_maps(struct svcxprt_rdma *xprt)
+{
+   int i;
+
+   /* One for each receive buffer on this connection. */
+   i = xprt->sc_max_requests;
+
+   while (i--) {
+   struct svc_rdma_req_map *map;
+
+   map = alloc_req_map(GFP_KERNEL);
+   if (!map) {
+   dprintk("svcrdma: No memory for request map\n");
+   return false;
+   }
+   list_add(&map->free, &xprt->sc_maps);
+   }
+   return true;
+}
+
+struct svc_rdma_req_map *svc_rdma_get_req_map(struct svcxprt_rdma *xprt)
+{
+   struct svc_rdma_req_map *map = NULL;
+
+   spin_lock(&xprt->sc_map_lock);
+   if (list_empty(&xprt->sc_maps))
+   goto out_empty;
+
+   map = list_first_entry(&xprt->sc_maps,
+  struct svc_rdma_req_map, free);
+   list_del_init(&map->free);
+   s

[PATCH v4 03/11] svcrdma: Clean up process_context()

2015-12-14 Thread Chuck Lever
Be sure the completed ctxt is put in every path.

The xprt enqueue can take a while, so put the completed ctxt back
in circulation _before_ enqueuing the xprt.

Remove/disable debugging.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   44 ++
 1 file changed, 21 insertions(+), 23 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 27f338a..0783f6e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -386,46 +386,44 @@ static void rq_cq_reap(struct svcxprt_rdma *xprt)
 static void process_context(struct svcxprt_rdma *xprt,
struct svc_rdma_op_ctxt *ctxt)
 {
+   struct svc_rdma_op_ctxt *read_hdr;
+   int free_pages = 0;
+
svc_rdma_unmap_dma(ctxt);
 
switch (ctxt->wr_op) {
case IB_WR_SEND:
-   if (ctxt->frmr)
-   pr_err("svcrdma: SEND: ctxt->frmr != NULL\n");
-   svc_rdma_put_context(ctxt, 1);
+   free_pages = 1;
break;
 
case IB_WR_RDMA_WRITE:
-   if (ctxt->frmr)
-   pr_err("svcrdma: WRITE: ctxt->frmr != NULL\n");
-   svc_rdma_put_context(ctxt, 0);
break;
 
case IB_WR_RDMA_READ:
case IB_WR_RDMA_READ_WITH_INV:
svc_rdma_put_frmr(xprt, ctxt->frmr);
-   if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) {
-   struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr;
-   if (read_hdr) {
-   spin_lock_bh(&xprt->sc_rq_dto_lock);
-   set_bit(XPT_DATA, &xprt->sc_xprt.xpt_flags);
-   list_add_tail(&read_hdr->dto_q,
- &xprt->sc_read_complete_q);
-   spin_unlock_bh(&xprt->sc_rq_dto_lock);
-   } else {
-   pr_err("svcrdma: ctxt->read_hdr == NULL\n");
-   }
-   svc_xprt_enqueue(&xprt->sc_xprt);
-   }
+
+   if (!test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags))
+   break;
+
+   read_hdr = ctxt->read_hdr;
svc_rdma_put_context(ctxt, 0);
-   break;
+
+   spin_lock_bh(&xprt->sc_rq_dto_lock);
+   set_bit(XPT_DATA, &xprt->sc_xprt.xpt_flags);
+   list_add_tail(&read_hdr->dto_q,
+ &xprt->sc_read_complete_q);
+   spin_unlock_bh(&xprt->sc_rq_dto_lock);
+   svc_xprt_enqueue(&xprt->sc_xprt);
+   return;
 
default:
-   printk(KERN_ERR "svcrdma: unexpected completion type, "
-  "opcode=%d\n",
-  ctxt->wr_op);
+   dprintk("svcrdma: unexpected completion opcode=%d\n",
+   ctxt->wr_op);
break;
}
+
+   svc_rdma_put_context(ctxt, free_pages);
 }
 
 /*



[PATCH v4 01/11] svcrdma: Do not send XDR roundup bytes for a write chunk

2015-12-14 Thread Chuck Lever
Minor optimization: when dealing with write chunk XDR roundup, do
not post a Write WR for the zero bytes in the pad. Simply update
the write segment in the RPC-over-RDMA header to reflect the extra
pad bytes.

The Reply chunk is also a write chunk, but the server does not use
send_write_chunks() to send the Reply chunk. That's OK in this case:
the server Upper Layer typically marshals the Reply chunk contents
in a single contiguous buffer, without a separate tail for the XDR
pad.

The comments and the variable naming refer to "chunks" but what is
really meant is "segments." The existing code sends only one
xdr_write_chunk per RPC reply.

The fix assumes this as well. When the XDR pad in the first write
chunk is reached, the assumption is the Write list is complete and
send_write_chunks() returns.

That will remain a valid assumption until the server Upper Layer can
support multiple bulk payload results per RPC.
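
A small worked example of the case being skipped (illustrative numbers only):

        /* A 35-byte payload is XDR-padded to 36 bytes; the final write
         * segment then covers only the 1-byte pad, so write_len < 4 and
         * no RDMA Write is posted for it.
         */
        size_t payload = 35;
        size_t rounded = XDR_QUADLEN(payload) << 2;     /* 36 */
        size_t pad     = rounded - payload;             /* 1 */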

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c |7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 969a1ab..bad5eaa 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
arg_ch->rs_handle,
arg_ch->rs_offset,
write_len);
+
+   /* Do not send XDR pad bytes */
+   if (chunk_no && write_len < 4) {
+   chunk_no++;
+   break;
+   }
+
chunk_off = 0;
while (write_len) {
ret = send_write(xprt, rqstp,



[PATCH v4 00/11] NFS/RDMA server patches for v4.5

2015-12-14 Thread Chuck Lever
Here are patches to support server-side bi-directional RPC/RDMA
operation (to enable NFSv4.1 on RPC/RDMA transports). Thanks to
all who reviewed v1, v2, and v3. This version has some significant
changes since the previous one.

In preparation for Doug's final topic branch, Bruce, I've rebased
these on Christoph's ib_device_attr branch. There were some merge
conflicts which I've fixed and tested. These are ready for your
review.

Also available in the "nfsd-rdma-for-4.5" topic branch of this git repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git

Or for browsing:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-for-4.5


Changes since v3:
- Rebased on Christoph's ib_device_attr branch
- Backchannel patches have been squashed together
- Memory allocation overhaul to prevent blocking allocation
  when sending backchannel calls


Changes since v2:
- Rebased on v4.4-rc4
- Backchannel code in new source file to address dprintk issues
- svc_rdma_get_context() now uses a pre-allocated cache
- Dropped svc_rdma_send clean up


Changes since v1:

- Rebased on v4.4-rc3
- Removed the use of CONFIG_SUNRPC_BACKCHANNEL
- Fixed computation of forward and backward max_requests
- Updated some comments and patch descriptions
- pr_err and pr_info converted to dprintk
- Simplified svc_rdma_get_context()
- Dropped patch removing access_flags field
- NFSv4.1 callbacks tested with for-4.5 client

---

Chuck Lever (11):
  svcrdma: Do not send XDR roundup bytes for a write chunk
  svcrdma: Clean up rdma_create_xprt()
  svcrdma: Clean up process_context()
  svcrdma: Improve allocation of struct svc_rdma_op_ctxt
  svcrdma: Improve allocation of struct svc_rdma_req_map
  svcrdma: Remove unused req_map and ctxt kmem_caches
  svcrdma: Add gfp flags to svc_rdma_post_recv()
  svcrdma: Remove last two __GFP_NOFAIL call sites
  svcrdma: Make map_xdr non-static
  svcrdma: Define maximum number of backchannel requests
  svcrdma: Add class for RDMA backwards direction transport


 include/linux/sunrpc/svc_rdma.h|   37 ++-
 net/sunrpc/xprt.c  |1 
 net/sunrpc/xprtrdma/Makefile   |2 
 net/sunrpc/xprtrdma/svc_rdma.c |   41 ---
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |  371 
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c|   52 
 net/sunrpc/xprtrdma/svc_rdma_sendto.c  |   34 ++-
 net/sunrpc/xprtrdma/svc_rdma_transport.c   |  284 -
 net/sunrpc/xprtrdma/transport.c|   30 +-
 net/sunrpc/xprtrdma/xprt_rdma.h|   20 +-
 10 files changed, 730 insertions(+), 142 deletions(-)
 create mode 100644 net/sunrpc/xprtrdma/svc_rdma_backchannel.c

--
Signature


[PATCH v4 02/11] svcrdma: Clean up rdma_create_xprt()

2015-12-14 Thread Chuck Lever
kzalloc is used here, so setting the atomic fields to zero is
unnecessary. sc_ord is set again in handle_connect_req. The other
fields are re-initialized in svc_rdma_accept().

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/svc_rdma_transport.c |9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 9f3eb89..27f338a 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -529,14 +529,6 @@ static struct svcxprt_rdma *rdma_create_xprt(struct 
svc_serv *serv,
spin_lock_init(&cma_xprt->sc_rq_dto_lock);
spin_lock_init(&cma_xprt->sc_frmr_q_lock);
 
-   cma_xprt->sc_ord = svcrdma_ord;
-
-   cma_xprt->sc_max_req_size = svcrdma_max_req_size;
-   cma_xprt->sc_max_requests = svcrdma_max_requests;
-   cma_xprt->sc_sq_depth = svcrdma_max_requests * RPCRDMA_SQ_DEPTH_MULT;
-   atomic_set(&cma_xprt->sc_sq_count, 0);
-   atomic_set(&cma_xprt->sc_ctxt_used, 0);
-
if (listener)
set_bit(XPT_LISTENER, &cma_xprt->sc_xprt.xpt_flags);
 
@@ -918,6 +910,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt 
*xprt)
  (size_t)RPCSVC_MAXPAGES);
newxprt->sc_max_sge_rd = min_t(size_t, dev->max_sge_rd,
   RPCSVC_MAXPAGES);
+   newxprt->sc_max_req_size = svcrdma_max_req_size;
newxprt->sc_max_requests = min((size_t)dev->max_qp_wr,
   (size_t)svcrdma_max_requests);
newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_max_requests;



[PATCH v3 10/11] xprtrdma: Invalidate in the RPC reply handler

2015-12-14 Thread Chuck Lever
There is a window between the time the RPC reply handler wakes the
waiting RPC task and when xprt_release() invokes ops->buf_free.
During this time, memory regions containing the data payload may
still be accessed by a broken or malicious server, but the RPC
application has already been allowed access to the memory containing
the RPC request's data payloads.

The server should be fenced from client memory containing RPC data
payloads _before_ the RPC application is allowed to continue.

This change also more strongly enforces send queue accounting. There
is a maximum number of RPC calls allowed to be outstanding. When an
RPC/RDMA transport is set up, just enough send queue resources are
allocated to handle registration, Send, and invalidation WRs for
each of those RPCs at the same time.

Before, additional RPC calls could be dispatched while invalidation
WRs were still consuming send WQEs. When invalidation WRs backed
up, dispatching additional RPCs resulted in a send queue overrun.

Now, the reply handler prevents RPC dispatch until invalidation is
complete. This prevents RPC call dispatch until there are enough
send queue resources to proceed.

Still to do: If an RPC exits early (say, ^C), the reply handler has
no opportunity to perform invalidation. Currently, xprt_rdma_free()
still frees remaining RDMA resources, which could deadlock.
Additional changes are needed to handle invalidation properly in this
case.

Reported-by: Jason Gunthorpe 
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/rpc_rdma.c |   10 ++
 1 file changed, 10 insertions(+)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 0bc8c39..3d00c5d 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -891,6 +891,16 @@ badheader:
break;
}
 
+   /* Invalidate and flush the data payloads before waking the
+* waiting application. This guarantees the memory region is
+* properly fenced from the server before the application
+* accesses the data. It also ensures proper send flow
+* control: waking the next RPC waits until this RPC has
+* relinquished all its Send Queue entries.
+*/
+   if (req->rl_nchunks)
+   r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);
+
credits = be32_to_cpu(headerp->rm_credit);
if (credits == 0)
credits = 1;/* don't deadlock */



[PATCH v3 11/11] xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').

2015-12-14 Thread Chuck Lever
The root of the problem was that sends (especially unsignalled
FASTREG and LOCAL_INV Work Requests) were not properly flow-
controlled, which allowed a send queue overrun.

Now that the RPC/RDMA reply handler waits for invalidation to
complete, the send queue is properly flow-controlled. Thus this
limit is no longer necessary.
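
To illustrate the effect of the revert (example numbers, not taken from the
patch):

        /* With a send queue of 128 WRs, the completion trigger now scales
         * with the queue instead of being capped at 32 unsignalled Sends:
         */
        unsigned int max_send_wr = 128;                 /* example only */
        unsigned int rep_cqinit  = max_send_wr / 2 - 1; /* 63 */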

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |6 ++
 net/sunrpc/xprtrdma/xprt_rdma.h |6 --
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index f23f3d6..1867e3a 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -608,10 +608,8 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia 
*ia,
 
/* set trigger for requesting send completion */
ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
-   if (ep->rep_cqinit > RPCRDMA_MAX_UNSIGNALED_SENDS)
-   ep->rep_cqinit = RPCRDMA_MAX_UNSIGNALED_SENDS;
-   else if (ep->rep_cqinit <= 2)
-   ep->rep_cqinit = 0;
+   if (ep->rep_cqinit <= 2)
+   ep->rep_cqinit = 0; /* always signal? */
INIT_CQCOUNT(ep);
init_waitqueue_head(&ep->rep_connect_wait);
INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 089a7db..ba3bc3f 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -87,12 +87,6 @@ struct rpcrdma_ep {
struct delayed_work rep_connect_worker;
 };
 
-/*
- * Force a signaled SEND Work Request every so often,
- * in case the provider needs to do some housekeeping.
- */
-#define RPCRDMA_MAX_UNSIGNALED_SENDS   (32)
-
 #define INIT_CQCOUNT(ep) atomic_set(&(ep)->rep_cqcount, (ep)->rep_cqinit)
 #define DECR_CQCOUNT(ep) atomic_sub_return(1, &(ep)->rep_cqcount)
 



[PATCH v3 09/11] SUNRPC: Introduce xprt_commit_rqst()

2015-12-14 Thread Chuck Lever
I'm about to add code in the RPC/RDMA reply handler between the
xprt_lookup_rqst() and xprt_complete_rqst() call site that needs
to execute outside of spinlock critical sections.

Add a hook to remove an rpc_rqst from the pending list once
the transport knows it's going to invoke xprt_complete_rqst().

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/xprt.h|1 +
 net/sunrpc/xprt.c  |   14 ++
 net/sunrpc/xprtrdma/rpc_rdma.c |4 
 3 files changed, 19 insertions(+)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 69ef5b3..ab6c3a5 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -366,6 +366,7 @@ void
xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action);
 void   xprt_write_space(struct rpc_xprt *xprt);
 void   xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task 
*task, int result);
 struct rpc_rqst *  xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
+void   xprt_commit_rqst(struct rpc_task *task);
 void   xprt_complete_rqst(struct rpc_task *task, int copied);
 void   xprt_release_rqst_cong(struct rpc_task *task);
 void   xprt_disconnect_done(struct rpc_xprt *xprt);
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 2e98f4a..a5be4ab 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task)
 }
 
 /**
+ * xprt_commit_rqst - remove rqst from pending list early
+ * @task: RPC request to remove
+ *
+ * Caller holds transport lock.
+ */
+void xprt_commit_rqst(struct rpc_task *task)
+{
+   struct rpc_rqst *req = task->tk_rqstp;
+
+   list_del_init(&req->rq_list);
+}
+EXPORT_SYMBOL_GPL(xprt_commit_rqst);
+
+/**
  * xprt_complete_rqst - called when reply processing is complete
  * @task: RPC request that recently completed
  * @copied: actual number of bytes received from the transport
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c10d969..0bc8c39 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -804,6 +804,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
if (req->rl_reply)
goto out_duplicate;
 
+   xprt_commit_rqst(rqst->rq_task);
+   spin_unlock_bh(&xprt->transport_lock);
+
dprintk("RPC:   %s: reply 0x%p completes request 0x%p\n"
"   RPC request 0x%p xid 0x%08x\n",
__func__, rep, req, rqst,
@@ -894,6 +897,7 @@ badheader:
else if (credits > r_xprt->rx_buf.rb_max_requests)
credits = r_xprt->rx_buf.rb_max_requests;
 
+   spin_lock_bh(&xprt->transport_lock);
cwnd = xprt->cwnd;
xprt->cwnd = credits << RPC_CWNDSHIFT;
if (xprt->cwnd > cwnd)



[PATCH v3 08/11] xprtrdma: Add ro_unmap_sync method for all-physical registration

2015-12-14 Thread Chuck Lever
physical's ro_unmap is synchronous already. The new ro_unmap_sync
method just has to DMA unmap all MRs associated with the RPC
request.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/physical_ops.c |   13 +
 1 file changed, 13 insertions(+)

diff --git a/net/sunrpc/xprtrdma/physical_ops.c 
b/net/sunrpc/xprtrdma/physical_ops.c
index 617b76f..dbb302e 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -83,6 +83,18 @@ physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg)
return 1;
 }
 
+/* DMA unmap all memory regions that were mapped for "req".
+ */
+static void
+physical_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+   struct ib_device *device = r_xprt->rx_ia.ri_device;
+   unsigned int i;
+
+   for (i = 0; req->rl_nchunks; --req->rl_nchunks)
+   rpcrdma_unmap_one(device, &req->rl_segments[i++]);
+}
+
 static void
 physical_op_destroy(struct rpcrdma_buffer *buf)
 {
@@ -90,6 +102,7 @@ physical_op_destroy(struct rpcrdma_buffer *buf)
 
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
.ro_map = physical_op_map,
+   .ro_unmap_sync  = physical_op_unmap_sync,
.ro_unmap   = physical_op_unmap,
.ro_open= physical_op_open,
.ro_maxpages= physical_op_maxpages,



[PATCH v3 07/11] xprtrdma: Add ro_unmap_sync method for FMR

2015-12-14 Thread Chuck Lever
FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
is a synchronous verb. However, some improvements can be made here.

1. Gather all the MRs for the RPC request onto a list, and invoke
   ib_unmap_fmr() once with that list. This reduces the number of
   doorbells when there is more than one MR to invalidate

2. Perform the DMA unmap _after_ the MRs are unmapped, not before.
   This is critical after invalidating a Write chunk.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c |   64 +
 1 file changed, 64 insertions(+)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index f1e8daf..c14f3a4 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -179,6 +179,69 @@ out_maperr:
return rc;
 }
 
+static void
+__fmr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+   struct ib_device *device = r_xprt->rx_ia.ri_device;
+   struct rpcrdma_mw *mw = seg->rl_mw;
+   int nsegs = seg->mr_nsegs;
+
+   seg->rl_mw = NULL;
+
+   while (nsegs--)
+   rpcrdma_unmap_one(device, seg++);
+
+   rpcrdma_put_mw(r_xprt, mw);
+}
+
+/* Invalidate all memory regions that were registered for "req".
+ *
+ * Sleeps until it is safe for the host CPU to access the
+ * previously mapped memory regions.
+ */
+static void
+fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+   struct rpcrdma_mr_seg *seg;
+   unsigned int i, nchunks;
+   struct rpcrdma_mw *mw;
+   LIST_HEAD(unmap_list);
+   int rc;
+
+   dprintk("RPC:   %s: req %p\n", __func__, req);
+
+   /* ORDER: Invalidate all of the req's MRs first
+*
+* ib_unmap_fmr() is slow, so use a single call instead
+* of one call per mapped MR.
+*/
+   for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+   seg = &req->rl_segments[i];
+   mw = seg->rl_mw;
+
+   list_add(&mw->r.fmr.fmr->list, &unmap_list);
+
+   i += seg->mr_nsegs;
+   }
+   rc = ib_unmap_fmr(&unmap_list);
+   if (rc)
+   pr_warn("%s: ib_unmap_fmr failed (%i)\n", __func__, rc);
+
+   /* ORDER: Now DMA unmap all of the req's MRs, and return
+* them to the free MW list.
+*/
+   for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+   seg = &req->rl_segments[i];
+
+   __fmr_dma_unmap(r_xprt, seg);
+
+   i += seg->mr_nsegs;
+   seg->mr_nsegs = 0;
+   }
+
+   req->rl_nchunks = 0;
+}
+
 /* Use the ib_unmap_fmr() verb to prevent further remote
  * access via RDMA READ or RDMA WRITE.
  */
@@ -231,6 +294,7 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)
 
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
+   .ro_unmap_sync  = fmr_op_unmap_sync,
.ro_unmap   = fmr_op_unmap,
.ro_open= fmr_op_open,
.ro_maxpages= fmr_op_maxpages,



[PATCH v3 04/11] xprtrdma: Move struct ib_send_wr off the stack

2015-12-14 Thread Chuck Lever
For FRWR FASTREG and LOCAL_INV, move the ib_*_wr structure off
the stack. This allows frwr_op_map and frwr_op_unmap to chain
WRs together without limit to register or invalidate a set of MRs
with a single ib_post_send().

(This will be for chaining LOCAL_INV requests).

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/frwr_ops.c  |   38 --
 net/sunrpc/xprtrdma/xprt_rdma.h |2 ++
 2 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index ae2a241..660d0b6 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -318,7 +318,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
struct rpcrdma_mw *mw;
struct rpcrdma_frmr *frmr;
struct ib_mr *mr;
-   struct ib_reg_wr reg_wr;
+   struct ib_reg_wr *reg_wr;
struct ib_send_wr *bad_wr;
int rc, i, n, dma_nents;
u8 key;
@@ -335,6 +335,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
frmr = &mw->r.frmr;
frmr->fr_state = FRMR_IS_VALID;
mr = frmr->fr_mr;
+   reg_wr = &frmr->fr_regwr;
 
if (nsegs > ia->ri_max_frmr_depth)
nsegs = ia->ri_max_frmr_depth;
@@ -380,19 +381,19 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
key = (u8)(mr->rkey & 0x00FF);
ib_update_fast_reg_key(mr, ++key);
 
-   reg_wr.wr.next = NULL;
-   reg_wr.wr.opcode = IB_WR_REG_MR;
-   reg_wr.wr.wr_id = (uintptr_t)mw;
-   reg_wr.wr.num_sge = 0;
-   reg_wr.wr.send_flags = 0;
-   reg_wr.mr = mr;
-   reg_wr.key = mr->rkey;
-   reg_wr.access = writing ?
-   IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
-   IB_ACCESS_REMOTE_READ;
+   reg_wr->wr.next = NULL;
+   reg_wr->wr.opcode = IB_WR_REG_MR;
+   reg_wr->wr.wr_id = (uintptr_t)mw;
+   reg_wr->wr.num_sge = 0;
+   reg_wr->wr.send_flags = 0;
+   reg_wr->mr = mr;
+   reg_wr->key = mr->rkey;
+   reg_wr->access = writing ?
+IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+IB_ACCESS_REMOTE_READ;
 
DECR_CQCOUNT(&r_xprt->rx_ep);
-   rc = ib_post_send(ia->ri_id->qp, ®_wr.wr, &bad_wr);
+   rc = ib_post_send(ia->ri_id->qp, ®_wr->wr, &bad_wr);
if (rc)
goto out_senderr;
 
@@ -422,23 +423,24 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mw *mw = seg1->rl_mw;
struct rpcrdma_frmr *frmr = &mw->r.frmr;
-   struct ib_send_wr invalidate_wr, *bad_wr;
+   struct ib_send_wr *invalidate_wr, *bad_wr;
int rc, nsegs = seg->mr_nsegs;
 
dprintk("RPC:   %s: FRMR %p\n", __func__, mw);
 
seg1->rl_mw = NULL;
frmr->fr_state = FRMR_IS_INVALID;
+   invalidate_wr = &mw->r.frmr.fr_invwr;
 
-   memset(&invalidate_wr, 0, sizeof(invalidate_wr));
-   invalidate_wr.wr_id = (unsigned long)(void *)mw;
-   invalidate_wr.opcode = IB_WR_LOCAL_INV;
-   invalidate_wr.ex.invalidate_rkey = frmr->fr_mr->rkey;
+   memset(invalidate_wr, 0, sizeof(*invalidate_wr));
+   invalidate_wr->wr_id = (uintptr_t)mw;
+   invalidate_wr->opcode = IB_WR_LOCAL_INV;
+   invalidate_wr->ex.invalidate_rkey = frmr->fr_mr->rkey;
DECR_CQCOUNT(&r_xprt->rx_ep);
 
ib_dma_unmap_sg(ia->ri_device, frmr->sg, frmr->sg_nents, seg1->mr_dir);
read_lock(&ia->ri_qplock);
-   rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
+   rc = ib_post_send(ia->ri_id->qp, invalidate_wr, &bad_wr);
read_unlock(&ia->ri_qplock);
if (rc)
goto out_err;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 4197191..e60d817 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -206,6 +206,8 @@ struct rpcrdma_frmr {
enum rpcrdma_frmr_state fr_state;
struct work_struct  fr_work;
struct rpcrdma_xprt *fr_xprt;
+   struct ib_reg_wr        fr_regwr;
+   struct ib_send_wr   fr_invwr;
 };
 
 struct rpcrdma_fmr {



[PATCH v3 06/11] xprtrdma: Add ro_unmap_sync method for FRWR

2015-12-14 Thread Chuck Lever
FRWR's ro_unmap is asynchronous. The new ro_unmap_sync posts
LOCAL_INV Work Requests and waits for them to complete before
returning.

Note also, DMA unmapping is now done _after_ invalidation.
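
In outline, the synchronous wait looks roughly like this (a sketch based on
the fr_waiter/fr_linv_done fields this patch adds; the full function is in
the diff below):

        /* Only the last LOCAL_INV in the chain is signalled; strong send
         * queue ordering means its completion implies all earlier WRs in
         * the chain have completed as well.
         */
        f->fr_invwr.send_flags = IB_SEND_SIGNALED;
        f->fr_waiter = true;
        init_completion(&f->fr_linv_done);

        rc = ib_post_send(ia->ri_id->qp, invalidate_wrs, &bad_wr);
        wait_for_completion(&f->fr_linv_done);  /* woken by frwr_sendcompletion */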

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/frwr_ops.c  |  137 ++-
 net/sunrpc/xprtrdma/xprt_rdma.h |2 +
 2 files changed, 135 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 660d0b6..5b9e41d 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -244,12 +244,14 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
 }
 
-/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs to be reset. */
+/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs
+ * to be reset.
+ *
+ * WARNING: Only wr_id and status are reliable at this point
+ */
 static void
-frwr_sendcompletion(struct ib_wc *wc)
+__frwr_sendcompletion_flush(struct ib_wc *wc, struct rpcrdma_mw *r)
 {
-   struct rpcrdma_mw *r;
-
if (likely(wc->status == IB_WC_SUCCESS))
return;
 
@@ -260,9 +262,23 @@ frwr_sendcompletion(struct ib_wc *wc)
else
pr_warn("RPC:   %s: frmr %p error, status %s (%d)\n",
__func__, r, ib_wc_status_msg(wc->status), wc->status);
+
r->r.frmr.fr_state = FRMR_IS_STALE;
 }
 
+static void
+frwr_sendcompletion(struct ib_wc *wc)
+{
+   struct rpcrdma_mw *r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+   struct rpcrdma_frmr *f = &r->r.frmr;
+
+   if (unlikely(wc->status != IB_WC_SUCCESS))
+   __frwr_sendcompletion_flush(wc, r);
+
+   if (f->fr_waiter)
+   complete(&f->fr_linv_done);
+}
+
 static int
 frwr_op_init(struct rpcrdma_xprt *r_xprt)
 {
@@ -334,6 +350,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
} while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
frmr = &mw->r.frmr;
frmr->fr_state = FRMR_IS_VALID;
+   frmr->fr_waiter = false;
mr = frmr->fr_mr;
reg_wr = &frmr->fr_regwr;
 
@@ -413,6 +430,117 @@ out_senderr:
return rc;
 }
 
+static struct ib_send_wr *
+__frwr_prepare_linv_wr(struct rpcrdma_mr_seg *seg)
+{
+   struct rpcrdma_mw *mw = seg->rl_mw;
+   struct rpcrdma_frmr *f = &mw->r.frmr;
+   struct ib_send_wr *invalidate_wr;
+
+   f->fr_waiter = false;
+   f->fr_state = FRMR_IS_INVALID;
+   invalidate_wr = &f->fr_invwr;
+
+   memset(invalidate_wr, 0, sizeof(*invalidate_wr));
+   invalidate_wr->wr_id = (unsigned long)(void *)mw;
+   invalidate_wr->opcode = IB_WR_LOCAL_INV;
+   invalidate_wr->ex.invalidate_rkey = f->fr_mr->rkey;
+
+   return invalidate_wr;
+}
+
+static void
+__frwr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
+int rc)
+{
+   struct ib_device *device = r_xprt->rx_ia.ri_device;
+   struct rpcrdma_mw *mw = seg->rl_mw;
+   int nsegs = seg->mr_nsegs;
+
+   seg->rl_mw = NULL;
+
+   while (nsegs--)
+   rpcrdma_unmap_one(device, seg++);
+
+   if (!rc)
+   rpcrdma_put_mw(r_xprt, mw);
+   else
+   __frwr_queue_recovery(mw);
+}
+
+/* Invalidate all memory regions that were registered for "req".
+ *
+ * Sleeps until it is safe for the host CPU to access the
+ * previously mapped memory regions.
+ */
+static void
+frwr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+   struct ib_send_wr *invalidate_wrs, *pos, *prev, *bad_wr;
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_mr_seg *seg;
+   unsigned int i, nchunks;
+   struct rpcrdma_frmr *f;
+   int rc;
+
+   dprintk("RPC:   %s: req %p\n", __func__, req);
+
+   /* ORDER: Invalidate all of the req's MRs first
+*
+* Chain the LOCAL_INV Work Requests and post them with
+* a single ib_post_send() call.
+*/
+   invalidate_wrs = pos = prev = NULL;
+   seg = NULL;
+   for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+   seg = &req->rl_segments[i];
+
+   pos = __frwr_prepare_linv_wr(seg);
+
+   if (!invalidate_wrs)
+   invalidate_wrs = pos;
+   else
+   prev->next = pos;
+   prev = pos;
+
+   i += seg->mr_nsegs;
+   }
+   f = &seg->rl_mw->r.frmr;
+
+   /* Strong send queue ordering guarantees that when the
+* last WR in the chain completes, all WRs in the chain
+* are complete.
+*/
+   f->fr_invwr.send_flags = IB_SEND_SIGNALED;
+   f->fr_waiter = true;
+  
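The quoted hunk is cut off at this point. Purely as a sketch of where the function is headed (not the verbatim patch; it reuses only identifiers already visible above, error handling is elided, and the placement of the completion init is an assumption):

	/* Sketch: post the whole LOCAL_INV chain with one verb, then
	 * sleep until the final (signaled) WR completes.  The completion
	 * is the fr_linv_done that frwr_sendcompletion() fires above.
	 */
	init_completion(&f->fr_linv_done);

	read_lock(&ia->ri_qplock);
	rc = ib_post_send(ia->ri_id->qp, invalidate_wrs, &bad_wr);
	read_unlock(&ia->ri_qplock);

	wait_for_completion(&f->fr_linv_done);

	/* ORDER: only now is it safe to DMA unmap the MRs and return
	 * them, which is what __frwr_dma_unmap() above does. */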

[PATCH v3 05/11] xprtrdma: Introduce ro_unmap_sync method

2015-12-14 Thread Chuck Lever
In the current xprtrdma implementation, some memreg strategies
implement ro_unmap synchronously (the MR is knocked down before the
method returns) and some asynchronously (the MR will be knocked down
and returned to the pool in the background).

To guarantee the MR is truly invalid before the RPC consumer is
allowed to resume execution, we need an unmap method that is
always synchronous, invoked from the RPC/RDMA reply handler.

The new method unmaps all MRs for an RPC. The existing ro_unmap
method unmaps only one MR at a time.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/xprt_rdma.h |2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index e60d817..d9f2f65 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -365,6 +365,8 @@ struct rpcrdma_xprt;
 struct rpcrdma_memreg_ops {
int (*ro_map)(struct rpcrdma_xprt *,
  struct rpcrdma_mr_seg *, int, bool);
+   void(*ro_unmap_sync)(struct rpcrdma_xprt *,
+struct rpcrdma_req *);
int (*ro_unmap)(struct rpcrdma_xprt *,
struct rpcrdma_mr_seg *);
int (*ro_open)(struct rpcrdma_ia *,
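
For context, the intended caller is a later patch in this series ("xprtrdma: Invalidate in the RPC reply handler"). A minimal sketch of such a call site (assuming the usual ia->ri_ops indirection used elsewhere in xprtrdma; this hunk is not part of the patch above):

	/* Before the RPC consumer is allowed to resume, knock down
	 * every MR registered for this RPC.
	 */
	if (req->rl_nchunks)
		r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);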



[PATCH v3 02/11] xprtrdma: xprt_rdma_free() must not release backchannel reqs

2015-12-14 Thread Chuck Lever
Preserve any rpcrdma_req that is attached to rpc_rqst's allocated
for the backchannel. Otherwise, after all the pre-allocated
backchannel req's are consumed, incoming backward calls start
writing on freed memory.

Somehow this hunk got lost.

Fixes: f531a5dbc451 ('xprtrdma: Pre-allocate backward rpc_rqst')
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/transport.c |3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 8c545f7..740bddc 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -576,6 +576,9 @@ xprt_rdma_free(void *buffer)
 
rb = container_of(buffer, struct rpcrdma_regbuf, rg_base[0]);
req = rb->rg_owner;
+   if (req->rl_backchannel)
+   return;
+
r_xprt = container_of(req->rl_buffer, struct rpcrdma_xprt, rx_buf);
 
dprintk("RPC:   %s: called on 0x%p\n", __func__, req->rl_reply);



[PATCH v3 03/11] xprtrdma: Disable RPC/RDMA backchannel debugging messages

2015-12-14 Thread Chuck Lever
Clean up.

Fixes: 63cae47005af ('xprtrdma: Handle incoming backward direction')
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/backchannel.c |   16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index 11d2cfb..cd31181 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -15,7 +15,7 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
-#define RPCRDMA_BACKCHANNEL_DEBUG
+#undef RPCRDMA_BACKCHANNEL_DEBUG
 
 static void rpcrdma_bc_free_rqst(struct rpcrdma_xprt *r_xprt,
 struct rpc_rqst *rqst)
@@ -136,6 +136,7 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int reqs)
   __func__);
goto out_free;
}
+   dprintk("RPC:   %s: new rqst %p\n", __func__, rqst);
 
rqst->rq_xprt = &r_xprt->rx_xprt;
INIT_LIST_HEAD(&rqst->rq_list);
@@ -216,12 +217,14 @@ int rpcrdma_bc_marshal_reply(struct rpc_rqst *rqst)
 
rpclen = rqst->rq_svec[0].iov_len;
 
+#ifdef RPCRDMA_BACKCHANNEL_DEBUG
pr_info("RPC:   %s: rpclen %zd headerp 0x%p lkey 0x%x\n",
__func__, rpclen, headerp, rdmab_lkey(req->rl_rdmabuf));
pr_info("RPC:   %s: RPC/RDMA: %*ph\n",
__func__, (int)RPCRDMA_HDRLEN_MIN, headerp);
pr_info("RPC:   %s:  RPC: %*ph\n",
__func__, (int)rpclen, rqst->rq_svec[0].iov_base);
+#endif
 
req->rl_send_iov[0].addr = rdmab_addr(req->rl_rdmabuf);
req->rl_send_iov[0].length = RPCRDMA_HDRLEN_MIN;
@@ -265,6 +268,9 @@ void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
 {
struct rpc_xprt *xprt = rqst->rq_xprt;
 
+   dprintk("RPC:   %s: freeing rqst %p (req %p)\n",
+   __func__, rqst, rpcr_to_rdmar(rqst));
+
smp_mb__before_atomic();
WARN_ON_ONCE(!test_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state));
clear_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state);
@@ -329,9 +335,7 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt,
struct rpc_rqst, rq_bc_pa_list);
list_del(&rqst->rq_bc_pa_list);
spin_unlock(&xprt->bc_pa_lock);
-#ifdef RPCRDMA_BACKCHANNEL_DEBUG
-   pr_info("RPC:   %s: using rqst %p\n", __func__, rqst);
-#endif
+   dprintk("RPC:   %s: using rqst %p\n", __func__, rqst);
 
/* Prepare rqst */
rqst->rq_reply_bytes_recvd = 0;
@@ -351,10 +355,8 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt,
 * direction reply.
 */
req = rpcr_to_rdmar(rqst);
-#ifdef RPCRDMA_BACKCHANNEL_DEBUG
-   pr_info("RPC:   %s: attaching rep %p to req %p\n",
+   dprintk("RPC:   %s: attaching rep %p to req %p\n",
__func__, rep, req);
-#endif
req->rl_reply = rep;
 
/* Defeat the retransmit detection logic in send_request */



[PATCH v3 01/11] xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)

2015-12-14 Thread Chuck Lever
Clean up.

rb_lock critical sections added in rpcrdma_ep_post_extra_recv()
should have first been converted to use normal spin_lock now that
the reply handler is a work queue.

The backchannel set up code should use the appropriate helper
instead of open-coding a rb_recv_bufs list add.

Problem introduced by glib patch re-ordering on my part.

Fixes: f531a5dbc451 ('xprtrdma: Pre-allocate backward rpc_rqst')
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/backchannel.c |6 +-
 net/sunrpc/xprtrdma/verbs.c   |7 +++
 2 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index 2dcb44f..11d2cfb 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -84,9 +84,7 @@ out_fail:
 static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
 unsigned int count)
 {
-   struct rpcrdma_buffer *buffers = &r_xprt->rx_buf;
struct rpcrdma_rep *rep;
-   unsigned long flags;
int rc = 0;
 
while (count--) {
@@ -98,9 +96,7 @@ static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
break;
}
 
-   spin_lock_irqsave(&buffers->rb_lock, flags);
-   list_add(&rep->rr_list, &buffers->rb_recv_bufs);
-   spin_unlock_irqrestore(&buffers->rb_lock, flags);
+   rpcrdma_recv_buffer_put(rep);
}
 
return rc;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 650034b..f23f3d6 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1329,15 +1329,14 @@ rpcrdma_ep_post_extra_recv(struct rpcrdma_xprt *r_xprt, unsigned int count)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_ep *ep = &r_xprt->rx_ep;
struct rpcrdma_rep *rep;
-   unsigned long flags;
int rc;
 
while (count--) {
-   spin_lock_irqsave(&buffers->rb_lock, flags);
+   spin_lock(&buffers->rb_lock);
if (list_empty(&buffers->rb_recv_bufs))
goto out_reqbuf;
rep = rpcrdma_buffer_get_rep_locked(buffers);
-   spin_unlock_irqrestore(&buffers->rb_lock, flags);
+   spin_unlock(&buffers->rb_lock);
 
rc = rpcrdma_ep_post_recv(ia, ep, rep);
if (rc)
@@ -1347,7 +1346,7 @@ rpcrdma_ep_post_extra_recv(struct rpcrdma_xprt *r_xprt, unsigned int count)
return 0;
 
 out_reqbuf:
-   spin_unlock_irqrestore(&buffers->rb_lock, flags);
+   spin_unlock(&buffers->rb_lock);
pr_warn("%s: no extra receive buffers\n", __func__);
return -ENOMEM;
 



[PATCH v3 00/11] NFS/RDMA client patches for 4.5

2015-12-14 Thread Chuck Lever
For 4.5, I'd like to address the send queue accounting and
invalidation/unmap ordering issues Jason brought up a couple of
months ago.

In preparation for Doug's final topic branch, Anna, I've rebased
these on Christoph's ib_device_attr branch, but there were no merge
conflicts or other changes needed. Could you begin preparing these
for linux-next and other final testing and review?

Also available in the "nfs-rdma-for-4.5" topic branch of this git repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git

Or for browsing:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.5


Changes since v2:
- Rebased on Christoph's ib_device_attr branch


Changes since v1:

- Rebased on v4.4-rc3
- Receive buffer safety margin patch dropped
- Backchannel pr_err and pr_info converted to dprintk
- Backchannel spin locks converted to work queue-safe locks
- Fixed premature release of backchannel request buffer
- NFSv4.1 callbacks tested with for-4.5 server

---

Chuck Lever (11):
  xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)
  xprtrdma: xprt_rdma_free() must not release backchannel reqs
  xprtrdma: Disable RPC/RDMA backchannel debugging messages
  xprtrdma: Move struct ib_send_wr off the stack
  xprtrdma: Introduce ro_unmap_sync method
  xprtrdma: Add ro_unmap_sync method for FRWR
  xprtrdma: Add ro_unmap_sync method for FMR
  xprtrdma: Add ro_unmap_sync method for all-physical registration
  SUNRPC: Introduce xprt_commit_rqst()
  xprtrdma: Invalidate in the RPC reply handler
  xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').


 include/linux/sunrpc/xprt.h|1 
 net/sunrpc/xprt.c  |   14 +++
 net/sunrpc/xprtrdma/backchannel.c  |   22 ++---
 net/sunrpc/xprtrdma/fmr_ops.c  |   64 +
 net/sunrpc/xprtrdma/frwr_ops.c |  175 +++-
 net/sunrpc/xprtrdma/physical_ops.c |   13 +++
 net/sunrpc/xprtrdma/rpc_rdma.c |   14 +++
 net/sunrpc/xprtrdma/transport.c|3 +
 net/sunrpc/xprtrdma/verbs.c|   13 +--
 net/sunrpc/xprtrdma/xprt_rdma.h|   12 +-
 10 files changed, 283 insertions(+), 48 deletions(-)

--
Chuck Lever


Re: [PATCH v3 5/6] svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies

2015-12-13 Thread Chuck Lever
Hi Tom-


> On Dec 12, 2015, at 10:24 PM, Tom Talpey  wrote:
> 
> Three comments.
> 
> On 12/7/2015 12:43 PM, Chuck Lever wrote:
>> To support the NFSv4.1 backchannel on RDMA connections, add a
>> capability for receiving an RPC/RDMA reply on a connection
>> established by a client.
>> (snip)
> 
>> +/* By convention, backchannel calls arrive via rdma_msg type
> 
> "By convention" is ok, but it's important to note that this is
> actually not "by protocol". Therefore, the following check may
> reject valid messages. Even though it is unlikely an implementation
> will insert chunks, it's not illegal, and ignoring them will
> be less harmful. So I'm going to remake my earlier observation
> that three checks below should be removed:

The convention is established in

 https://datatracker.ietf.org/doc/draft-ietf-nfsv4-rpcrdma-bidirection/

Specifically, it states that backward direction messages in
RPC-over-RDMA Version One cannot have chunks and cannot be
larger than the connection's inline threshold. Thus these
checks are all valid and will not reject valid forward or
backward messages.

The reason for that stipulation is it makes the RPC-over-RDMA
header in backward direction messages a fixed size.

a. The call direction field in incoming RPC headers is then
   always at a predictable offset in backward direction calls.
   This means chunk lists don't have to be parsed to find the
   call direction field. All that is needed is to ensure that
   the chunk lists in the header are empty before going after
   the call direction field. If any chunk list is present, it
   must be a forward direction message.

b. The maximum size of backward direction messages is
   predictable: it is always the inline threshold minus the
   size of the RPC-over-RDMA header (always 28 bytes when
   there are no chunks).

The client side has a very similar looking set of checks in
its reply handler to distinguish incoming backward direction
RPC calls from forward direction RPC replies.
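
As an illustration of point (a), once the three chunk lists are known to be empty, the call direction can be read at a fixed offset. A minimal sketch (the helper name and argument handling are assumptions, not code from this series):

static bool
rpc_hdr_is_backward_call(__be32 *rdma_hdr, size_t len)
{
	/* With no chunks, the RPC/RDMA header is exactly
	 * RPCRDMA_HDRLEN_MIN (28) bytes, so the RPC message starts
	 * right after it: XID first, then the direction word.
	 */
	__be32 *rpc_hdr = rdma_hdr + RPCRDMA_HDRLEN_MIN / sizeof(__be32);

	if (len < RPCRDMA_HDRLEN_MIN + 2 * sizeof(__be32))
		return false;
	return rpc_hdr[1] == cpu_to_be32(RPC_CALL);
}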


>> + * messages, and never populate the chunk lists. This makes
>> + * the RPC/RDMA header small and fixed in size, so it is
>> + * straightforward to check the RPC header's direction field.
>> + */
>> +static bool
>> +svc_rdma_is_backchannel_reply(struct svc_xprt *xprt, struct rpcrdma_msg 
>> *rmsgp)
>> +{
>> +__be32 *p = (__be32 *)rmsgp;
>> +
>> +if (!xprt->xpt_bc_xprt)
>> +return false;
>> +
>> +if (rmsgp->rm_type != rdma_msg)
>> +return false;
> 
> These three:
> 
>> +if (rmsgp->rm_body.rm_chunks[0] != xdr_zero)
>> +return false;
>> +if (rmsgp->rm_body.rm_chunks[1] != xdr_zero)
>> +return false;
>> +if (rmsgp->rm_body.rm_chunks[2] != xdr_zero)
>> +return false;
>> +
> 
> (snip)
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h 
>> b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index a1fd74a..3895574 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -309,6 +309,8 @@ struct rpcrdma_buffer {
>>  u32 rb_bc_srv_max_requests;
>>  spinlock_t  rb_reqslock;/* protect rb_allreqs */
>>  struct list_headrb_allreqs;
>> +
>> +u32 rb_bc_max_requests;
> 
> Why does this need to be u32? Shouldn't it be an int, and also the
> rb_bc_srv_max_requests just above? The forward channel max_requests
> are int, btw.

The XDR definition of RPC-over-RDMA in Section 4.3 of
RFC 5666 defines the on-the-wire credits field as a 
uint32.

I prefer that the host representation of this
field match the sign and be able to contain the full
range values that can be conveyed in the on-the-wire
field. That makes marshaling and unmarshaling of the
wire value straightforward; and reasoning about the
C code in our implementation also applies directly to
wire values as well.

The forward channel max_requests field on the client,
rb_max_requests, is also a u32 since commit 9b1dcbc8cf46
from February 2015.

I've changed the equivalent server side field,
sc_max_requests, to a u32 in the next version of this
series.

ib_device_attr::max_qp_wr and svcxprt_rdma::sc_sq_depth
are both signed, but I can't imagine what a negative
value in these fields could mean.

sc_sq_depth is plugged into ib_qp_cap::max_send_wr and
ib_cq_init_attr::cqe, which are both unsigned, so
sc_sq_depth really should be unsigned too, IMHO.
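
To make the sign issue concrete, a small illustrative example (not from any of the patches):

	/* A signed host field can silently turn a bogus negative value
	 * into an enormous credit grant on the wire; an unsigned field
	 * round-trips exactly what the protocol can express.
	 */
	int bad_credits = -1;
	__be32 wire = cpu_to_be32(bad_credits);	/* 0xffffffff: ~4 billion credits */

	u32 credits = 32;
	wire = cpu_to_be32(credits);		/* decodes back to exactly 32 */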

--
Chuck Lever






Re: [PATCH v3 1/6] svcrdma: Do not send XDR roundup bytes for a write chunk

2015-12-13 Thread Chuck Lever
Hi Tom-


> On Dec 12, 2015, at 10:14 PM, Tom Talpey  wrote:
> 
> Two small comments.
> 
> On 12/7/2015 12:42 PM, Chuck Lever wrote:
>> Minor optimization: when dealing with write chunk XDR roundup, do
>> not post a Write WR for the zero bytes in the pad. Simply update
>> the write segment in the RPC-over-RDMA header to reflect the extra
>> pad bytes.
>> 
>> The Reply chunk is also a write chunk, but the server does not use
>> send_write_chunks() to send the Reply chunk. That's OK in this case:
>> the server Upper Layer typically marshals the Reply chunk contents
>> in a single contiguous buffer, without a separate tail for the XDR
>> pad.
>> 
>> The comments and the variable naming refer to "chunks" but what is
>> really meant is "segments." The existing code sends only one
>> xdr_write_chunk per RPC reply.
>> 
>> The fix assumes this as well. When the XDR pad in the first write
>> chunk is reached, the assumption is the Write list is complete and
>> send_write_chunks() returns.
>> 
>> That will remain a valid assumption until the server Upper Layer can
>> support multiple bulk payload results per RPC.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>>  net/sunrpc/xprtrdma/svc_rdma_sendto.c |7 +++
>>  1 file changed, 7 insertions(+)
>> 
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
>> b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> index 969a1ab..bad5eaa 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> @@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
>>  arg_ch->rs_handle,
>>  arg_ch->rs_offset,
>>  write_len);
>> +
>> +/* Do not send XDR pad bytes */
> 
> It might be clearer to say "marshal" instead of "send".

Marshaling each segment happens unconditionally in the
svc_rdma_xdr_encode_array_chunk() call just before this
comment. I really do mean "Do not send" here: the patch
is intended to squelch the RDMA Write of the XDR pad for
this chunk.

Perhaps "Do not write" would be more precise, but Bruce
has already committed this patch, IIRC.


>> +if (chunk_no && write_len < 4) {
> 
> Why is it necessary to check for chunk_no == 0? It is not
> possible for leading data to ever be padding, nor is a leading
> data element ever less than 4 bytes long. Right?

I'm checking for chunk_no != 0, for exactly the reasons
you mentioned.


> Tom.
> 
>> +chunk_no++;
>> +break;
>> +}
>> +
>>  chunk_off = 0;
>>  while (write_len) {
>>  ret = send_write(xprt, rqstp,

--
Chuck Lever






Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-10 Thread Chuck Lever

> On Dec 10, 2015, at 6:30 PM, Christoph Hellwig  wrote:
> 
> On Thu, Dec 10, 2015 at 11:07:03AM -0700, Jason Gunthorpe wrote:
>> The ARM folks do this sort of stuff on a regular basis.. Very early on
>> Doug prepares a topic branch with only the big change, NFS folks pull
>> it and then pull your work. Then Doug would send the topic branch to
>> Linus as soon as the merge window opens, then NFS would send theirs.
>> 
>> This is a lot less work overall than trying to sequence multiple
>> patches over multiple releases..
> 
> Agreed.  Staging has always been a giant pain and things tend to never
> finish moving over that way if they are non-trivial enough.

In that case:

You need to make sure you have all the right Acks. I've added
Anna and Bruce to Ack the NFS-related portions. Santosh should
Ack the RDS part.

http://git.infradead.org/users/hch/rdma.git/shortlog/refs/heads/ib_device_attr

Given the proximity to the holidays and the next merge window,
Doug will need to get a properly-acked topic branch published
in the next day or two so the rest of us can rebase and start
testing before the relevant parties disappear for the holiday.


--
Chuck Lever






Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-10 Thread Chuck Lever

> On Dec 10, 2015, at 3:27 AM, Sagi Grimberg  wrote:
> 
> 
> 
>> Doug this is going to conflict with the rdmavt work.  So if you take this 
>> could
>> you respond on the list.
> 
> It will also conflict with the iser remote invalidate series.
> 
> Doug it would help if you share your plans so people can rebase
> accordingly.

I would be remiss not to mention that it probably also
conflicts with the NFS server bi-directional RPC/RDMA
series.

Invasive IB core changes like this cleanup are especially
burdensome for me because NFS/RDMA changes do not normally
go through Doug's tree, so it takes extra co-ordination.

Here is a modest proposal. An obvious way to split the
device attr cleanup might go like this:

a. first patch: add new fields to ib_device
b. then one patch for each provider to populate these fields
c. then one patch for each kernel ULP to use the new fields
d. then one patch for each provider to remove ->query_attr
e. last patch: remove ib_device_attr from the IB core

That way each provider and ULP maintainer can review and
ack the portion of the changes that he or she is responsible
for, and it should help make it much easier to merge with
conflicting changes.

Splitting it across more than one kernel release would be
helpful too, IMO. a. and b. can go into 4.5, c. into 4.6,
and d. and e. can go in any time after that.

This adds more "process" but given the long chain of core
changes now in plan, we should acknowledge how disruptive
they will be, and come up with ways to make it possible to
get other work done while the core maintenance work
progresses.


--
Chuck Lever






[PATCH v3 3/6] svcrdma: Define maximum number of backchannel requests

2015-12-07 Thread Chuck Lever
Extra resources for handling backchannel requests have to be
pre-allocated when a transport instance is created. Set a limit.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h  |2 ++
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   14 +-
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 2bb0ff3..f71c625 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -138,6 +138,7 @@ struct svcxprt_rdma {
 
int  sc_max_requests;   /* Depth of RQ */
int  sc_max_req_size;   /* Size of each RQ WR buf */
+   int  sc_max_bc_requests;
 
struct ib_pd *sc_pd;
 
@@ -182,6 +183,7 @@ struct svcxprt_rdma {
 #define RPCRDMA_SQ_DEPTH_MULT   8
#define RPCRDMA_MAX_REQUESTS    32
#define RPCRDMA_MAX_REQ_SIZE    4096
+#define RPCRDMA_MAX_BC_REQUESTS 2
 
 #define RPCSVC_MAXPAYLOAD_RDMA RPCSVC_MAXPAYLOAD
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index ede88f3..ed5dd93 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -566,6 +566,7 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
 
cma_xprt->sc_max_req_size = svcrdma_max_req_size;
cma_xprt->sc_max_requests = svcrdma_max_requests;
+   cma_xprt->sc_max_bc_requests = RPCRDMA_MAX_BC_REQUESTS;
cma_xprt->sc_sq_depth = svcrdma_max_requests * RPCRDMA_SQ_DEPTH_MULT;
atomic_set(&cma_xprt->sc_sq_count, 0);
atomic_set(&cma_xprt->sc_ctxt_used, 0);
@@ -922,6 +923,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
struct ib_device_attr devattr;
int uninitialized_var(dma_mr_acc);
int need_dma_mr = 0;
+   int total_reqs;
int ret;
int i;
 
@@ -957,8 +959,10 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
newxprt->sc_max_sge_rd = min_t(size_t, devattr.max_sge_rd,
   RPCSVC_MAXPAGES);
newxprt->sc_max_requests = min((size_t)devattr.max_qp_wr,
-  (size_t)svcrdma_max_requests);
-   newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_max_requests;
+  (size_t)svcrdma_max_requests);
+   newxprt->sc_max_bc_requests = RPCRDMA_MAX_BC_REQUESTS;
+   total_reqs = newxprt->sc_max_requests + newxprt->sc_max_bc_requests;
+   newxprt->sc_sq_depth = total_reqs * RPCRDMA_SQ_DEPTH_MULT;
 
for (i = newxprt->sc_sq_depth; i; i--) {
struct svc_rdma_op_ctxt *ctxt;
@@ -997,7 +1001,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
dprintk("svcrdma: error creating SQ CQ for connect request\n");
goto errout;
}
-   cq_attr.cqe = newxprt->sc_max_requests;
+   cq_attr.cqe = total_reqs;
newxprt->sc_rq_cq = ib_create_cq(newxprt->sc_cm_id->device,
 rq_comp_handler,
 cq_event_handler,
@@ -1012,7 +1016,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
qp_attr.event_handler = qp_event_handler;
qp_attr.qp_context = &newxprt->sc_xprt;
qp_attr.cap.max_send_wr = newxprt->sc_sq_depth;
-   qp_attr.cap.max_recv_wr = newxprt->sc_max_requests;
+   qp_attr.cap.max_recv_wr = total_reqs;
qp_attr.cap.max_send_sge = newxprt->sc_max_sge;
qp_attr.cap.max_recv_sge = newxprt->sc_max_sge;
qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
@@ -1108,7 +1112,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
newxprt->sc_cm_id->device->local_dma_lkey;
 
/* Post receive buffers */
-   for (i = 0; i < newxprt->sc_max_requests; i++) {
+   for (i = 0; i < total_reqs; i++) {
ret = svc_rdma_post_recv(newxprt);
if (ret) {
dprintk("svcrdma: failure posting receive buffers\n");



[PATCH v3 6/6] xprtrdma: Add class for RDMA backwards direction transport

2015-12-07 Thread Chuck Lever
To support the server-side of an NFSv4.1 backchannel on RDMA
connections, add a transport class that enables backward
direction messages on an existing forward channel connection.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/xprt.h|1 
 net/sunrpc/xprt.c  |1 
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |  219 
 net/sunrpc/xprtrdma/svc_rdma_transport.c   |   14 +-
 net/sunrpc/xprtrdma/transport.c|   31 +++-
 net/sunrpc/xprtrdma/xprt_rdma.h|   11 +
 6 files changed, 263 insertions(+), 14 deletions(-)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 69ef5b3..7637ccd 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -85,6 +85,7 @@ struct rpc_rqst {
__u32 * rq_buffer;  /* XDR encode buffer */
size_t  rq_callsize,
rq_rcvsize;
+   void *  rq_privdata; /* xprt-specific per-rqst data */
size_t  rq_xmit_bytes_sent; /* total bytes sent */
size_t  rq_reply_bytes_recvd;   /* total reply bytes */
/* received */
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 2e98f4a..37edea6 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1425,3 +1425,4 @@ void xprt_put(struct rpc_xprt *xprt)
if (atomic_dec_and_test(&xprt->count))
xprt_destroy(xprt);
 }
+EXPORT_SYMBOL_GPL(xprt_put);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
index 69dab71..3534e75 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
@@ -84,3 +84,222 @@ out_notfound:
 
goto out_unlock;
 }
+
+/* Server-side transport endpoint wants a whole page for its send
+ * buffer. The client RPC code constructs the RPC header in this
+ * buffer before it invokes ->send_request.
+ *
+ * Returns NULL if there was a temporary allocation failure.
+ */
+static void *
+xprt_rdma_bc_allocate(struct rpc_task *task, size_t size)
+{
+   struct rpc_rqst *rqst = task->tk_rqstp;
+   struct svc_rdma_op_ctxt *ctxt;
+   struct svcxprt_rdma *rdma;
+   struct svc_xprt *sxprt;
+   struct page *page;
+
+   /* Prevent an infinite loop: don't return NULL */
+   if (size > PAGE_SIZE)
+   WARN_ONCE(1, "svcrdma: bc buffer request too large (size %zu)\n",
+ size);
+
+   page = alloc_page(RPCRDMA_DEF_GFP);
+   if (!page)
+   return NULL;
+
+   sxprt = rqst->rq_xprt->bc_xprt;
+   rdma = container_of(sxprt, struct svcxprt_rdma, sc_xprt);
+   ctxt = svc_rdma_get_context(rdma);
+   if (!ctxt) {
+   put_page(page);
+   return NULL;
+   }
+
+   rqst->rq_privdata = ctxt;
+   ctxt->pages[0] = page;
+   ctxt->count = 1;
+   return page_address(page);
+}
+
+static void
+xprt_rdma_bc_free(void *buffer)
+{
+   /* No-op: ctxt and page have already been freed. */
+}
+
+static int
+rpcrdma_bc_send_request(struct svcxprt_rdma *rdma, struct rpc_rqst *rqst)
+{
+   struct rpc_xprt *xprt = rqst->rq_xprt;
+   struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+   struct rpcrdma_msg *headerp = (struct rpcrdma_msg *)rqst->rq_buffer;
+   struct svc_rdma_op_ctxt *ctxt;
+   int rc;
+
+   /* Space in the send buffer for an RPC/RDMA header is reserved
+* via xprt->tsh_size */
+   headerp->rm_xid = rqst->rq_xid;
+   headerp->rm_vers = rpcrdma_version;
+   headerp->rm_credit = cpu_to_be32(r_xprt->rx_buf.rb_bc_max_requests);
+   headerp->rm_type = rdma_msg;
+   headerp->rm_body.rm_chunks[0] = xdr_zero;
+   headerp->rm_body.rm_chunks[1] = xdr_zero;
+   headerp->rm_body.rm_chunks[2] = xdr_zero;
+
+#ifdef SVCRDMA_BACKCHANNEL_DEBUG
+   pr_info("%s: %*ph\n", __func__, 64, rqst->rq_buffer);
+#endif
+
+   ctxt = (struct svc_rdma_op_ctxt *)rqst->rq_privdata;
+   rc = svc_rdma_bc_post_send(rdma, ctxt, &rqst->rq_snd_buf);
+   if (rc)
+   goto drop_connection;
+   return rc;
+
+drop_connection:
+   dprintk("svcrdma: failed to send bc call\n");
+   svc_rdma_put_context(ctxt, 1);
+   xprt_disconnect_done(xprt);
+   return -ENOTCONN;
+}
+
+/* Send an RPC call on the passive end of a transport
+ * connection.
+ */
+static int
+xprt_rdma_bc_send_request(struct rpc_task *task)
+{
+   struct rpc_rqst *rqst = task->tk_rqstp;
+   struct svc_xprt *sxprt = rqst->rq_xprt->bc_xprt;
+   struct svcxprt_rdma *rdma;
+   u32 len;
+
+   dprintk("svcrdma: sending bc call with xid: %08x\n",
+   be32_to_cpu(rqst->rq_xid))
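The quoted patch is cut off here. Roughly, the remainder (a sketch built from the locals declared above plus an assumed return-code variable; not the verbatim hunk) serializes on the svc_xprt and hands the already-marshaled call to rpcrdma_bc_send_request():

	/* Sketch only; error handling and congestion details elided */
	mutex_lock(&sxprt->xpt_mutex);
	rdma = container_of(sxprt, struct svcxprt_rdma, sc_xprt);
	rc = rpcrdma_bc_send_request(rdma, rqst);	/* rc: assumed local */
	mutex_unlock(&sxprt->xpt_mutex);
	return rc;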

[PATCH v3 5/6] svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies

2015-12-07 Thread Chuck Lever
To support the NFSv4.1 backchannel on RDMA connections, add a
capability for receiving an RPC/RDMA reply on a connection
established by a client.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h|5 ++
 net/sunrpc/xprtrdma/Makefile   |2 -
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |   86 
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c|   51 +
 net/sunrpc/xprtrdma/xprt_rdma.h|2 +
 5 files changed, 145 insertions(+), 1 deletion(-)
 create mode 100644 net/sunrpc/xprtrdma/svc_rdma_backchannel.c

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index bf9b17b..5371d42 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -187,6 +187,11 @@ struct svcxprt_rdma {
 
 #define RPCSVC_MAXPAYLOAD_RDMA RPCSVC_MAXPAYLOAD
 
+/* svc_rdma_backchannel.c */
+extern int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt,
+   struct rpcrdma_msg *rmsgp,
+   struct xdr_buf *rcvbuf);
+
 /* svc_rdma_marshal.c */
 extern int svc_rdma_xdr_decode_req(struct rpcrdma_msg **, struct svc_rqst *);
 extern int svc_rdma_xdr_encode_error(struct svcxprt_rdma *,
diff --git a/net/sunrpc/xprtrdma/Makefile b/net/sunrpc/xprtrdma/Makefile
index 33f99d3..dc9f3b5 100644
--- a/net/sunrpc/xprtrdma/Makefile
+++ b/net/sunrpc/xprtrdma/Makefile
@@ -2,7 +2,7 @@ obj-$(CONFIG_SUNRPC_XPRT_RDMA) += rpcrdma.o
 
 rpcrdma-y := transport.o rpc_rdma.o verbs.o \
fmr_ops.o frwr_ops.o physical_ops.o \
-   svc_rdma.o svc_rdma_transport.o \
+   svc_rdma.o svc_rdma_backchannel.o svc_rdma_transport.o \
svc_rdma_marshal.o svc_rdma_sendto.o svc_rdma_recvfrom.o \
module.o
 rpcrdma-$(CONFIG_SUNRPC_BACKCHANNEL) += backchannel.o
diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
new file mode 100644
index 000..69dab71
--- /dev/null
+++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
@@ -0,0 +1,86 @@
+/*
+ * Copyright (c) 2015 Oracle.  All rights reserved.
+ *
+ * Support for backward direction RPCs on RPC/RDMA (server-side).
+ */
+
+#include <linux/sunrpc/svc_rdma.h>
+#include "xprt_rdma.h"
+
+#define RPCDBG_FACILITY RPCDBG_SVCXPRT
+
+#undef SVCRDMA_BACKCHANNEL_DEBUG
+
+int
+svc_rdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
+struct xdr_buf *rcvbuf)
+{
+   struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+   struct kvec *dst, *src = &rcvbuf->head[0];
+   struct rpc_rqst *req;
+   unsigned long cwnd;
+   u32 credits;
+   size_t len;
+   __be32 xid;
+   __be32 *p;
+   int ret;
+
+   p = (__be32 *)src->iov_base;
+   len = src->iov_len;
+   xid = rmsgp->rm_xid;
+
+#ifdef SVCRDMA_BACKCHANNEL_DEBUG
+   pr_info("%s: xid=%08x, length=%zu\n",
+   __func__, be32_to_cpu(xid), len);
+   pr_info("%s: RPC/RDMA: %*ph\n",
+   __func__, (int)RPCRDMA_HDRLEN_MIN, rmsgp);
+   pr_info("%s:  RPC: %*ph\n",
+   __func__, (int)len, p);
+#endif
+
+   ret = -EAGAIN;
+   if (src->iov_len < 24)
+   goto out_shortreply;
+
+   spin_lock_bh(&xprt->transport_lock);
+   req = xprt_lookup_rqst(xprt, xid);
+   if (!req)
+   goto out_notfound;
+
+   dst = &req->rq_private_buf.head[0];
+   memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf));
+   if (dst->iov_len < len)
+   goto out_unlock;
+   memcpy(dst->iov_base, p, len);
+
+   credits = be32_to_cpu(rmsgp->rm_credit);
+   if (credits == 0)
+   credits = 1;/* don't deadlock */
+   else if (credits > r_xprt->rx_buf.rb_bc_max_requests)
+   credits = r_xprt->rx_buf.rb_bc_max_requests;
+
+   cwnd = xprt->cwnd;
+   xprt->cwnd = credits << RPC_CWNDSHIFT;
+   if (xprt->cwnd > cwnd)
+   xprt_release_rqst_cong(req->rq_task);
+
+   ret = 0;
+   xprt_complete_rqst(req->rq_task, rcvbuf->len);
+   rcvbuf->len = 0;
+
+out_unlock:
+   spin_unlock_bh(&xprt->transport_lock);
+out:
+   return ret;
+
+out_shortreply:
+   dprintk("svcrdma: short bc reply: xprt=%p, len=%zu\n",
+   xprt, src->iov_len);
+   goto out;
+
+out_notfound:
+   dprintk("svcrdma: unrecognized bc reply: xprt=%p, xid=%08x\n",
+   xprt, be32_to_cpu(xid));
+
+   goto out_unlock;
+}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index ff4f01e..faa1fc0 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -567,6 +567,38 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
return ret;
 }
 
+/* 
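The hunk is truncated at the comment above. Its purpose, sketched loosely here (identifiers follow the helpers added by this patch, but this is not the verbatim hunk), is to divert backchannel replies out of the normal server dispatch path:

	/* Sketch: in svc_rdma_recvfrom(), once the RPC/RDMA header is
	 * decoded, a reply to a backchannel call is handed to the
	 * client-side reply handler rather than dispatched as a new
	 * server request.
	 */
	if (svc_rdma_is_backchannel_reply(xprt, rmsgp)) {
		ret = svc_rdma_handle_bc_reply(xprt->xpt_bc_xprt, rmsgp,
					       &rqstp->rq_arg);
		svc_rdma_put_context(ctxt, 0);
		return ret;
	}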

[PATCH v3 1/6] svcrdma: Do not send XDR roundup bytes for a write chunk

2015-12-07 Thread Chuck Lever
Minor optimization: when dealing with write chunk XDR roundup, do
not post a Write WR for the zero bytes in the pad. Simply update
the write segment in the RPC-over-RDMA header to reflect the extra
pad bytes.

The Reply chunk is also a write chunk, but the server does not use
send_write_chunks() to send the Reply chunk. That's OK in this case:
the server Upper Layer typically marshals the Reply chunk contents
in a single contiguous buffer, without a separate tail for the XDR
pad.

The comments and the variable naming refer to "chunks" but what is
really meant is "segments." The existing code sends only one
xdr_write_chunk per RPC reply.

The fix assumes this as well. When the XDR pad in the first write
chunk is reached, the assumption is the Write list is complete and
send_write_chunks() returns.

That will remain a valid assumption until the server Upper Layer can
support multiple bulk payload results per RPC.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c |7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 969a1ab..bad5eaa 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
arg_ch->rs_handle,
arg_ch->rs_offset,
write_len);
+
+   /* Do not send XDR pad bytes */
+   if (chunk_no && write_len < 4) {
+   chunk_no++;
+   break;
+   }
+
chunk_off = 0;
while (write_len) {
ret = send_write(xprt, rqstp,
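
A worked example of what the new check skips (illustrative numbers, not from the patch): suppose a reply returns 13 bytes of write chunk data. XDR roundup pads that to 16 bytes on the wire, so after the 13 data bytes are handled the loop reaches a pass where write_len is just the 3 pad bytes. That length is still encoded into the reply's write list by the svc_rdma_xdr_encode_array_chunk() call above, but because chunk_no is non-zero and write_len < 4, no RDMA Write is posted for those zero pad bytes and the loop ends.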



[PATCH v3 4/6] svcrdma: Add infrastructure to send backwards direction RPC/RDMA calls

2015-12-07 Thread Chuck Lever
To support the NFSv4.1 backchannel on RDMA connections, add a
mechanism for sending a backwards-direction RPC/RDMA call on a
connection established by a client.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h   |2 +
 net/sunrpc/xprtrdma/svc_rdma_sendto.c |   61 +
 2 files changed, 63 insertions(+)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index f71c625..bf9b17b 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -215,6 +215,8 @@ extern int rdma_read_chunk_frmr(struct svcxprt_rdma *, struct svc_rqst *,
 extern int svc_rdma_sendto(struct svc_rqst *);
 extern struct rpcrdma_read_chunk *
svc_rdma_get_read_chunk(struct rpcrdma_msg *);
+extern int svc_rdma_bc_post_send(struct svcxprt_rdma *,
+struct svc_rdma_op_ctxt *, struct xdr_buf *);
 
 /* svc_rdma_transport.c */
 extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index bad5eaa..846df63 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -648,3 +648,64 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
svc_rdma_put_context(ctxt, 0);
return ret;
 }
+
+/* Send a backwards direction RPC call.
+ *
+ * Caller holds the connection's mutex and has already marshaled the
+ * RPC/RDMA request. Before sending the request, this API also posts
+ * an extra receive buffer to catch the bc reply for this request.
+ */
+int svc_rdma_bc_post_send(struct svcxprt_rdma *rdma,
+ struct svc_rdma_op_ctxt *ctxt, struct xdr_buf *sndbuf)
+{
+   struct svc_rdma_req_map *vec;
+   struct ib_send_wr send_wr;
+   int ret;
+
+   vec = svc_rdma_get_req_map();
+   ret = map_xdr(rdma, sndbuf, vec);
+   if (ret)
+   goto out;
+
+   /* Post a recv buffer to handle reply for this request */
+   ret = svc_rdma_post_recv(rdma);
+   if (ret) {
+   pr_err("svcrdma: Failed to post bc receive buffer, err=%d. "
+  "Closing transport %p.\n", ret, rdma);
+   set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
+   ret = -ENOTCONN;
+   goto out;
+   }
+
+   ctxt->wr_op = IB_WR_SEND;
+   ctxt->direction = DMA_TO_DEVICE;
+   ctxt->sge[0].lkey = rdma->sc_dma_lkey;
+   ctxt->sge[0].length = sndbuf->len;
+   ctxt->sge[0].addr =
+   ib_dma_map_page(rdma->sc_cm_id->device, ctxt->pages[0], 0,
+   sndbuf->len, DMA_TO_DEVICE);
+   if (ib_dma_mapping_error(rdma->sc_cm_id->device, ctxt->sge[0].addr)) {
+   svc_rdma_unmap_dma(ctxt);
+   ret = -EIO;
+   goto out;
+   }
+   atomic_inc(&rdma->sc_dma_used);
+
+   memset(&send_wr, 0, sizeof send_wr);
+   send_wr.wr_id = (unsigned long)ctxt;
+   send_wr.sg_list = ctxt->sge;
+   send_wr.num_sge = 1;
+   send_wr.opcode = IB_WR_SEND;
+   send_wr.send_flags = IB_SEND_SIGNALED;
+
+   ret = svc_rdma_send(rdma, &send_wr);
+   if (ret) {
+   svc_rdma_unmap_dma(ctxt);
+   ret = -EIO;
+   goto out;
+   }
+out:
+   svc_rdma_put_req_map(vec);
+   dprintk("svcrdma: %s returns %d\n", __func__, ret);
+   return ret;
+}



[PATCH v3 2/6] svcrdma: Improve allocation of struct svc_rdma_op_ctxt

2015-12-07 Thread Chuck Lever
Turns out that when the maximum payload size of NFS READ and WRITE
was increased to 1MB, the size of struct svc_rdma_op_ctxt
increased to 6KB (x86_64). That makes allocating one of these from
a kmem_cache more likely to fail.

Allocating one of these has to be fast in general, and none of the
current caller sites expect allocation failure. The existing logic
ensures no failure by looping and sleeping.

Since I'm about to add a caller where this allocation must always
work _and_ it cannot sleep, pre-allocate them for each connection,
like other RDMA transport-related resources.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h  |4 ++
 net/sunrpc/xprtrdma/svc_rdma.c   |   17 ---
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   76 ++
 net/sunrpc/xprtrdma/xprt_rdma.h  |2 -
 4 files changed, 70 insertions(+), 29 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index f869807..2bb0ff3 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -69,6 +69,7 @@ extern atomic_t rdma_stat_sq_prod;
  * completes.
  */
 struct svc_rdma_op_ctxt {
+   struct list_head free_q;
struct svc_rdma_op_ctxt *read_hdr;
struct svc_rdma_fastreg_mr *frmr;
int hdr_count;
@@ -142,6 +143,9 @@ struct svcxprt_rdma {
 
atomic_t sc_dma_used;
atomic_t sc_ctxt_used;
+   struct list_head sc_ctxt_q;
+   spinlock_t   sc_ctxt_lock;
+
struct list_head sc_rq_dto_q;
spinlock_t   sc_rq_dto_lock;
struct ib_qp *sc_qp;
diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
index 1b7051b..aed1d96 100644
--- a/net/sunrpc/xprtrdma/svc_rdma.c
+++ b/net/sunrpc/xprtrdma/svc_rdma.c
@@ -71,9 +71,7 @@ atomic_t rdma_stat_rq_prod;
 atomic_t rdma_stat_sq_poll;
 atomic_t rdma_stat_sq_prod;
 
-/* Temporary NFS request map and context caches */
 struct kmem_cache *svc_rdma_map_cachep;
-struct kmem_cache *svc_rdma_ctxt_cachep;
 
 struct workqueue_struct *svc_rdma_wq;
 
@@ -244,7 +242,6 @@ void svc_rdma_cleanup(void)
 #endif
svc_unreg_xprt_class(&svc_rdma_class);
kmem_cache_destroy(svc_rdma_map_cachep);
-   kmem_cache_destroy(svc_rdma_ctxt_cachep);
 }
 
 int svc_rdma_init(void)
@@ -275,26 +272,12 @@ int svc_rdma_init(void)
goto err0;
}
 
-   /* Create the temporary context cache */
-   svc_rdma_ctxt_cachep =
-   kmem_cache_create("svc_rdma_ctxt_cache",
- sizeof(struct svc_rdma_op_ctxt),
- 0,
- SLAB_HWCACHE_ALIGN,
- NULL);
-   if (!svc_rdma_ctxt_cachep) {
-   printk(KERN_INFO "Could not allocate WR ctxt cache.\n");
-   goto err1;
-   }
-
/* Register RDMA with the SVC transport switch */
svc_reg_xprt_class(&svc_rdma_class);
 #if defined(CONFIG_SUNRPC_BACKCHANNEL)
svc_reg_xprt_class(&svc_rdma_bc_class);
 #endif
return 0;
- err1:
-   kmem_cache_destroy(svc_rdma_map_cachep);
  err0:
unregister_sysctl_table(svcrdma_table_header);
destroy_workqueue(svc_rdma_wq);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index b348b4a..ede88f3 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -155,16 +155,27 @@ static void svc_rdma_bc_free(struct svc_xprt *xprt)
 
 struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
 {
-   struct svc_rdma_op_ctxt *ctxt;
+   struct svc_rdma_op_ctxt *ctxt = NULL;
+
+   spin_lock_bh(&xprt->sc_ctxt_lock);
+   if (list_empty(&xprt->sc_ctxt_q))
+   goto out_empty;
+
+   ctxt = list_first_entry(&xprt->sc_ctxt_q,
+  struct svc_rdma_op_ctxt, free_q);
+   list_del_init(&ctxt->free_q);
+   spin_unlock_bh(&xprt->sc_ctxt_lock);
 
-   ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep,
-   GFP_KERNEL | __GFP_NOFAIL);
-   ctxt->xprt = xprt;
-   INIT_LIST_HEAD(&ctxt->dto_q);
ctxt->count = 0;
ctxt->frmr = NULL;
+
atomic_inc(&xprt->sc_ctxt_used);
return ctxt;
+
+out_empty:
+   spin_unlock_bh(&xprt->sc_ctxt_lock);
+   pr_err("svcrdma: empty RDMA ctxt list?\n");
+   return NULL;
 }
 
 void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt)
@@ -198,7 +209,27 @@ void svc_rdma_put_context(struct svc_rdma_op_ctxt *ctxt, int free_pages)
for (i = 0; i < ctxt->count; i++)
put_page(ctxt->pages[i]);
 
-   kmem_cache_free(svc_rdma_ctxt_cachep, ctxt);
+   spin_lo
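The diff is truncated mid-identifier above. As a sketch of how svc_rdma_put_context() ends up under the new scheme (using only fields introduced earlier in this patch; xprt here is ctxt->xprt, and this is not the verbatim hunk), the context returns to the per-connection free list instead of going back to a kmem_cache:

	spin_lock_bh(&xprt->sc_ctxt_lock);
	list_add(&ctxt->free_q, &xprt->sc_ctxt_q);
	spin_unlock_bh(&xprt->sc_ctxt_lock);

	atomic_dec(&xprt->sc_ctxt_used);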

[PATCH v3 0/6] NFS/RDMA server patches for 4.5

2015-12-07 Thread Chuck Lever
Here are patches to support server-side bi-directional RPC/RDMA
operation (to enable NFSv4.1 on RPC/RDMA transports). Thanks to
all who reviewed v1 and v2.

Also available in the "nfsd-rdma-for-4.5" topic branch of this git repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git

Or for browsing:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-for-4.5


Changes since v2:
- Rebased on v4.4-rc4
- Backchannel code in new source file to address dprintk issues
- svc_rdma_get_context() now uses a pre-allocated cache
- Dropped svc_rdma_send clean up


Changes since v1:

- Rebased on v4.4-rc3
- Removed the use of CONFIG_SUNRPC_BACKCHANNEL
- Fixed computation of forward and backward max_requests
- Updated some comments and patch descriptions
- pr_err and pr_info converted to dprintk
- Simplified svc_rdma_get_context()
- Dropped patch removing access_flags field
- NFSv4.1 callbacks tested with for-4.5 client

---

Chuck Lever (6):
  svcrdma: Do not send XDR roundup bytes for a write chunk
  svcrdma: Improve allocation of struct svc_rdma_op_ctxt
  svcrdma: Define maximum number of backchannel requests
  svcrdma: Add infrastructure to send backwards direction RPC/RDMA calls
  svcrdma: Add infrastructure to receive backwards direction RPC/RDMA 
replies
  xprtrdma: Add class for RDMA backwards direction transport


 include/linux/sunrpc/svc_rdma.h|   13 +
 include/linux/sunrpc/xprt.h|1 
 net/sunrpc/xprt.c  |1 
 net/sunrpc/xprtrdma/Makefile   |2 
 net/sunrpc/xprtrdma/svc_rdma.c |   17 --
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |  305 
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c|   51 +
 net/sunrpc/xprtrdma/svc_rdma_sendto.c  |   68 ++
 net/sunrpc/xprtrdma/svc_rdma_transport.c   |  104 --
 net/sunrpc/xprtrdma/transport.c|   31 ++-
 net/sunrpc/xprtrdma/xprt_rdma.h|   15 +
 11 files changed, 559 insertions(+), 49 deletions(-)
 create mode 100644 net/sunrpc/xprtrdma/svc_rdma_backchannel.c

--
Chuck Lever


Re: warning in ext4 with nfs/rdma server

2015-12-07 Thread Chuck Lever
Hi Steve-

> On Dec 7, 2015, at 10:38 AM, Steve Wise  wrote:
> 
> Hey Chuck/NFS developers,
> 
> We're hitting this warning in ext4 on the linux-4.3 nfs server running over 
> RDMA/cxgb4.  We're still gathering data, like if it
> happens with NFS/TCP.  But has anyone seen this warning on 4.3?  Is it likely 
> to indicate some bug in the xprtrdma transport or
> above it in NFS?

Yes, please confirm with NFS/TCP. Thanks!


> We can hit this running cthon tests over 2 mount points:
> 
> -
> #!/bin/bash
> rm -rf /root/cthon04/loop_iter.txt
> while [ 1 ]
> do
> {
> 
> ./server -s -m /mnt/share1 -o rdma,port=20049,vers=4 -p /mnt/share1 -N 100
> 102.1.1.162 &
> ./server -s -m /mnt/share2 -o rdma,port=20049,vers=3,rsize=65535,wsize=65535 
> -p
> /mnt/share2 -N 100 102.2.2.162 &
> wait
> echo "iteration $i" >>/root/cthon04/loop_iter.txt
> date >>/root/cthon04/loop_iter.txt
> }
> done
> --
> 
> Thanks,
> 
> Steve.
> 
> [ cut here ]
> WARNING: CPU: 14 PID: 6689 at fs/ext4/inode.c:231 ext4_evict_inode+0x41e/0x490
> [ext4]()
> Modules linked in: nfsd(E) lockd(E) grace(E) nfs_acl(E) exportfs(E)
> auth_rpcgss(E) rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_uverbs(E) rdma_cm(E)
> ib_cm(E) ib_sa(E) ib_mad(E) iw_cxgb4(E) iw_cm(E) ib_core(E) ib_addr(E) 
> cxgb4(E)
> autofs4(E) target_core_iblock(E) target_core_file(E) target_core_pscsi(E)
> target_core_mod(E) configfs(E) bnx2fc(E) cnic(E) uio(E) fcoe(E) libfcoe(E)
> 8021q(E) libfc(E) garp(E) stp(E) llc(E) cpufreq_ondemand(E) cachefiles(E)
> fscache(E) ipv6(E) dm_mirror(E) dm_region_hash(E) dm_log(E) vhost_net(E)
> macvtap(E) macvlan(E) vhost(E) tun(E) kvm(E) uinput(E) microcode(E) sg(E)
> pcspkr(E) serio_raw(E) fam15h_power(E) k10temp(E) amd64_edac_mod(E)
> edac_core(E) edac_mce_amd(E) i2c_piix4(E) igb(E) dca(E) i2c_algo_bit(E)
> i2c_core(E) ptp(E) pps_core(E) scsi_transport_fc(E) acpi_cpufreq(E) dm_mod(E)
> ext4(E) jbd2(E) mbcache(E) sr_mod(E) cdrom(E) sd_mod(E) ahci(E) libahci(E)
> [last unloaded: cxgb4]
> CPU: 14 PID: 6689 Comm: nfsd Tainted: GE   4.3.0 #1
> Hardware name: Supermicro H8QGL/H8QGL, BIOS 3.512/19/2013
> 00e7 88400634fad8 812a4084 a00c96eb
>  88400634fb18 81059fd5 88400634fbd8
> 880fd1a460c8 880fd1a461d8 880fd1a46008 88400634fbd8
> Call Trace:
> [] dump_stack+0x48/0x64
> [] warn_slowpath_common+0x95/0xe0
> [] warn_slowpath_null+0x1a/0x20
> [] ext4_evict_inode+0x41e/0x490 [ext4]
> [] evict+0xae/0x1a0
> [] iput_final+0xe5/0x170
> [] iput+0xa3/0xf0
> [] ? fsnotify_destroy_marks+0x64/0x80
> [] dentry_unlink_inode+0xa9/0xe0
> [] d_delete+0xa6/0xb0
> [] vfs_unlink+0x138/0x140
> [] nfsd_unlink+0x165/0x200 [nfsd]
> [] ? lru_put_end+0x5c/0x70 [nfsd]
> [] nfsd3_proc_remove+0x83/0x120 [nfsd]
> [] nfsd_dispatch+0xdc/0x210 [nfsd]
> [] svc_process_common+0x311/0x620 [sunrpc]
> [] ? nfsd_set_nrthreads+0x1b0/0x1b0 [nfsd]
> [] svc_process+0x128/0x1b0 [sunrpc]
> [] nfsd+0xf3/0x160 [nfsd]
> [] kthread+0xcc/0xf0
> [] ? schedule_tail+0x1e/0xc0
> [] ? kthread_freezable_should_stop+0x70/0x70
> [] ret_from_fork+0x3f/0x70
> [] ? kthread_freezable_should_stop+0x70/0x70
> ---[ end trace 39afe9aeef2cfb34 ]---
> [ cut here ]

--
Chuck Lever






Re: [PATCH v1 3/8] svcrdma: Add svc_rdma_get_context() API that is allowed to fail

2015-12-04 Thread Chuck Lever

> On Nov 24, 2015, at 1:55 AM, Christoph Hellwig  wrote:
> 
>> +struct svc_rdma_op_ctxt *svc_rdma_get_context_gfp(struct svcxprt_rdma *xprt,
>> +  gfp_t flags)
>> +{
>> +struct svc_rdma_op_ctxt *ctxt;
>> +
>> +ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep, flags);
>> +if (!ctxt)
>> +return NULL;
>> +svc_rdma_init_context(xprt, ctxt);
>> +return ctxt;
>> +}
>> +
>> +struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
>> +{
>> +struct svc_rdma_op_ctxt *ctxt;
>> +
>> +ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep,
>> +GFP_KERNEL | __GFP_NOFAIL);
>> +svc_rdma_init_context(xprt, ctxt);
>>  return ctxt;
> 
> Sounds like you should have just added a gfp_t argument to
> svc_rdma_get_context.  And if we have any way to avoid the __GFP_NOFAIL
> I'd really appreciate if we could give that a try.

Changed my mind on this.

struct svc_rdma_op_ctxt used to be smaller than a page, so these
allocations were not likely to fail. But since the maximum NFS
READ and WRITE payload for NFS/RDMA has been increased to 1MB,
struct svc_rdma_op_ctxt has grown to more than 6KB, thus it is
no longer an order 0 memory allocation.
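
Rough arithmetic behind that claim (x86_64; the array layout is assumed from the struct definition and the numbers are approximate):

	/* The two per-context arrays dominate once RPCSVC_MAXPAGES is
	 * sized for a 1MB payload (roughly 259 pages):
	 */
	struct ib_sge	 sge[RPCSVC_MAXPAGES];		/* ~259 * 16B ~= 4KB */
	struct page	*pages[RPCSVC_MAXPAGES];	/* ~259 *  8B ~= 2KB */

	/* => sizeof(struct svc_rdma_op_ctxt) > 6KB, larger than one 4KB
	 * page, so kmem_cache_alloc() has to find an order-1 slab.
	 */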

Some ideas:

1. Pre-allocate these per connection in svc_rdma_accept().
There will never be more than sc_sq_depth of these. But that
could be a large number to allocate during connection
establishment.

2. Once allocated, cache them. If traffic doesn’t manage to
allocate sc_sq_depth of these over time, allocation can still
fail during a traffic burst in very low memory scenarios.

3. Use a mempool. This reserves a few of these which may never
be used. But allocation can still fail once the reserve is
consumed (same as 2).

4. Break out the sge and pages arrays into separate allocations
so the allocation requests are order 0.

1 seems like the most robust solution, and it would be fast.
svc_rdma_get_context is a very common operation.
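
A minimal sketch of idea 1 (the function name is made up; the free-list fields match what the v3 "Improve allocation of struct svc_rdma_op_ctxt" patch elsewhere in this thread adds to svcxprt_rdma):

static int svc_rdma_prealloc_ctxts(struct svcxprt_rdma *xprt)
{
	unsigned int i;

	/* One context per send queue slot, allocated once at accept
	 * time, where sleeping allocations are still acceptable.
	 */
	for (i = 0; i < xprt->sc_sq_depth; i++) {
		struct svc_rdma_op_ctxt *ctxt;

		ctxt = kmalloc(sizeof(*ctxt), GFP_KERNEL);
		if (!ctxt)
			return -ENOMEM;
		ctxt->xprt = xprt;
		list_add(&ctxt->free_q, &xprt->sc_ctxt_q);
	}
	return 0;
}

svc_rdma_get_context() then only has to pop the list, so it never needs to allocate in a context that cannot sleep.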


--
Chuck Lever






Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR

2015-12-01 Thread Chuck Lever

> On Nov 24, 2015, at 2:12 AM, Jason Gunthorpe 
>  wrote:
> 
> On Mon, Nov 23, 2015 at 10:52:26PM -0800, Christoph Hellwig wrote:
>> 
>> So at lest for 4.5 we're unlikely to be able to get rid of it alone
>> due to the RDS issue.  We'll then need performance numbers for mlx4,
>> and figure out how much we care about mthca.
> 
> mthca is unfortunately very popular in the used market, it is very
> easy to get cards, build a cheap test cluster, etc. :(

Oracle recently announced Sonoma, which is a SPARC CPU with
an on-chip IB HCA. Oracle plans to publish an open-source
GPL device driver that enables this HCA in Linux for SPARC.
We’d eventually like to contribute it to the upstream
Linux kernel.

At the moment, the device and driver support only FMR. As
you might expect, Oracle needs it to work at least with
in-kernel RDS. Thus we hope to see the life of in-kernel
FMR extended for a bit to accommodate this new device.


--
Chuck Lever






Re: [PATCH v2 6/7] xprtrdma: Add class for RDMA backwards direction transport

2015-12-01 Thread Chuck Lever

> On Dec 1, 2015, at 8:38 AM, Tom Talpey  wrote:
> 
> On 11/30/2015 5:25 PM, Chuck Lever wrote:
>> To support the server-side of an NFSv4.1 backchannel on RDMA
>> connections, add a transport class that enables backward
>> direction messages on an existing forward channel connection.
>> 
> 
>> +static void *
>> +xprt_rdma_bc_allocate(struct rpc_task *task, size_t size)
>> +{
>> +struct rpc_rqst *rqst = task->tk_rqstp;
>> +struct svc_rdma_op_ctxt *ctxt;
>> +struct svcxprt_rdma *rdma;
>> +struct svc_xprt *sxprt;
>> +struct page *page;
>> +
>> +if (size > PAGE_SIZE) {
>> +WARN_ONCE(1, "failed to handle buffer allocation (size %zu)\n",
>> +  size);
> 
> You may want to add more context to that rather cryptic string, at
> least the function name.
> 
> Also, it's not exactly "failed to handle", it's an invalid size. Why
> would this ever happen? Why even log it?
> 
> 
> 
>> +static int
>> +rpcrdma_bc_send_request(struct svcxprt_rdma *rdma, struct rpc_rqst *rqst)
>> +{
> ...
>> +
>> +drop_connection:
>> +dprintk("Failed to send backchannel call\n");
> 
> Ditto on the prefix / function context.
> 
> And also...
> 
>> +dprintk("%s: sending request with xid: %08x\n",
>> +__func__, be32_to_cpu(rqst->rq_xid));
> ...
>> +dprintk("RPC:   %s: xprt %p\n", __func__, xprt);
> 
> The format strings for many of the dprintk's are somewhat inconsistent.
> Some start with "RPC", some with the function name, and some (in other
> patches of this series) with "svcrdma". Confusing, and perhaps hard to
> pick out of the log.

They do follow a convention: “RPC:  “ is used on the client
side when there is no rpc_task::tk_pid available. “svcrdma” is
used on the server everywhere.

The dprintk changes here were a bit cursory, so I will go back and
review them more carefully.
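
For reference, the two prefixes under discussion, as they appear in the current patches:

	/* Client side, where no rpc_task (and thus no tk_pid) is available: */
	dprintk("RPC:       %s: xprt %p\n", __func__, xprt);

	/* Server side: */
	dprintk("svcrdma: %s returns %d\n", __func__, ret);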

--
Chuck Lever






[PATCH v2 4/7] svcrdma: Add infrastructure to send backwards direction RPC/RDMA calls

2015-11-30 Thread Chuck Lever
To support the NFSv4.1 backchannel on RDMA connections, add a
mechanism for sending a backwards-direction RPC/RDMA call on a
connection established by a client.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h   |2 +
 net/sunrpc/xprtrdma/svc_rdma_sendto.c |   61 +
 2 files changed, 63 insertions(+)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index c189fbd..6f52995 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -211,6 +211,8 @@ extern int rdma_read_chunk_frmr(struct svcxprt_rdma *, struct svc_rqst *,
 extern int svc_rdma_sendto(struct svc_rqst *);
 extern struct rpcrdma_read_chunk *
svc_rdma_get_read_chunk(struct rpcrdma_msg *);
+extern int svc_rdma_bc_post_send(struct svcxprt_rdma *,
+struct svc_rdma_op_ctxt *, struct xdr_buf *);
 
 /* svc_rdma_transport.c */
 extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index bad5eaa..846df63 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -648,3 +648,64 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
svc_rdma_put_context(ctxt, 0);
return ret;
 }
+
+/* Send a backwards direction RPC call.
+ *
+ * Caller holds the connection's mutex and has already marshaled the
+ * RPC/RDMA request. Before sending the request, this API also posts
+ * an extra receive buffer to catch the bc reply for this request.
+ */
+int svc_rdma_bc_post_send(struct svcxprt_rdma *rdma,
+ struct svc_rdma_op_ctxt *ctxt, struct xdr_buf *sndbuf)
+{
+   struct svc_rdma_req_map *vec;
+   struct ib_send_wr send_wr;
+   int ret;
+
+   vec = svc_rdma_get_req_map();
+   ret = map_xdr(rdma, sndbuf, vec);
+   if (ret)
+   goto out;
+
+   /* Post a recv buffer to handle reply for this request */
+   ret = svc_rdma_post_recv(rdma);
+   if (ret) {
+   pr_err("svcrdma: Failed to post bc receive buffer, err=%d. "
+  "Closing transport %p.\n", ret, rdma);
+   set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
+   ret = -ENOTCONN;
+   goto out;
+   }
+
+   ctxt->wr_op = IB_WR_SEND;
+   ctxt->direction = DMA_TO_DEVICE;
+   ctxt->sge[0].lkey = rdma->sc_dma_lkey;
+   ctxt->sge[0].length = sndbuf->len;
+   ctxt->sge[0].addr =
+   ib_dma_map_page(rdma->sc_cm_id->device, ctxt->pages[0], 0,
+   sndbuf->len, DMA_TO_DEVICE);
+   if (ib_dma_mapping_error(rdma->sc_cm_id->device, ctxt->sge[0].addr)) {
+   svc_rdma_unmap_dma(ctxt);
+   ret = -EIO;
+   goto out;
+   }
+   atomic_inc(&rdma->sc_dma_used);
+
+   memset(&send_wr, 0, sizeof send_wr);
+   send_wr.wr_id = (unsigned long)ctxt;
+   send_wr.sg_list = ctxt->sge;
+   send_wr.num_sge = 1;
+   send_wr.opcode = IB_WR_SEND;
+   send_wr.send_flags = IB_SEND_SIGNALED;
+
+   ret = svc_rdma_send(rdma, &send_wr);
+   if (ret) {
+   svc_rdma_unmap_dma(ctxt);
+   ret = -EIO;
+   goto out;
+   }
+out:
+   svc_rdma_put_req_map(vec);
+   dprintk("svcrdma: %s returns %d\n", __func__, ret);
+   return ret;
+}



[PATCH v2 3/7] svcrdma: Define maximum number of backchannel requests

2015-11-30 Thread Chuck Lever
Extra resources for handling backchannel requests have to be
pre-allocated when a transport instance is created. Set a limit.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h  |2 ++
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   14 +-
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index cc69551..c189fbd 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -137,6 +137,7 @@ struct svcxprt_rdma {
 
int  sc_max_requests;   /* Depth of RQ */
int  sc_max_req_size;   /* Size of each RQ WR buf */
+   int  sc_max_bc_requests;
 
struct ib_pd *sc_pd;
 
@@ -178,6 +179,7 @@ struct svcxprt_rdma {
 #define RPCRDMA_SQ_DEPTH_MULT   8
 #define RPCRDMA_MAX_REQUESTS32
 #define RPCRDMA_MAX_REQ_SIZE4096
+#define RPCRDMA_MAX_BC_REQUESTS2
 
 #define RPCSVC_MAXPAYLOAD_RDMA RPCSVC_MAXPAYLOAD
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 94b8d4c..643402e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -541,6 +541,7 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
 
cma_xprt->sc_max_req_size = svcrdma_max_req_size;
cma_xprt->sc_max_requests = svcrdma_max_requests;
+   cma_xprt->sc_max_bc_requests = RPCRDMA_MAX_BC_REQUESTS;
cma_xprt->sc_sq_depth = svcrdma_max_requests * RPCRDMA_SQ_DEPTH_MULT;
atomic_set(&cma_xprt->sc_sq_count, 0);
atomic_set(&cma_xprt->sc_ctxt_used, 0);
@@ -897,6 +898,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
struct ib_device_attr devattr;
int uninitialized_var(dma_mr_acc);
int need_dma_mr = 0;
+   int total_reqs;
int ret;
int i;
 
@@ -932,8 +934,10 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
newxprt->sc_max_sge_rd = min_t(size_t, devattr.max_sge_rd,
   RPCSVC_MAXPAGES);
newxprt->sc_max_requests = min((size_t)devattr.max_qp_wr,
-  (size_t)svcrdma_max_requests);
-   newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_max_requests;
+  (size_t)svcrdma_max_requests);
+   newxprt->sc_max_bc_requests = RPCRDMA_MAX_BC_REQUESTS;
+   total_reqs = newxprt->sc_max_requests + newxprt->sc_max_bc_requests;
+   newxprt->sc_sq_depth = total_reqs * RPCRDMA_SQ_DEPTH_MULT;
 
/*
 * Limit ORD based on client limit, local device limit, and
@@ -957,7 +961,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
dprintk("svcrdma: error creating SQ CQ for connect request\n");
goto errout;
}
-   cq_attr.cqe = newxprt->sc_max_requests;
+   cq_attr.cqe = total_reqs;
newxprt->sc_rq_cq = ib_create_cq(newxprt->sc_cm_id->device,
 rq_comp_handler,
 cq_event_handler,
@@ -972,7 +976,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
qp_attr.event_handler = qp_event_handler;
qp_attr.qp_context = &newxprt->sc_xprt;
qp_attr.cap.max_send_wr = newxprt->sc_sq_depth;
-   qp_attr.cap.max_recv_wr = newxprt->sc_max_requests;
+   qp_attr.cap.max_recv_wr = total_reqs;
qp_attr.cap.max_send_sge = newxprt->sc_max_sge;
qp_attr.cap.max_recv_sge = newxprt->sc_max_sge;
qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
@@ -1068,7 +1072,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
newxprt->sc_cm_id->device->local_dma_lkey;
 
/* Post receive buffers */
-   for (i = 0; i < newxprt->sc_max_requests; i++) {
+   for (i = 0; i < total_reqs; i++) {
ret = svc_rdma_post_recv(newxprt);
if (ret) {
dprintk("svcrdma: failure posting receive buffers\n");



[PATCH v2 6/7] xprtrdma: Add class for RDMA backwards direction transport

2015-11-30 Thread Chuck Lever
To support the server-side of an NFSv4.1 backchannel on RDMA
connections, add a transport class that enables backward
direction messages on an existing forward channel connection.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/xprt.h  |1 
 net/sunrpc/xprt.c|1 
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   14 +-
 net/sunrpc/xprtrdma/transport.c  |  230 ++
 net/sunrpc/xprtrdma/xprt_rdma.h  |2 
 5 files changed, 243 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 69ef5b3..7637ccd 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -85,6 +85,7 @@ struct rpc_rqst {
__u32 * rq_buffer;  /* XDR encode buffer */
size_t  rq_callsize,
rq_rcvsize;
+   void *  rq_privdata; /* xprt-specific per-rqst data */
size_t  rq_xmit_bytes_sent; /* total bytes sent */
size_t  rq_reply_bytes_recvd;   /* total reply bytes */
/* received */
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 2e98f4a..37edea6 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1425,3 +1425,4 @@ void xprt_put(struct rpc_xprt *xprt)
if (atomic_dec_and_test(&xprt->count))
xprt_destroy(xprt);
 }
+EXPORT_SYMBOL_GPL(xprt_put);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 643402e..ab5e376 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1172,12 +1172,14 @@ static void __svc_rdma_free(struct work_struct *work)
 {
struct svcxprt_rdma *rdma =
container_of(work, struct svcxprt_rdma, sc_work);
-   dprintk("svcrdma: svc_rdma_free(%p)\n", rdma);
+   struct svc_xprt *xprt = &rdma->sc_xprt;
+
+   dprintk("svcrdma: %s(%p)\n", __func__, rdma);
 
/* We should only be called from kref_put */
-   if (atomic_read(&rdma->sc_xprt.xpt_ref.refcount) != 0)
+   if (atomic_read(&xprt->xpt_ref.refcount) != 0)
pr_err("svcrdma: sc_xprt still in use? (%d)\n",
-  atomic_read(&rdma->sc_xprt.xpt_ref.refcount));
+  atomic_read(&xprt->xpt_ref.refcount));
 
/*
 * Destroy queued, but not processed read completions. Note
@@ -1212,6 +1214,12 @@ static void __svc_rdma_free(struct work_struct *work)
pr_err("svcrdma: dma still in use? (%d)\n",
   atomic_read(&rdma->sc_dma_used));
 
+   /* Final put of backchannel client transport */
+   if (xprt->xpt_bc_xprt) {
+   xprt_put(xprt->xpt_bc_xprt);
+   xprt->xpt_bc_xprt = NULL;
+   }
+
/* De-allocate fastreg mr */
rdma_dealloc_frmr_q(rdma);
 
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 8c545f7..db1fd1f 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -51,6 +51,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "xprt_rdma.h"
 
@@ -148,7 +149,8 @@ static struct ctl_table sunrpc_table[] = {
 #define RPCRDMA_MAX_REEST_TO   (30U * HZ)
 #define RPCRDMA_IDLE_DISC_TO   (5U * 60 * HZ)
 
-static struct rpc_xprt_ops xprt_rdma_procs;/* forward reference */
+static struct rpc_xprt_ops xprt_rdma_procs;
+static struct rpc_xprt_ops xprt_rdma_bc_procs;
 
 static void
 xprt_rdma_format_addresses4(struct rpc_xprt *xprt, struct sockaddr *sap)
@@ -499,7 +501,7 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
if (req == NULL)
return NULL;
 
-   flags = GFP_NOIO | __GFP_NOWARN;
+   flags = RPCRDMA_DEF_GFP;
if (RPC_IS_SWAPPER(task))
flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
 
@@ -684,6 +686,195 @@ xprt_rdma_disable_swap(struct rpc_xprt *xprt)
 {
 }
 
+/* Server-side transport endpoint wants a whole page for its send
+ * buffer. The client RPC code constructs the RPC header in this
+ * buffer before it invokes ->send_request.
+ */
+static void *
+xprt_rdma_bc_allocate(struct rpc_task *task, size_t size)
+{
+   struct rpc_rqst *rqst = task->tk_rqstp;
+   struct svc_rdma_op_ctxt *ctxt;
+   struct svcxprt_rdma *rdma;
+   struct svc_xprt *sxprt;
+   struct page *page;
+
+   if (size > PAGE_SIZE) {
+   WARN_ONCE(1, "failed to handle buffer allocation (size %zu)\n",
+ size);
+   return NULL;
+   }
+
+   page = alloc_page(RPCRDMA_DEF_GFP);
+   if (!page)
+   return NULL;
+
+   sxprt = rqst->rq_xprt->bc_xprt;
+   rdma =

[PATCH v2 5/7] svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies

2015-11-30 Thread Chuck Lever
To support the NFSv4.1 backchannel on RDMA connections, add a
capability for receiving an RPC/RDMA reply on a connection
established by a client.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |   72 +++
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   52 ++
 net/sunrpc/xprtrdma/xprt_rdma.h |4 ++
 3 files changed, 128 insertions(+)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c10d969..e711126 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -946,3 +946,75 @@ repost:
if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, &r_xprt->rx_ep, rep))
rpcrdma_recv_buffer_put(rep);
 }
+
+int
+rpcrdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
+   struct xdr_buf *rcvbuf)
+{
+   struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+   struct kvec *dst, *src = &rcvbuf->head[0];
+   struct rpc_rqst *req;
+   unsigned long cwnd;
+   u32 credits;
+   size_t len;
+   __be32 xid;
+   __be32 *p;
+   int ret;
+
+   p = (__be32 *)src->iov_base;
+   len = src->iov_len;
+   xid = rmsgp->rm_xid;
+
+   dprintk("%s: xid=%08x, length=%zu\n",
+   __func__, be32_to_cpu(xid), len);
+   dprintk("%s: RPC/RDMA: %*ph\n",
+   __func__, (int)RPCRDMA_HDRLEN_MIN, rmsgp);
+   dprintk("%s:  RPC: %*ph\n",
+   __func__, (int)len, p);
+
+   ret = -EAGAIN;
+   if (src->iov_len < 24)
+   goto out_shortreply;
+
+   spin_lock_bh(&xprt->transport_lock);
+   req = xprt_lookup_rqst(xprt, xid);
+   if (!req)
+   goto out_notfound;
+
+   dst = &req->rq_private_buf.head[0];
+   memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf));
+   if (dst->iov_len < len)
+   goto out_unlock;
+   memcpy(dst->iov_base, p, len);
+
+   credits = be32_to_cpu(rmsgp->rm_credit);
+   if (credits == 0)
+   credits = 1;/* don't deadlock */
+   else if (credits > r_xprt->rx_buf.rb_bc_max_requests)
+   credits = r_xprt->rx_buf.rb_bc_max_requests;
+
+   cwnd = xprt->cwnd;
+   xprt->cwnd = credits << RPC_CWNDSHIFT;
+   if (xprt->cwnd > cwnd)
+   xprt_release_rqst_cong(req->rq_task);
+
+   ret = 0;
+   xprt_complete_rqst(req->rq_task, rcvbuf->len);
+   rcvbuf->len = 0;
+
+out_unlock:
+   spin_unlock_bh(&xprt->transport_lock);
+out:
+   return ret;
+
+out_shortreply:
+   dprintk("svcrdma: short bc reply: xprt=%p, len=%zu\n",
+   xprt, src->iov_len);
+   goto out;
+
+out_notfound:
+   dprintk("svcrdma: unrecognized bc reply: xprt=%p, xid=%08x\n",
+   xprt, be32_to_cpu(xid));
+
+   goto out_unlock;
+}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index ff4f01e..be89aa0 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -47,6 +47,7 @@
 #include 
 #include 
 #include 
+#include "xprt_rdma.h"
 
 #define RPCDBG_FACILITYRPCDBG_SVCXPRT
 
@@ -567,6 +568,38 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
return ret;
 }
 
+/* By convention, backchannel calls arrive via rdma_msg type
+ * messages, and never populate the chunk lists. This makes
+ * the RPC/RDMA header small and fixed in size, so it is
+ * straightforward to check the RPC header's direction field.
+ */
+static bool
+svc_rdma_is_backchannel_reply(struct svc_xprt *xprt, struct rpcrdma_msg *rmsgp)
+{
+   __be32 *p = (__be32 *)rmsgp;
+
+   if (!xprt->xpt_bc_xprt)
+   return false;
+
+   if (rmsgp->rm_type != rdma_msg)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[0] != xdr_zero)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[1] != xdr_zero)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[2] != xdr_zero)
+   return false;
+
+   /* sanity */
+   if (p[7] != rmsgp->rm_xid)
+   return false;
+   /* call direction */
+   if (p[8] == cpu_to_be32(RPC_CALL))
+   return false;
+
+   return true;
+}
+
 /*
  * Set up the rqstp thread context to point to the RQ buffer. If
  * necessary, pull additional data from the client with an RDMA_READ
@@ -632,6 +665,15 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
goto close_out;
}
 
+   if (svc_rdma_is_backchannel_reply(xprt, rmsgp)) {
+   ret = rpcrdma_handle_bc_reply(xprt->xpt_bc_xprt, rmsgp,
+ &rqstp->rq_arg);
+  

[PATCH v2 7/7] svcrdma: No need to count WRs in svc_rdma_send()

2015-11-30 Thread Chuck Lever
Minor optimization: Instead of counting WRs in a chain, have callers
pass in the number of WRs they've prepared.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h  |2 +-
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |9 ++---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c|6 +++---
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   17 ++---
 4 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 6f52995..f96d641 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -215,7 +215,7 @@ extern int svc_rdma_bc_post_send(struct svcxprt_rdma *,
 struct svc_rdma_op_ctxt *, struct xdr_buf *);
 
 /* svc_rdma_transport.c */
-extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *);
+extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *, int);
 extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
enum rpcrdma_errcode);
 extern int svc_rdma_post_recv(struct svcxprt_rdma *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index be89aa0..17b0835 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -190,7 +190,7 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
read_wr.wr.sg_list = ctxt->sge;
read_wr.wr.num_sge = pages_needed;
 
-   ret = svc_rdma_send(xprt, &read_wr.wr);
+   ret = svc_rdma_send(xprt, &read_wr.wr, 1);
if (ret) {
pr_err("svcrdma: Error %d posting RDMA_READ\n", ret);
set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
@@ -227,7 +227,7 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
int nents = PAGE_ALIGN(*page_offset + rs_length) >> PAGE_SHIFT;
struct svc_rdma_op_ctxt *ctxt = svc_rdma_get_context(xprt);
struct svc_rdma_fastreg_mr *frmr = svc_rdma_get_frmr(xprt);
-   int ret, read, pno, dma_nents, n;
+   int ret, read, pno, num_wrs, dma_nents, n;
u32 pg_off = *page_offset;
u32 pg_no = *page_no;
 
@@ -299,6 +299,8 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
ctxt->count = 1;
ctxt->read_hdr = head;
 
+   num_wrs = 2;
+
/* Prepare REG WR */
reg_wr.wr.opcode = IB_WR_REG_MR;
reg_wr.wr.wr_id = 0;
@@ -329,11 +331,12 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
inv_wr.opcode = IB_WR_LOCAL_INV;
inv_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_FENCE;
inv_wr.ex.invalidate_rkey = frmr->mr->lkey;
+   num_wrs++;
}
ctxt->wr_op = read_wr.wr.opcode;
 
/* Post the chain */
-   ret = svc_rdma_send(xprt, ®_wr.wr);
+   ret = svc_rdma_send(xprt, ®_wr.wr, num_wrs);
if (ret) {
pr_err("svcrdma: Error %d posting RDMA_READ\n", ret);
set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 846df63..65b2fd6 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -292,7 +292,7 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
 
/* Post It */
atomic_inc(&rdma_stat_write);
-   if (svc_rdma_send(xprt, &write_wr.wr))
+   if (svc_rdma_send(xprt, &write_wr.wr, 1))
goto err;
return write_len - bc;
  err:
@@ -557,7 +557,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
send_wr.opcode = IB_WR_SEND;
send_wr.send_flags =  IB_SEND_SIGNALED;
 
-   ret = svc_rdma_send(rdma, &send_wr);
+   ret = svc_rdma_send(rdma, &send_wr, 1);
if (ret)
goto err;
 
@@ -698,7 +698,7 @@ int svc_rdma_bc_post_send(struct svcxprt_rdma *rdma,
send_wr.opcode = IB_WR_SEND;
send_wr.send_flags = IB_SEND_SIGNALED;
 
-   ret = svc_rdma_send(rdma, &send_wr);
+   ret = svc_rdma_send(rdma, &send_wr, 1);
if (ret) {
svc_rdma_unmap_dma(ctxt);
ret = -EIO;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index ab5e376..77eeb23 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1274,20 +1274,15 @@ static int svc_rdma_secure_port(struct svc_rqst *rqstp)
return 1;
 }
 
-int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr)
+int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr,
+ int wr_count)
 {
-   struct ib_send_wr *bad_wr, *n_wr;
-   int wr_count;
-   int i;
-   int ret;
+   struct ib_send_wr *bad_wr;
+   int i, ret;
 
if (test

[PATCH v2 2/7] svcrdma: Add svc_rdma_get_context() API that is allowed to fail

2015-11-30 Thread Chuck Lever
To support backward direction calls, I'm going to add an
svc_rdma_get_context() call in the client RDMA transport.

Called from ->buf_alloc(), we can't sleep waiting for memory.
So add an API that can get a server op_ctxt but won't sleep.
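As a hedged sketch, a non-sleeping call site might look like the
following; the GFP flags and error handling here are illustrative
assumptions, not part of this patch:

    /* called from a ->buf_alloc() path: must not sleep, so fail fast */
    ctxt = svc_rdma_get_context_gfp(rdma, GFP_NOWAIT | __GFP_NOWARN);
    if (!ctxt)
            return NULL;    /* caller backs off and retries later */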

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h  |2 ++
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   14 +++---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index f869807..cc69551 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -217,6 +217,8 @@ extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
 extern int svc_rdma_post_recv(struct svcxprt_rdma *);
 extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *);
 extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
+extern struct svc_rdma_op_ctxt *svc_rdma_get_context_gfp(struct svcxprt_rdma *,
+gfp_t);
 extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
 extern void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt);
 extern struct svc_rdma_req_map *svc_rdma_get_req_map(void);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index b348b4a..94b8d4c 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -153,12 +153,15 @@ static void svc_rdma_bc_free(struct svc_xprt *xprt)
 }
 #endif /* CONFIG_SUNRPC_BACKCHANNEL */
 
-struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
+struct svc_rdma_op_ctxt *svc_rdma_get_context_gfp(struct svcxprt_rdma *xprt,
+ gfp_t flags)
 {
struct svc_rdma_op_ctxt *ctxt;
 
-   ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep,
-   GFP_KERNEL | __GFP_NOFAIL);
+   ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep, flags);
+   if (!ctxt)
+   return NULL;
+
ctxt->xprt = xprt;
INIT_LIST_HEAD(&ctxt->dto_q);
ctxt->count = 0;
@@ -167,6 +170,11 @@ struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
return ctxt;
 }
 
+struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
+{
+   return svc_rdma_get_context_gfp(xprt, GFP_KERNEL | __GFP_NOFAIL);
+}
+
 void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt)
 {
struct svcxprt_rdma *xprt = ctxt->xprt;



[PATCH v2 1/7] svcrdma: Do not send XDR roundup bytes for a write chunk

2015-11-30 Thread Chuck Lever
Minor optimization: when dealing with write chunk XDR roundup, do
not post a Write WR for the zero bytes in the pad. Simply update
the write segment in the RPC-over-RDMA header to reflect the extra
pad bytes.

The Reply chunk is also a write chunk, but the server does not use
send_write_chunks() to send the Reply chunk. That's OK in this case:
the server Upper Layer typically marshals the Reply chunk contents
in a single contiguous buffer, without a separate tail for the XDR
pad.

The comments and the variable naming refer to "chunks" but what is
really meant is "segments." The existing code sends only one
xdr_write_chunk per RPC reply.

The fix assumes this as well. When the XDR pad in the first write
chunk is reached, the assumption is the Write list is complete and
send_write_chunks() returns.

That will remain a valid assumption until the server Upper Layer can
support multiple bulk payload results per RPC.
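For a concrete sense of the numbers (an illustration, not code from
this patch): XDR encodes opaque data in 4-byte units, so the pad for a
payload of len bytes is

    pad = (4 - (len & 3)) & 3;    /* e.g. len = 8193 -> pad = 3 */

and it is only this pad, never more than 3 zero bytes, that no longer
gets its own Write WR.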

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c |7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 969a1ab..bad5eaa 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
arg_ch->rs_handle,
arg_ch->rs_offset,
write_len);
+
+   /* Do not send XDR pad bytes */
+   if (chunk_no && write_len < 4) {
+   chunk_no++;
+   break;
+   }
+
chunk_off = 0;
while (write_len) {
ret = send_write(xprt, rqstp,



[PATCH v2 0/8] NFS/RDMA server patches for 4.5

2015-11-30 Thread Chuck Lever
Here are patches to support server-side bi-directional RPC/RDMA
operation (to enable NFSv4.1 on RPC/RDMA transports). Thanks to
all who reviewed v1.

Also available in the "nfsd-rdma-for-4.5" topic branch of this git repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git

Or for browsing:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-for-4.5

Changes since v1:

- Rebased on v4.4-rc3
- Removed the use of CONFIG_SUNRPC_BACKCHANNEL
- Fixed computation of forward and backward max_requests
- Updated some comments and patch descriptions
- pr_err and pr_info converted to dprintk
- Simplified svc_rdma_get_context()
- Dropped patch removing access_flags field
- NFSv4.1 callbacks tested with for-4.5 client

---

Chuck Lever (8):
  svcrdma: Do not send XDR roundup bytes for a write chunk
  svcrdma: Add svc_rdma_get_context() API that is allowed to fail
  svcrdma: Define maximum number of backchannel requests
  svcrdma: Add infrastructure to send backwards direction RPC/RDMA calls
  svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies
  xprtrdma: Add class for RDMA backwards direction transport
  svcrdma: Display failed completions
  svcrdma: No need to count WRs in svc_rdma_send()


 include/linux/sunrpc/svc_rdma.h  |8 +
 include/linux/sunrpc/xprt.h  |1 
 net/sunrpc/xprt.c|1 
 net/sunrpc/xprtrdma/rpc_rdma.c   |   72 +
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |   61 
 net/sunrpc/xprtrdma/svc_rdma_sendto.c|   72 +
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   88 +++
 net/sunrpc/xprtrdma/transport.c  |  230 ++
 net/sunrpc/xprtrdma/xprt_rdma.h  |6 +
 9 files changed, 500 insertions(+), 39 deletions(-)

--
Chuck Lever


[PATCH v2 03/11] xprtrdma: Disable RPC/RDMA backchannel debugging messages

2015-11-30 Thread Chuck Lever
Clean up.

Fixes: 63cae47005af ('xprtrdma: Handle incoming backward direction')
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/backchannel.c |   16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index 11d2cfb..cd31181 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -15,7 +15,7 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
-#define RPCRDMA_BACKCHANNEL_DEBUG
+#undef RPCRDMA_BACKCHANNEL_DEBUG
 
 static void rpcrdma_bc_free_rqst(struct rpcrdma_xprt *r_xprt,
 struct rpc_rqst *rqst)
@@ -136,6 +136,7 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int reqs)
   __func__);
goto out_free;
}
+   dprintk("RPC:   %s: new rqst %p\n", __func__, rqst);
 
rqst->rq_xprt = &r_xprt->rx_xprt;
INIT_LIST_HEAD(&rqst->rq_list);
@@ -216,12 +217,14 @@ int rpcrdma_bc_marshal_reply(struct rpc_rqst *rqst)
 
rpclen = rqst->rq_svec[0].iov_len;
 
+#ifdef RPCRDMA_BACKCHANNEL_DEBUG
pr_info("RPC:   %s: rpclen %zd headerp 0x%p lkey 0x%x\n",
__func__, rpclen, headerp, rdmab_lkey(req->rl_rdmabuf));
pr_info("RPC:   %s: RPC/RDMA: %*ph\n",
__func__, (int)RPCRDMA_HDRLEN_MIN, headerp);
pr_info("RPC:   %s:  RPC: %*ph\n",
__func__, (int)rpclen, rqst->rq_svec[0].iov_base);
+#endif
 
req->rl_send_iov[0].addr = rdmab_addr(req->rl_rdmabuf);
req->rl_send_iov[0].length = RPCRDMA_HDRLEN_MIN;
@@ -265,6 +268,9 @@ void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
 {
struct rpc_xprt *xprt = rqst->rq_xprt;
 
+   dprintk("RPC:   %s: freeing rqst %p (req %p)\n",
+   __func__, rqst, rpcr_to_rdmar(rqst));
+
smp_mb__before_atomic();
WARN_ON_ONCE(!test_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state));
clear_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state);
@@ -329,9 +335,7 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt,
struct rpc_rqst, rq_bc_pa_list);
list_del(&rqst->rq_bc_pa_list);
spin_unlock(&xprt->bc_pa_lock);
-#ifdef RPCRDMA_BACKCHANNEL_DEBUG
-   pr_info("RPC:   %s: using rqst %p\n", __func__, rqst);
-#endif
+   dprintk("RPC:   %s: using rqst %p\n", __func__, rqst);
 
/* Prepare rqst */
rqst->rq_reply_bytes_recvd = 0;
@@ -351,10 +355,8 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt,
 * direction reply.
 */
req = rpcr_to_rdmar(rqst);
-#ifdef RPCRDMA_BACKCHANNEL_DEBUG
-   pr_info("RPC:   %s: attaching rep %p to req %p\n",
+   dprintk("RPC:   %s: attaching rep %p to req %p\n",
__func__, rep, req);
-#endif
req->rl_reply = rep;
 
/* Defeat the retransmit detection logic in send_request */



[PATCH v2 07/11] xprtrdma: Add ro_unmap_sync method for FMR

2015-11-30 Thread Chuck Lever
FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
is a synchronous verb. However, some improvements can be made here.

1. Gather all the MRs for the RPC request onto a list, and invoke
   ib_unmap_fmr() once with that list. This reduces the number of
   doorbells when there is more than one MR to invalidate

2. Perform the DMA unmap _after_ the MRs are unmapped, not before.
   This is critical after invalidating a Write chunk.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c |   64 +
 1 file changed, 64 insertions(+)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index f1e8daf..c14f3a4 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -179,6 +179,69 @@ out_maperr:
return rc;
 }
 
+static void
+__fmr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+   struct ib_device *device = r_xprt->rx_ia.ri_device;
+   struct rpcrdma_mw *mw = seg->rl_mw;
+   int nsegs = seg->mr_nsegs;
+
+   seg->rl_mw = NULL;
+
+   while (nsegs--)
+   rpcrdma_unmap_one(device, seg++);
+
+   rpcrdma_put_mw(r_xprt, mw);
+}
+
+/* Invalidate all memory regions that were registered for "req".
+ *
+ * Sleeps until it is safe for the host CPU to access the
+ * previously mapped memory regions.
+ */
+static void
+fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+   struct rpcrdma_mr_seg *seg;
+   unsigned int i, nchunks;
+   struct rpcrdma_mw *mw;
+   LIST_HEAD(unmap_list);
+   int rc;
+
+   dprintk("RPC:   %s: req %p\n", __func__, req);
+
+   /* ORDER: Invalidate all of the req's MRs first
+*
+* ib_unmap_fmr() is slow, so use a single call instead
+* of one call per mapped MR.
+*/
+   for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+   seg = &req->rl_segments[i];
+   mw = seg->rl_mw;
+
+   list_add(&mw->r.fmr.fmr->list, &unmap_list);
+
+   i += seg->mr_nsegs;
+   }
+   rc = ib_unmap_fmr(&unmap_list);
+   if (rc)
+   pr_warn("%s: ib_unmap_fmr failed (%i)\n", __func__, rc);
+
+   /* ORDER: Now DMA unmap all of the req's MRs, and return
+* them to the free MW list.
+*/
+   for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+   seg = &req->rl_segments[i];
+
+   __fmr_dma_unmap(r_xprt, seg);
+
+   i += seg->mr_nsegs;
+   seg->mr_nsegs = 0;
+   }
+
+   req->rl_nchunks = 0;
+}
+
 /* Use the ib_unmap_fmr() verb to prevent further remote
  * access via RDMA READ or RDMA WRITE.
  */
@@ -231,6 +294,7 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)
 
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
+   .ro_unmap_sync  = fmr_op_unmap_sync,
.ro_unmap   = fmr_op_unmap,
.ro_open= fmr_op_open,
.ro_maxpages= fmr_op_maxpages,



[PATCH v2 09/11] SUNRPC: Introduce xprt_commit_rqst()

2015-11-30 Thread Chuck Lever
I'm about to add code in the RPC/RDMA reply handler between the
xprt_lookup_rqst() and xprt_complete_rqst() call site that needs
to execute outside of spinlock critical sections.

Add a hook to remove an rpc_rqst from the pending list once
the transport knows it's going to invoke xprt_complete_rqst().

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/xprt.h|1 +
 net/sunrpc/xprt.c  |   14 ++
 net/sunrpc/xprtrdma/rpc_rdma.c |4 
 3 files changed, 19 insertions(+)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 69ef5b3..ab6c3a5 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -366,6 +366,7 @@ void xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action);
 void   xprt_write_space(struct rpc_xprt *xprt);
 void   xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task 
*task, int result);
 struct rpc_rqst *  xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
+void   xprt_commit_rqst(struct rpc_task *task);
 void   xprt_complete_rqst(struct rpc_task *task, int copied);
 void   xprt_release_rqst_cong(struct rpc_task *task);
 void   xprt_disconnect_done(struct rpc_xprt *xprt);
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 2e98f4a..a5be4ab 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task)
 }
 
 /**
+ * xprt_commit_rqst - remove rqst from pending list early
+ * @task: RPC request to remove
+ *
+ * Caller holds transport lock.
+ */
+void xprt_commit_rqst(struct rpc_task *task)
+{
+   struct rpc_rqst *req = task->tk_rqstp;
+
+   list_del_init(&req->rq_list);
+}
+EXPORT_SYMBOL_GPL(xprt_commit_rqst);
+
+/**
  * xprt_complete_rqst - called when reply processing is complete
  * @task: RPC request that recently completed
  * @copied: actual number of bytes received from the transport
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c10d969..0bc8c39 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -804,6 +804,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
if (req->rl_reply)
goto out_duplicate;
 
+   xprt_commit_rqst(rqst->rq_task);
+   spin_unlock_bh(&xprt->transport_lock);
+
dprintk("RPC:   %s: reply 0x%p completes request 0x%p\n"
"   RPC request 0x%p xid 0x%08x\n",
__func__, rep, req, rqst,
@@ -894,6 +897,7 @@ badheader:
else if (credits > r_xprt->rx_buf.rb_max_requests)
credits = r_xprt->rx_buf.rb_max_requests;
 
+   spin_lock_bh(&xprt->transport_lock);
cwnd = xprt->cwnd;
xprt->cwnd = credits << RPC_CWNDSHIFT;
if (xprt->cwnd > cwnd)



[PATCH v2 08/11] xprtrdma: Add ro_unmap_sync method for all-physical registration

2015-11-30 Thread Chuck Lever
physical's ro_unmap is synchronous already. The new ro_unmap_sync
method just has to DMA unmap all MRs associated with the RPC
request.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/physical_ops.c |   13 +
 1 file changed, 13 insertions(+)

diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c
index 617b76f..dbb302e 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -83,6 +83,18 @@ physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
return 1;
 }
 
+/* DMA unmap all memory regions that were mapped for "req".
+ */
+static void
+physical_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+   struct ib_device *device = r_xprt->rx_ia.ri_device;
+   unsigned int i;
+
+   for (i = 0; req->rl_nchunks; --req->rl_nchunks)
+   rpcrdma_unmap_one(device, &req->rl_segments[i++]);
+}
+
 static void
 physical_op_destroy(struct rpcrdma_buffer *buf)
 {
@@ -90,6 +102,7 @@ physical_op_destroy(struct rpcrdma_buffer *buf)
 
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
.ro_map = physical_op_map,
+   .ro_unmap_sync  = physical_op_unmap_sync,
.ro_unmap   = physical_op_unmap,
.ro_open= physical_op_open,
.ro_maxpages= physical_op_maxpages,



[PATCH v2 05/11] xprtrdma: Introduce ro_unmap_sync method

2015-11-30 Thread Chuck Lever
In the current xprtrdma implementation, some memreg strategies
implement ro_unmap synchronously (the MR is knocked down before the
method returns) and some asynchronously (the MR will be knocked down
and returned to the pool in the background).

To guarantee the MR is truly invalid before the RPC consumer is
allowed to resume execution, we need an unmap method that is
always synchronous, invoked from the RPC/RDMA reply handler.

The new method unmaps all MRs for an RPC. The existing ro_unmap
method unmaps only one MR at a time.
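For context, the call site added later in this series (patch 10/11)
invokes the new method from the reply handler like this, with the
surrounding reply-handler code elided:

    if (req->rl_nchunks)
            r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);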

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/xprt_rdma.h |2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index ca481b2..d2384b5 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -366,6 +366,8 @@ struct rpcrdma_xprt;
 struct rpcrdma_memreg_ops {
int (*ro_map)(struct rpcrdma_xprt *,
  struct rpcrdma_mr_seg *, int, bool);
+   void(*ro_unmap_sync)(struct rpcrdma_xprt *,
+struct rpcrdma_req *);
int (*ro_unmap)(struct rpcrdma_xprt *,
struct rpcrdma_mr_seg *);
int (*ro_open)(struct rpcrdma_ia *,



[PATCH v2 04/11] xprtrdma: Move struct ib_send_wr off the stack

2015-11-30 Thread Chuck Lever
For FRWR FASTREG and LOCAL_INV, move the ib_*_wr structure off
the stack. This allows frwr_op_map and frwr_op_unmap to chain
WRs together without limit to register or invalidate a set of MRs
with a single ib_post_send().

(This will be for chaining LOCAL_INV requests).

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/frwr_ops.c  |   38 --
 net/sunrpc/xprtrdma/xprt_rdma.h |2 ++
 2 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 88cf9e7..31a4578 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -319,7 +319,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
struct rpcrdma_mw *mw;
struct rpcrdma_frmr *frmr;
struct ib_mr *mr;
-   struct ib_reg_wr reg_wr;
+   struct ib_reg_wr *reg_wr;
struct ib_send_wr *bad_wr;
int rc, i, n, dma_nents;
u8 key;
@@ -336,6 +336,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
frmr = &mw->r.frmr;
frmr->fr_state = FRMR_IS_VALID;
mr = frmr->fr_mr;
+   reg_wr = &frmr->fr_regwr;
 
if (nsegs > ia->ri_max_frmr_depth)
nsegs = ia->ri_max_frmr_depth;
@@ -381,19 +382,19 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
key = (u8)(mr->rkey & 0x00FF);
ib_update_fast_reg_key(mr, ++key);
 
-   reg_wr.wr.next = NULL;
-   reg_wr.wr.opcode = IB_WR_REG_MR;
-   reg_wr.wr.wr_id = (uintptr_t)mw;
-   reg_wr.wr.num_sge = 0;
-   reg_wr.wr.send_flags = 0;
-   reg_wr.mr = mr;
-   reg_wr.key = mr->rkey;
-   reg_wr.access = writing ?
-   IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
-   IB_ACCESS_REMOTE_READ;
+   reg_wr->wr.next = NULL;
+   reg_wr->wr.opcode = IB_WR_REG_MR;
+   reg_wr->wr.wr_id = (uintptr_t)mw;
+   reg_wr->wr.num_sge = 0;
+   reg_wr->wr.send_flags = 0;
+   reg_wr->mr = mr;
+   reg_wr->key = mr->rkey;
+   reg_wr->access = writing ?
+IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+IB_ACCESS_REMOTE_READ;
 
DECR_CQCOUNT(&r_xprt->rx_ep);
-   rc = ib_post_send(ia->ri_id->qp, ®_wr.wr, &bad_wr);
+   rc = ib_post_send(ia->ri_id->qp, ®_wr->wr, &bad_wr);
if (rc)
goto out_senderr;
 
@@ -423,23 +424,24 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mw *mw = seg1->rl_mw;
struct rpcrdma_frmr *frmr = &mw->r.frmr;
-   struct ib_send_wr invalidate_wr, *bad_wr;
+   struct ib_send_wr *invalidate_wr, *bad_wr;
int rc, nsegs = seg->mr_nsegs;
 
dprintk("RPC:   %s: FRMR %p\n", __func__, mw);
 
seg1->rl_mw = NULL;
frmr->fr_state = FRMR_IS_INVALID;
+   invalidate_wr = &mw->r.frmr.fr_invwr;
 
-   memset(&invalidate_wr, 0, sizeof(invalidate_wr));
-   invalidate_wr.wr_id = (unsigned long)(void *)mw;
-   invalidate_wr.opcode = IB_WR_LOCAL_INV;
-   invalidate_wr.ex.invalidate_rkey = frmr->fr_mr->rkey;
+   memset(invalidate_wr, 0, sizeof(*invalidate_wr));
+   invalidate_wr->wr_id = (uintptr_t)mw;
+   invalidate_wr->opcode = IB_WR_LOCAL_INV;
+   invalidate_wr->ex.invalidate_rkey = frmr->fr_mr->rkey;
DECR_CQCOUNT(&r_xprt->rx_ep);
 
ib_dma_unmap_sg(ia->ri_device, frmr->sg, frmr->sg_nents, seg1->mr_dir);
read_lock(&ia->ri_qplock);
-   rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
+   rc = ib_post_send(ia->ri_id->qp, invalidate_wr, &bad_wr);
read_unlock(&ia->ri_qplock);
if (rc)
goto out_err;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index ac7f8d4..ca481b2 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -207,6 +207,8 @@ struct rpcrdma_frmr {
enum rpcrdma_frmr_state fr_state;
struct work_struct  fr_work;
struct rpcrdma_xprt *fr_xprt;
+   struct ib_reg_wrfr_regwr;
+   struct ib_send_wr   fr_invwr;
 };
 
 struct rpcrdma_fmr {



[PATCH v2 06/11] xprtrdma: Add ro_unmap_sync method for FRWR

2015-11-30 Thread Chuck Lever
FRWR's ro_unmap is asynchronous. The new ro_unmap_sync posts
LOCAL_INV Work Requests and waits for them to complete before
returning.

Note also, DMA unmapping is now done _after_ invalidation.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/frwr_ops.c  |  137 ++-
 net/sunrpc/xprtrdma/xprt_rdma.h |2 +
 2 files changed, 135 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 31a4578..5d58008 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -245,12 +245,14 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
 }
 
-/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs to be reset. */
+/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs
+ * to be reset.
+ *
+ * WARNING: Only wr_id and status are reliable at this point
+ */
 static void
-frwr_sendcompletion(struct ib_wc *wc)
+__frwr_sendcompletion_flush(struct ib_wc *wc, struct rpcrdma_mw *r)
 {
-   struct rpcrdma_mw *r;
-
if (likely(wc->status == IB_WC_SUCCESS))
return;
 
@@ -261,9 +263,23 @@ frwr_sendcompletion(struct ib_wc *wc)
else
pr_warn("RPC:   %s: frmr %p error, status %s (%d)\n",
__func__, r, ib_wc_status_msg(wc->status), wc->status);
+
r->r.frmr.fr_state = FRMR_IS_STALE;
 }
 
+static void
+frwr_sendcompletion(struct ib_wc *wc)
+{
+   struct rpcrdma_mw *r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+   struct rpcrdma_frmr *f = &r->r.frmr;
+
+   if (unlikely(wc->status != IB_WC_SUCCESS))
+   __frwr_sendcompletion_flush(wc, r);
+
+   if (f->fr_waiter)
+   complete(&f->fr_linv_done);
+}
+
 static int
 frwr_op_init(struct rpcrdma_xprt *r_xprt)
 {
@@ -335,6 +351,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
} while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
frmr = &mw->r.frmr;
frmr->fr_state = FRMR_IS_VALID;
+   frmr->fr_waiter = false;
mr = frmr->fr_mr;
reg_wr = &frmr->fr_regwr;
 
@@ -414,6 +431,117 @@ out_senderr:
return rc;
 }
 
+static struct ib_send_wr *
+__frwr_prepare_linv_wr(struct rpcrdma_mr_seg *seg)
+{
+   struct rpcrdma_mw *mw = seg->rl_mw;
+   struct rpcrdma_frmr *f = &mw->r.frmr;
+   struct ib_send_wr *invalidate_wr;
+
+   f->fr_waiter = false;
+   f->fr_state = FRMR_IS_INVALID;
+   invalidate_wr = &f->fr_invwr;
+
+   memset(invalidate_wr, 0, sizeof(*invalidate_wr));
+   invalidate_wr->wr_id = (unsigned long)(void *)mw;
+   invalidate_wr->opcode = IB_WR_LOCAL_INV;
+   invalidate_wr->ex.invalidate_rkey = f->fr_mr->rkey;
+
+   return invalidate_wr;
+}
+
+static void
+__frwr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
+int rc)
+{
+   struct ib_device *device = r_xprt->rx_ia.ri_device;
+   struct rpcrdma_mw *mw = seg->rl_mw;
+   int nsegs = seg->mr_nsegs;
+
+   seg->rl_mw = NULL;
+
+   while (nsegs--)
+   rpcrdma_unmap_one(device, seg++);
+
+   if (!rc)
+   rpcrdma_put_mw(r_xprt, mw);
+   else
+   __frwr_queue_recovery(mw);
+}
+
+/* Invalidate all memory regions that were registered for "req".
+ *
+ * Sleeps until it is safe for the host CPU to access the
+ * previously mapped memory regions.
+ */
+static void
+frwr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+   struct ib_send_wr *invalidate_wrs, *pos, *prev, *bad_wr;
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_mr_seg *seg;
+   unsigned int i, nchunks;
+   struct rpcrdma_frmr *f;
+   int rc;
+
+   dprintk("RPC:   %s: req %p\n", __func__, req);
+
+   /* ORDER: Invalidate all of the req's MRs first
+*
+* Chain the LOCAL_INV Work Requests and post them with
+* a single ib_post_send() call.
+*/
+   invalidate_wrs = pos = prev = NULL;
+   seg = NULL;
+   for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+   seg = &req->rl_segments[i];
+
+   pos = __frwr_prepare_linv_wr(seg);
+
+   if (!invalidate_wrs)
+   invalidate_wrs = pos;
+   else
+   prev->next = pos;
+   prev = pos;
+
+   i += seg->mr_nsegs;
+   }
+   f = &seg->rl_mw->r.frmr;
+
+   /* Strong send queue ordering guarantees that when the
+* last WR in the chain completes, all WRs in the chain
+* are complete.
+*/
+   f->fr_invwr.send_flags = IB_SEND_SIGNALED;
+   f->fr_waiter = true;
+  

[PATCH v2 10/11] xprtrdma: Invalidate in the RPC reply handler

2015-11-30 Thread Chuck Lever
There is a window between the time the RPC reply handler wakes the
waiting RPC task and when xprt_release() invokes ops->buf_free.
During this time, memory regions containing the data payload may
still be accessed by a broken or malicious server, but the RPC
application has already been allowed access to the memory containing
the RPC request's data payloads.

The server should be fenced from client memory containing RPC data
payloads _before_ the RPC application is allowed to continue.

This change also more strongly enforces send queue accounting. There
is a maximum number of RPC calls allowed to be outstanding. When an
RPC/RDMA transport is set up, just enough send queue resources are
allocated to handle registration, Send, and invalidation WRs for
each of those RPCs at the same time.

Before, additional RPC calls could be dispatched while invalidation
WRs were still consuming send WQEs. When invalidation WRs backed
up, dispatching additional RPCs resulted in a send queue overrun.

Now, the reply handler prevents RPC dispatch until invalidation is
complete. This prevents RPC call dispatch until there are enough
send queue resources to proceed.
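To put rough, illustrative numbers on that accounting: if each
in-flight RPC can consume up to one FASTREG WR, one Send WR, and one
LOCAL_INV WR, a credit limit of, say, 32 RPCs needs on the order of
3 x 32 = 96 send queue entries. Dispatching another RPC while earlier
LOCAL_INV WRs are still queued is what overran a send queue sized for
less than that.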

Still to do: If an RPC exits early (say, ^C), the reply handler has
no opportunity to perform invalidation. Currently, xprt_rdma_free()
still frees remaining RDMA resources, which could deadlock.
Additional changes are needed to handle invalidation properly in this
case.

Reported-by: Jason Gunthorpe 
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/rpc_rdma.c |   10 ++
 1 file changed, 10 insertions(+)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 0bc8c39..3d00c5d 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -891,6 +891,16 @@ badheader:
break;
}
 
+   /* Invalidate and flush the data payloads before waking the
+* waiting application. This guarantees the memory region is
+* properly fenced from the server before the application
+* accesses the data. It also ensures proper send flow
+* control: waking the next RPC waits until this RPC has
+* relinquished all its Send Queue entries.
+*/
+   if (req->rl_nchunks)
+   r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);
+
credits = be32_to_cpu(headerp->rm_credit);
if (credits == 0)
credits = 1;/* don't deadlock */



[PATCH v2 11/11] xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').

2015-11-30 Thread Chuck Lever
The root of the problem was that sends (especially unsignalled
FASTREG and LOCAL_INV Work Requests) were not properly flow-
controlled, which allowed a send queue overrun.

Now that the RPC/RDMA reply handler waits for invalidation to
complete, the send queue is properly flow-controlled. Thus this
limit is no longer necessary.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |6 ++
 net/sunrpc/xprtrdma/xprt_rdma.h |6 --
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index d9c2097..daff5e7 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -616,10 +616,8 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 
/* set trigger for requesting send completion */
ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
-   if (ep->rep_cqinit > RPCRDMA_MAX_UNSIGNALED_SENDS)
-   ep->rep_cqinit = RPCRDMA_MAX_UNSIGNALED_SENDS;
-   else if (ep->rep_cqinit <= 2)
-   ep->rep_cqinit = 0;
+   if (ep->rep_cqinit <= 2)
+   ep->rep_cqinit = 0; /* always signal? */
INIT_CQCOUNT(ep);
init_waitqueue_head(&ep->rep_connect_wait);
INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 6726cb3..6a9e627 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -88,12 +88,6 @@ struct rpcrdma_ep {
struct delayed_work rep_connect_worker;
 };
 
-/*
- * Force a signaled SEND Work Request every so often,
- * in case the provider needs to do some housekeeping.
- */
-#define RPCRDMA_MAX_UNSIGNALED_SENDS   (32)
-
 #define INIT_CQCOUNT(ep) atomic_set(&(ep)->rep_cqcount, (ep)->rep_cqinit)
 #define DECR_CQCOUNT(ep) atomic_sub_return(1, &(ep)->rep_cqcount)
 



[PATCH v2 01/11] xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)

2015-11-30 Thread Chuck Lever
Clean up.

rb_lock critical sections added in rpcrdma_ep_post_extra_recv()
should have first been converted to use normal spin_lock now that
the reply handler is a work queue.

The backchannel set up code should use the appropriate helper
instead of open-coding a rb_recv_bufs list add.

Problem introduced by glib patch re-ordering on my part.

Fixes: f531a5dbc451 ('xprtrdma: Pre-allocate backward rpc_rqst')
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/backchannel.c |6 +-
 net/sunrpc/xprtrdma/verbs.c   |7 +++
 2 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index 2dcb44f..11d2cfb 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -84,9 +84,7 @@ out_fail:
 static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
 unsigned int count)
 {
-   struct rpcrdma_buffer *buffers = &r_xprt->rx_buf;
struct rpcrdma_rep *rep;
-   unsigned long flags;
int rc = 0;
 
while (count--) {
@@ -98,9 +96,7 @@ static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
break;
}
 
-   spin_lock_irqsave(&buffers->rb_lock, flags);
-   list_add(&rep->rr_list, &buffers->rb_recv_bufs);
-   spin_unlock_irqrestore(&buffers->rb_lock, flags);
+   rpcrdma_recv_buffer_put(rep);
}
 
return rc;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index eadd1655..d9c2097 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1337,15 +1337,14 @@ rpcrdma_ep_post_extra_recv(struct rpcrdma_xprt *r_xprt, unsigned int count)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_ep *ep = &r_xprt->rx_ep;
struct rpcrdma_rep *rep;
-   unsigned long flags;
int rc;
 
while (count--) {
-   spin_lock_irqsave(&buffers->rb_lock, flags);
+   spin_lock(&buffers->rb_lock);
if (list_empty(&buffers->rb_recv_bufs))
goto out_reqbuf;
rep = rpcrdma_buffer_get_rep_locked(buffers);
-   spin_unlock_irqrestore(&buffers->rb_lock, flags);
+   spin_unlock(&buffers->rb_lock);
 
rc = rpcrdma_ep_post_recv(ia, ep, rep);
if (rc)
@@ -1355,7 +1354,7 @@ rpcrdma_ep_post_extra_recv(struct rpcrdma_xprt *r_xprt, unsigned int count)
return 0;
 
 out_reqbuf:
-   spin_unlock_irqrestore(&buffers->rb_lock, flags);
+   spin_unlock(&buffers->rb_lock);
pr_warn("%s: no extra receive buffers\n", __func__);
return -ENOMEM;
 



[PATCH v2 00/11] NFS/RDMA client patches for 4.5

2015-11-30 Thread Chuck Lever
For 4.5, I'd like to address the send queue accounting and
invalidation/unmap ordering issues Jason brought up a couple of
months ago. Thanks to all reviewers of v1.

Also available in the "nfs-rdma-for-4.5" topic branch of this git repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git

Or for browsing:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.5

Changes since v1:

- Rebased on v4.4-rc3
- Receive buffer safety margin patch dropped
- Backchannel pr_err and pr_info converted to dprintk
- Backchannel spin locks converted to work queue-safe locks
- Fixed premature release of backchannel request buffer
- NFSv4.1 callbacks tested with for-4.5 server

---

Chuck Lever (11):
  xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)
  xprtrdma: xprt_rdma_free() must not release backchannel reqs
  xprtrdma: Disable RPC/RDMA backchannel debugging messages
  xprtrdma: Move struct ib_send_wr off the stack
  xprtrdma: Introduce ro_unmap_sync method
  xprtrdma: Add ro_unmap_sync method for FRWR
  xprtrdma: Add ro_unmap_sync method for FMR
  xprtrdma: Add ro_unmap_sync method for all-physical registration
  SUNRPC: Introduce xprt_commit_rqst()
  xprtrdma: Invalidate in the RPC reply handler
  xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').


 include/linux/sunrpc/xprt.h|1 
 net/sunrpc/xprt.c  |   14 +++
 net/sunrpc/xprtrdma/backchannel.c  |   22 ++---
 net/sunrpc/xprtrdma/fmr_ops.c  |   64 +
 net/sunrpc/xprtrdma/frwr_ops.c |  175 +++-
 net/sunrpc/xprtrdma/physical_ops.c |   13 +++
 net/sunrpc/xprtrdma/rpc_rdma.c |   14 +++
 net/sunrpc/xprtrdma/transport.c|3 +
 net/sunrpc/xprtrdma/verbs.c|   13 +--
 net/sunrpc/xprtrdma/xprt_rdma.h|   12 +-
 10 files changed, 283 insertions(+), 48 deletions(-)

--
Chuck Lever


[PATCH v2 02/11] xprtrdma: xprt_rdma_free() must not release backchannel reqs

2015-11-30 Thread Chuck Lever
Preserve any rpcrdma_req that is attached to rpc_rqst's allocated
for the backchannel. Otherwise, after all the pre-allocated
backchannel req's are consumed, incoming backward calls start
writing on freed memory.

Somehow this hunk got lost.

Fixes: f531a5dbc451 ('xprtrdma: Pre-allocate backward rpc_rqst')
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/transport.c |3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 8c545f7..740bddc 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -576,6 +576,9 @@ xprt_rdma_free(void *buffer)
 
rb = container_of(buffer, struct rpcrdma_regbuf, rg_base[0]);
req = rb->rg_owner;
+   if (req->rl_backchannel)
+   return;
+
r_xprt = container_of(req->rl_buffer, struct rpcrdma_xprt, rx_buf);
 
dprintk("RPC:   %s: called on 0x%p\n", __func__, req->rl_reply);



Re: [PATCH v1 3/8] svcrdma: Add svc_rdma_get_context() API that is allowed to fail

2015-11-24 Thread Chuck Lever

> On Nov 24, 2015, at 3:02 PM, Christoph Hellwig  wrote:
> 
> On Tue, Nov 24, 2015 at 09:24:51AM -0500, Chuck Lever wrote:
>> There is only one (new) call site that needs it. I can simplify
>> this patch as Sagi suggested before, but it seems silly to
>> introduce the extra clutter of adding a gfp_t argument
>> everywhere.
> 
> We a) generally try to pass the gfp_t around if we expect calling
> contexts to change, and b the changes to the 6 callers are probably
> still smaller than this patch :)

I’ll post a v2 early next week. It will be smaller and simpler.


>>> And if we have any way to avoid the __GFP_NOFAIL
>>> I'd really appreciate if we could give that a try.
>> 
>> I'm not introducing the flag here.
>> 
>> Changing all the svc_rdma_get_context() call sites to handle
>> allocation failure (when it is already highly unlikely) is
>> a lot of needless work, IMO, and not related to supporting
>> bi-directional RPC.
> 
> Ok.

--
Chuck Lever






Re: [PATCH v1 7/9] SUNRPC: Introduce xprt_commit_rqst()

2015-11-24 Thread Chuck Lever

> On Nov 24, 2015, at 2:54 PM, Anna Schumaker  wrote:
> 
> Hi Chuck,
> 
> On 11/23/2015 05:14 PM, Chuck Lever wrote:
>> I'm about to add code in the RPC/RDMA reply handler between the
>> xprt_lookup_rqst() and xprt_complete_rqst() call site that needs
>> to execute outside of spinlock critical sections.
>> 
>> Add a hook to remove an rpc_rqst from the pending list once
>> the transport knows it's going to invoke xprt_complete_rqst().
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>> include/linux/sunrpc/xprt.h|1 +
>> net/sunrpc/xprt.c  |   14 ++
>> net/sunrpc/xprtrdma/rpc_rdma.c |4 
>> 3 files changed, 19 insertions(+)
>> 
>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>> index 69ef5b3..ab6c3a5 100644
>> --- a/include/linux/sunrpc/xprt.h
>> +++ b/include/linux/sunrpc/xprt.h
>> @@ -366,6 +366,7 @@ void 
>> xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action);
>> void xprt_write_space(struct rpc_xprt *xprt);
>> void xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task 
>> *task, int result);
>> struct rpc_rqst *xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
>> +voidxprt_commit_rqst(struct rpc_task *task);
>> void xprt_complete_rqst(struct rpc_task *task, int copied);
>> void xprt_release_rqst_cong(struct rpc_task *task);
>> void xprt_disconnect_done(struct rpc_xprt *xprt);
>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>> index 2e98f4a..a5be4ab 100644
>> --- a/net/sunrpc/xprt.c
>> +++ b/net/sunrpc/xprt.c
>> @@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task)
>> }
>> 
>> /**
>> + * xprt_commit_rqst - remove rqst from pending list early
>> + * @task: RPC request to remove
> 
> Is xprt_commit_rqst() the right name for this function?  Removing a request 
> from a list isn't how I would expect a commit to work.

“commit” means the request is committed: we have a parse-able
reply and will proceed to completion. The name does not reflect
the mechanism, but the policy.

Any suggestions on a different name?


> Anna
> 
>> + *
>> + * Caller holds transport lock.
>> + */
>> +void xprt_commit_rqst(struct rpc_task *task)
>> +{
>> +struct rpc_rqst *req = task->tk_rqstp;
>> +
>> +list_del_init(&req->rq_list);
>> +}
>> +EXPORT_SYMBOL_GPL(xprt_commit_rqst);
>> +
>> +/**
>>  * xprt_complete_rqst - called when reply processing is complete
>>  * @task: RPC request that recently completed
>>  * @copied: actual number of bytes received from the transport
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index a169252..d7b9156 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -811,6 +811,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>  if (req->rl_reply)
>>  goto out_duplicate;
>> 
>> +xprt_commit_rqst(rqst->rq_task);
>> +spin_unlock_bh(&xprt->transport_lock);
>> +
>>  dprintk("RPC:   %s: reply 0x%p completes request 0x%p\n"
>>  "   RPC request 0x%p xid 0x%08x\n",
>>  __func__, rep, req, rqst,
>> @@ -901,6 +904,7 @@ badheader:
>>      else if (credits > r_xprt->rx_buf.rb_max_requests)
>>  credits = r_xprt->rx_buf.rb_max_requests;
>> 
>> +spin_lock_bh(&xprt->transport_lock);
>>  cwnd = xprt->cwnd;
>>  xprt->cwnd = credits << RPC_CWNDSHIFT;
>>  if (xprt->cwnd > cwnd)
>> 

--
Chuck Lever






Re: [PATCH v1 8/8] svcrdma: Remove svc_rdma_fastreg_mr::access_flags field

2015-11-24 Thread Chuck Lever

> On Nov 24, 2015, at 11:03 AM, Christoph Hellwig  wrote:
> 
> On Tue, Nov 24, 2015 at 09:08:21AM -0500, Chuck Lever wrote:
>> Why don't you fold my change into yours?
> 
> It's already included.  Well, sort of - I have removed used of the
> field, but forgot to remove the definition.  I will update it.

Excellent, that works for me.

--
Chuck Lever






Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method

2015-11-24 Thread Chuck Lever

> On Nov 24, 2015, at 9:44 AM, Sagi Grimberg  wrote:
> 
> Hey Chuck,
> 
>> 
>>> It is painful, too painful. The entire value proposition of RDMA is
>>> low-latency and waiting for the extra HW round-trip for a local
>>> invalidation to complete is unacceptable; moreover, it adds huge loads
>>> of extra interrupts and cache-line pollution.
>> 
>> The killer is the extra context switches, I’ve found.
> 
> That too...
> 
>> I’ve noticed only a marginal loss of performance on modern
>> hardware.
> 
> Would you mind sharing your observations?

I’m testing with CX-3 Pro on FDR.

NFS READ and WRITE round trip latency, which includes the cost
of registration and now invalidation, is not noticeably longer.
dbench and fio results are marginally slower (in the neighborhood
of 5%).

For NFS, the cost of invalidation is probably not significant
compared to other bottlenecks in our stack (lock contention and
scheduling overhead are likely the largest contributors).

Notice that xprtrdma chains together all the LOCAL_INV WRs for
an RPC, and only signals the final one. Before, every LOCAL_INV
WR was signaled. So this patch actually reduces the send
completion rate.

The main benefit for NFS of waiting for invalidation to complete
is better send queue accounting. Even without the data integrity
issue, we have to ensure the WQEs consumed by invalidation
requests are released before dispatching another RPC. Otherwise
the send queue can be overrun.
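
As a stand-alone illustration of that chaining (a mocked-up WR type for
this sketch, not struct ib_send_wr or the xprtrdma code): each MR gets
one LOCAL_INV WR, the WRs are linked through ->next, and only the final
WR is marked signaled, so the whole chain produces a single completion.

#include <stdbool.h>
#include <stdio.h>

struct fake_send_wr {
    struct fake_send_wr *next;
    unsigned int invalidate_rkey;
    bool signaled;
};

/* Build one LOCAL_INV WR per rkey; signal only the last one. */
static struct fake_send_wr *chain_local_inv(struct fake_send_wr *wrs,
                                            const unsigned int *rkeys, int n)
{
    struct fake_send_wr *first = NULL, **prev = &first;
    int i;

    for (i = 0; i < n; i++) {
        wrs[i].invalidate_rkey = rkeys[i];
        wrs[i].signaled = (i == n - 1);
        wrs[i].next = NULL;
        *prev = &wrs[i];
        prev = &wrs[i].next;
    }
    return first;   /* the whole chain is posted in one call */
}

int main(void)
{
    unsigned int rkeys[] = { 0x101, 0x102, 0x103 };
    struct fake_send_wr wrs[3];
    const struct fake_send_wr *wr;

    for (wr = chain_local_inv(wrs, rkeys, 3); wr; wr = wr->next)
        printf("LOCAL_INV rkey=0x%x signaled=%d\n",
               wr->invalidate_rkey, wr->signaled);
    return 0;
}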


--
Chuck Lever






Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method

2015-11-24 Thread Chuck Lever

> On Nov 24, 2015, at 5:59 AM, Sagi Grimberg  wrote:
> 
> 
> 
> On 24/11/2015 08:45, Christoph Hellwig wrote:
>> On Mon, Nov 23, 2015 at 05:14:14PM -0500, Chuck Lever wrote:
>>> In the current xprtrdma implementation, some memreg strategies
>>> implement ro_unmap synchronously (the MR is knocked down before the
>>> method returns) and some asynchronously (the MR will be knocked down
>>> and returned to the pool in the background).
>>> 
>>> To guarantee the MR is truly invalid before the RPC consumer is
>>> allowed to resume execution, we need an unmap method that is
>>> always synchronous, invoked from the RPC/RDMA reply handler.
>>> 
>>> The new method unmaps all MRs for an RPC. The existing ro_unmap
>>> method unmaps only one MR at a time.
>> 
>> Do we really want to go down that road?  It seems like we've decided
>> in general that, while the protocol specs say the MR must be unmapped before
>> proceeding with the data, it is painful enough to ignore this
>> requirement.  iSER, for example, only does the local invalidate
>> just before reusing the MR.

That leaves the MR exposed to the remote indefinitely. If
the MR is registered for remote write, that seems hazardous.


> It is painful, too painful. The entire value proposition of RDMA is
> low-latency and waiting for the extra HW round-trip for a local
> invalidation to complete is unacceptable; moreover, it adds huge loads
> of extra interrupts and cache-line pollution.

The killer is the extra context switches, I’ve found.


> As I see it, if we don't wait for local-invalidate to complete before
> unmap and IO completion (and no one does) then local invalidate before
> re-use is only marginally worse. For iSER, remote invalidate solves this 
> (patches submitted!) and I'd say we should push for all the
> storage standards to include remote invalidate.

I agree: the right answer is to use remote invalidation,
and to ensure the order is always:

  1. invalidate the MR
  2. unmap the MR
  3. wake up the consumer

And that is exactly my strategy for NFS/RDMA. I don’t have
a choice: as Tom observed yesterday, krb5i is meaningless
unless the integrity of the data is guaranteed by fencing
the server before the client performs checksumming. I
expect the same is true for T10-PI.
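
Purely as an illustration of that ordering (the function names below are
invented for this sketch; they are not the xprtrdma symbols):

#include <stdio.h>

static void invalidate_mrs(void)
{
    /* 1. Post LOCAL_INV (or rely on remote invalidation) and wait
     *    for the completion; the peer can no longer use the rkeys. */
    puts("1: invalidate");
}

static void unmap_mrs(void)
{
    /* 2. DMA-unmap the buffers; the CPU regains ownership. */
    puts("2: unmap");
}

static void wake_consumer(void)
{
    /* 3. Only now may the RPC consumer (e.g. krb5i checksumming)
     *    look at the received data. */
    puts("3: complete the rqst");
}

int main(void)
{
    invalidate_mrs();
    unmap_mrs();
    wake_consumer();
    return 0;
}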


> There is the question
> of multi-rkey transactions, which is why I stated in the past that
> arbitrary sg registration is important (which will be submitted soon
> for ConnectX-4).
> 
> Waiting for local invalidate to complete would be a really big
> sacrifice for our storage ULPs.

I’ve noticed only a marginal loss of performance on modern
hardware.


--
Chuck Lever






Re: [PATCH v1 3/8] svcrdma: Add svc_rdma_get_context() API that is allowed to fail

2015-11-24 Thread Chuck Lever

> On Nov 24, 2015, at 1:55 AM, Christoph Hellwig  wrote:
> 
>> +struct svc_rdma_op_ctxt *svc_rdma_get_context_gfp(struct svcxprt_rdma *xprt,
>> +  gfp_t flags)
>> +{
>> +struct svc_rdma_op_ctxt *ctxt;
>> +
>> +ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep, flags);
>> +if (!ctxt)
>> +return NULL;
>> +svc_rdma_init_context(xprt, ctxt);
>> +return ctxt;
>> +}
>> +
>> +struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
>> +{
>> +struct svc_rdma_op_ctxt *ctxt;
>> +
>> +ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep,
>> +GFP_KERNEL | __GFP_NOFAIL);
>> +svc_rdma_init_context(xprt, ctxt);
>>  return ctxt;
> 
> Sounds like you should have just added a gfp_t argument to
> svc_rdma_get_context.

There is only one (new) call site that needs it. I can simplify
this patch as Sagi suggested before, but it seems silly to
introduce the extra clutter of adding a gfp_t argument
everywhere.


> And if we have any way to avoid the __GFP_NOFAIL
> I'd really appreciate if we could give that a try.

I’m not introducing the flag here.

Changing all the svc_rdma_get_context() call sites to handle
allocation failure (when it is already highly unlikely) is
a lot of needless work, IMO, and not related to supporting
bi-directional RPC.

--
Chuck Lever






Re: [PATCH v1 8/8] svcrdma: Remove svc_rdma_fastreg_mr::access_flags field

2015-11-24 Thread Chuck Lever

> On Nov 24, 2015, at 1:39 AM, Christoph Hellwig  wrote:
> 
> On Mon, Nov 23, 2015 at 07:53:04PM -0500, Chuck Lever wrote:
>>> Wait, the REMOTE_WRITE is there to support iWARP, but it isn't
>>> needed for IB or RoCE. Shouldn't this be updated to peek at those
>>> new attributes to decide, instead of remaining unconditional?
>> 
>> That's coming in another patch from Christoph.
> 
> Can you drop this patch so that we have less conflicts with that one,
> assuming this series goes in through the NFS tree, and the memory
> registration changes go in through the RDMA tree?

Why don’t you fold my change into yours?


--
Chuck Lever






Re: [PATCH v1 5/8] svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies

2015-11-23 Thread Chuck Lever

> On Nov 23, 2015, at 7:44 PM, Tom Talpey  wrote:
> 
> On 11/23/2015 5:20 PM, Chuck Lever wrote:
>> To support the NFSv4.1 backchannel on RDMA connections, add a
>> capability for receiving an RPC/RDMA reply on a connection
>> established by a client.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>>  net/sunrpc/xprtrdma/rpc_rdma.c  |   76 
>> +++
>>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   60 
>>  net/sunrpc/xprtrdma/xprt_rdma.h |4 ++
>>  3 files changed, 140 insertions(+)
>> 
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index c10d969..fef0623 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -946,3 +946,79 @@ repost:
>>  if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, &r_xprt->rx_ep, rep))
>>  rpcrdma_recv_buffer_put(rep);
>>  }
>> +
>> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
>> +
>> +int
>> +rpcrdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
>> +struct xdr_buf *rcvbuf)
>> +{
>> +struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
>> +struct kvec *dst, *src = &rcvbuf->head[0];
>> +struct rpc_rqst *req;
>> +unsigned long cwnd;
>> +u32 credits;
>> +size_t len;
>> +__be32 xid;
>> +__be32 *p;
>> +int ret;
>> +
>> +p = (__be32 *)src->iov_base;
>> +len = src->iov_len;
>> +xid = rmsgp->rm_xid;
>> +
>> +pr_info("%s: xid=%08x, length=%zu\n",
>> +__func__, be32_to_cpu(xid), len);
>> +pr_info("%s: RPC/RDMA: %*ph\n",
>> +__func__, (int)RPCRDMA_HDRLEN_MIN, rmsgp);
>> +pr_info("%s:  RPC: %*ph\n",
>> +__func__, (int)len, p);
>> +
>> +ret = -EAGAIN;
>> +if (src->iov_len < 24)
>> +goto out_shortreply;
>> +
>> +spin_lock_bh(&xprt->transport_lock);
>> +req = xprt_lookup_rqst(xprt, xid);
>> +if (!req)
>> +goto out_notfound;
>> +
>> +dst = &req->rq_private_buf.head[0];
>> +memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf));
>> +if (dst->iov_len < len)
>> +goto out_unlock;
>> +memcpy(dst->iov_base, p, len);
>> +
>> +credits = be32_to_cpu(rmsgp->rm_credit);
>> +if (credits == 0)
>> +credits = 1;/* don't deadlock */
>> +else if (credits > r_xprt->rx_buf.rb_bc_max_requests)
>> +credits = r_xprt->rx_buf.rb_bc_max_requests;
>> +
>> +cwnd = xprt->cwnd;
>> +xprt->cwnd = credits << RPC_CWNDSHIFT;
>> +if (xprt->cwnd > cwnd)
>> +xprt_release_rqst_cong(req->rq_task);
>> +
>> +ret = 0;
>> +xprt_complete_rqst(req->rq_task, rcvbuf->len);
>> +rcvbuf->len = 0;
>> +
>> +out_unlock:
>> +spin_unlock_bh(&xprt->transport_lock);
>> +out:
>> +return ret;
>> +
>> +out_shortreply:
>> +pr_info("svcrdma: short bc reply: xprt=%p, len=%zu\n",
>> +xprt, src->iov_len);
>> +goto out;
>> +
>> +out_notfound:
>> +pr_info("svcrdma: unrecognized bc reply: xprt=%p, xid=%08x\n",
>> +xprt, be32_to_cpu(xid));
>> +
>> +goto out_unlock;
>> +}
>> +
>> +#endif  /* CONFIG_SUNRPC_BACKCHANNEL */
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
>> b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> index ff4f01e..2b762b5 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> @@ -47,6 +47,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include "xprt_rdma.h"
>> 
>>  #define RPCDBG_FACILITY RPCDBG_SVCXPRT
>> 
>> @@ -567,6 +568,42 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
>>  return ret;
>>  }
>> 
>> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
>> +
>> +/* By convention, backchannel calls arrive via rdma_msg type
>> + * messages, and never populate the chunk lists. This makes
>> + * the RPC/RDMA header small and fixed in size, so it is
>> + * straightforward to check the RPC header's direction field.
>> + */
>> +static bool
>> +svc_rdma_is_backchannel_reply(stru

Re: [PATCH v1 1/9] xprtrdma: Add a safety margin for receive buffers

2015-11-23 Thread Chuck Lever

> On Nov 23, 2015, at 8:22 PM, Tom Talpey  wrote:
> 
> On 11/23/2015 8:16 PM, Chuck Lever wrote:
>> 
>>> On Nov 23, 2015, at 7:55 PM, Tom Talpey  wrote:
>>> 
>>> On 11/23/2015 5:13 PM, Chuck Lever wrote:
>>>> Rarely, senders post a Send that is larger than the client's inline
>>>> threshold. That can be due to a bug, or the client and server may
>>>> not have communicated about their inline limits. RPC-over-RDMA
>>>> currently doesn't specify any particular limit on inline size, so
>>>> peers have to guess what it is.
>>>> 
>>>> It is fatal to the connection if the size of a Send is larger than
>>>> the client's receive buffer. The sender is likely to retry with the
>>>> same message size, so the workload is stuck at that point.
>>>> 
>>>> Follow Postel's robustness principle: Be conservative in what you
>>>> do, be liberal in what you accept from others. Increase the size of
>>>> client receive buffers by a safety margin, and add a warning when
>>>> the inline threshold is exceeded during receive.
>>> 
>>> Safety is good, but how do know the chosen value is enough?
>>> Isn't it better to fail the badly-composed request and be done
>>> with it? Even if the stupid sender loops, which it will do
>>> anyway.
>> 
>> It’s good enough to compensate for the most common sender bug,
>> which is that the sender did not account for the 28 bytes of
>> the RPC-over-RDMA header when it built the send buffer. The
>> additional 100 byte margin is gravy.
> 
> I think it's good to have sympathy and resilience to differing
> designs on the other end of the wire, but I fail to have it for
> stupid bugs. Unless this can take down the receiver, fail it fast.
> 
> MHO.

See above: the client can’t tell why it’s failed.

Again, the Send on the server side fails with LOCAL_LEN_ERR
and the connection is terminated. The client sees only the
connection loss. There’s no distinction between this and a
cable pull or a server crash, where you do want the client
to retransmit.

I agree it’s a dumb server bug. But the idea is to preserve
the connection, since NFS retransmits are a hazard.

Just floating this idea here; this is v1. This one can be
dropped.


>> The loop occurs because the server gets a completion error.
>> The client just sees a connection loss. There’s no way for it
>> to know it should fail the RPC, so it keeps trying.
>> 
>> Perhaps the server could remember that the reply failed, and
>> when the client retransmits, it can simply return that XID
>> with an RDMA_ERROR.
>> 
>> 
>>>> Note the Linux server's receive buffers are already page-sized.
>>>> 
>>>> Signed-off-by: Chuck Lever 
>>>> ---
>>>>  net/sunrpc/xprtrdma/rpc_rdma.c  |7 +++
>>>>  net/sunrpc/xprtrdma/verbs.c |6 +-
>>>>  net/sunrpc/xprtrdma/xprt_rdma.h |5 +
>>>>  3 files changed, 17 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c 
>>>> b/net/sunrpc/xprtrdma/rpc_rdma.c
>>>> index c10d969..a169252 100644
>>>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>>>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>>>> @@ -776,6 +776,7 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>>int rdmalen, status;
>>>>unsigned long cwnd;
>>>>u32 credits;
>>>> +  RPC_IFDEBUG(struct rpcrdma_create_data_internal *cdata);
>>>> 
>>>>dprintk("RPC:   %s: incoming rep %p\n", __func__, rep);
>>>> 
>>>> @@ -783,6 +784,12 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>>>goto out_badstatus;
>>>>if (rep->rr_len < RPCRDMA_HDRLEN_MIN)
>>>>goto out_shortreply;
>>>> +#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>>>> +  cdata = &r_xprt->rx_data;
>>>> +  if (rep->rr_len > cdata->inline_rsize)
>>>> +  pr_warn("RPC: %u byte reply exceeds inline threshold\n",
>>>> +  rep->rr_len);
>>>> +#endif
>>>> 
>>>>headerp = rdmab_to_msg(rep->rr_rdmabuf);
>>>>if (headerp->rm_vers != rpcrdma_version)
>>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>>> index eadd1655..e3f12e2 100644
>>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>&

Re: [PATCH v1 2/8] svcrdma: Define maximum number of backchannel requests

2015-11-23 Thread Chuck Lever

> On Nov 23, 2015, at 8:19 PM, Tom Talpey  wrote:
> 
> On 11/23/2015 8:09 PM, Chuck Lever wrote:
>> 
>>> On Nov 23, 2015, at 7:39 PM, Tom Talpey  wrote:
>>> 
>>> On 11/23/2015 5:20 PM, Chuck Lever wrote:
>>>> Extra resources for handling backchannel requests have to be
>>>> pre-allocated when a transport instance is created. Set a limit.
>>>> 
>>>> Signed-off-by: Chuck Lever 
>>>> ---
>>>>  include/linux/sunrpc/svc_rdma.h  |5 +
>>>>  net/sunrpc/xprtrdma/svc_rdma_transport.c |6 +-
>>>>  2 files changed, 10 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/include/linux/sunrpc/svc_rdma.h 
>>>> b/include/linux/sunrpc/svc_rdma.h
>>>> index f869807..478aa30 100644
>>>> --- a/include/linux/sunrpc/svc_rdma.h
>>>> +++ b/include/linux/sunrpc/svc_rdma.h
>>>> @@ -178,6 +178,11 @@ struct svcxprt_rdma {
>>>>  #define RPCRDMA_SQ_DEPTH_MULT   8
>>>>  #define RPCRDMA_MAX_REQUESTS32
>>>>  #define RPCRDMA_MAX_REQ_SIZE4096
>>>> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
>>> 
>>> Why is this a config option? Why wouldn't you always want
>>> this? It's needed for any post-1990 NFS dialect.
>> 
>> I think some distros want to be able to compile out NFSv4.x
>> on small systems, and take all the backchannel cruft with it.
> 
> So shouldn't it follow the NFSv4.x config options then?

Setting CONFIG_NFS_V4_1 sets CONFIG_SUNRPC_BACKCHANNEL.
Adding #ifdef CONFIG_NFS_V4_1 in net/sunrpc would be
a layering violation.

I see however that CONFIG_SUNRPC_BACKCHANNEL controls
only the client backchannel capability. Perhaps it is
out of place to use it to enable the server’s backchannel
capability.


>>>> +#define RPCRDMA_MAX_BC_REQUESTS   8
>>> 
>>> Why a constant 8? The forward channel value is apparently
>>> configurable, just a few lines down.
>> 
>> The client side backward direction credit limit, now
>> in 4.4, is already a constant.
>> 
>> The client side ULP uses a constant for the slot table
>> size: NFS4_MAX_BACK_CHANNEL_OPS. I’m not 100% sure but
>> the server seems to just echo that number back to the
>> client.
>> 
>> I’d rather not add an admin knob for this. Why would it
>> be necessary?
> 
> Because no constant is ever correct. Why isn't it "1"? Do
> you allow multiple credits? Why not that value?
> 
> For instance.

There’s no justification for the forward channel credit
limit either.

The code in Linux assumes one session slot in the NFSv4.1
backchannel. When we get around to it, this can be made
more flexible.

It’s much easier to add flexibility and admin control later
than it is to take it away when the knob becomes useless or
badly designed. For now, 8 works, and it doesn’t have to be
permanent.

I could add a comment that says

/* Arbitrary: support up to eight backward credits.
 */


>>>> +#else
>>>> +#define RPCRDMA_MAX_BC_REQUESTS   0
>>>> +#endif
>>>> 
>>>>  #define RPCSVC_MAXPAYLOAD_RDMARPCSVC_MAXPAYLOAD
>>>> 
>>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
>>>> b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>> index b348b4a..01c7b36 100644
>>>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>> @@ -923,8 +923,10 @@ static struct svc_xprt *svc_rdma_accept(struct 
>>>> svc_xprt *xprt)
>>>>  (size_t)RPCSVC_MAXPAGES);
>>>>newxprt->sc_max_sge_rd = min_t(size_t, devattr.max_sge_rd,
>>>>   RPCSVC_MAXPAGES);
>>>> +  /* XXX: what if HCA can't support enough WRs for bc operation? */
>>>>newxprt->sc_max_requests = min((size_t)devattr.max_qp_wr,
>>>> - (size_t)svcrdma_max_requests);
>>>> + (size_t)(svcrdma_max_requests +
>>>> + RPCRDMA_MAX_BC_REQUESTS));
>>>>newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_max_requests;
>>>> 
>>>>/*
>>>> @@ -964,7 +966,9 @@ static struct svc_xprt *svc_rdma_accept(struct 
>>>> svc_xprt *xprt)
>>>>qp_attr.event_handler = qp_event_handler;
>>>>qp_attr.qp_context = &newxprt->sc_xprt;
>>>>    qp_attr.cap.max_send_wr = newxprt->sc_sq_depth;
>>>> +  qp_attr.cap.max_send_

Re: [PATCH v1 1/9] xprtrdma: Add a safety margin for receive buffers

2015-11-23 Thread Chuck Lever

> On Nov 23, 2015, at 7:55 PM, Tom Talpey  wrote:
> 
> On 11/23/2015 5:13 PM, Chuck Lever wrote:
>> Rarely, senders post a Send that is larger than the client's inline
>> threshold. That can be due to a bug, or the client and server may
>> not have communicated about their inline limits. RPC-over-RDMA
>> currently doesn't specify any particular limit on inline size, so
>> peers have to guess what it is.
>> 
>> It is fatal to the connection if the size of a Send is larger than
>> the client's receive buffer. The sender is likely to retry with the
>> same message size, so the workload is stuck at that point.
>> 
>> Follow Postel's robustness principle: Be conservative in what you
>> do, be liberal in what you accept from others. Increase the size of
>> client receive buffers by a safety margin, and add a warning when
>> the inline threshold is exceeded during receive.
> 
> Safety is good, but how do know the chosen value is enough?
> Isn't it better to fail the badly-composed request and be done
> with it? Even if the stupid sender loops, which it will do
> anyway.

It’s good enough to compensate for the most common sender bug,
which is that the sender did not account for the 28 bytes of
the RPC-over-RDMA header when it built the send buffer. The
additional 100 byte margin is gravy.
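
For reference, here is where the 28 comes from -- a stand-alone sketch
(the struct below is illustrative, not the kernel definition; the kernel
expresses the same thing as RPCRDMA_HDRLEN_MIN, seven 32-bit XDR words):

#include <stdint.h>
#include <stdio.h>

/* Fixed part of a chunk-less RPC-over-RDMA header. */
struct rpcrdma_hdr_min {
    uint32_t rm_xid;
    uint32_t rm_vers;
    uint32_t rm_credit;
    uint32_t rm_type;
    uint32_t rm_chunks[3];  /* read, write, reply lists: all "not present" */
};

int main(void)
{
    /* 4 header words + 3 empty chunk-list discriminators = 7 * 4 = 28 */
    printf("minimal RPC-over-RDMA header: %zu bytes\n",
           sizeof(struct rpcrdma_hdr_min));
    return 0;
}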

The loop occurs because the server gets a completion error.
The client just sees a connection loss. There’s no way for it
to know it should fail the RPC, so it keeps trying.

Perhaps the server could remember that the reply failed, and
when the client retransmits, it can simply return that XID
with an RDMA_ERROR.


>> Note the Linux server's receive buffers are already page-sized.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>>  net/sunrpc/xprtrdma/rpc_rdma.c  |7 +++
>>  net/sunrpc/xprtrdma/verbs.c |6 +-
>>  net/sunrpc/xprtrdma/xprt_rdma.h |5 +
>>  3 files changed, 17 insertions(+), 1 deletion(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index c10d969..a169252 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -776,6 +776,7 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>  int rdmalen, status;
>>  unsigned long cwnd;
>>  u32 credits;
>> +RPC_IFDEBUG(struct rpcrdma_create_data_internal *cdata);
>> 
>>  dprintk("RPC:   %s: incoming rep %p\n", __func__, rep);
>> 
>> @@ -783,6 +784,12 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>  goto out_badstatus;
>>  if (rep->rr_len < RPCRDMA_HDRLEN_MIN)
>>  goto out_shortreply;
>> +#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
>> +cdata = &r_xprt->rx_data;
>> +if (rep->rr_len > cdata->inline_rsize)
>> +pr_warn("RPC: %u byte reply exceeds inline threshold\n",
>> +rep->rr_len);
>> +#endif
>> 
>>  headerp = rdmab_to_msg(rep->rr_rdmabuf);
>>  if (headerp->rm_vers != rpcrdma_version)
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index eadd1655..e3f12e2 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -924,7 +924,11 @@ rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
>>  if (rep == NULL)
>>  goto out;
>> 
>> -rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize,
>> +/* The actual size of our receive buffers is increased slightly
>> + * to prevent small receive overruns from killing our connection.
>> + */
>> +rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize +
>> +   RPCRDMA_RECV_MARGIN,
>> GFP_KERNEL);
>>  if (IS_ERR(rep->rr_rdmabuf)) {
>>  rc = PTR_ERR(rep->rr_rdmabuf);
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h 
>> b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index ac7f8d4..1b72ab1 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -337,6 +337,11 @@ struct rpcrdma_create_data_internal {
>>  #define RPCRDMA_INLINE_PAD_VALUE(rq)\
>>  rpcx_to_rdmad(rq->rq_xprt).padding
>> 
>> +/* To help prevent spurious connection shutdown, allow senders to
>> + * overrun our receive inline threshold by a small bit.
>> + */
>> +#define RPCRDMA_RECV_MARGIN (128)
>> +
>>  /*
>>   * Statistics for RPCRDMA
>>   */
>> 

--
Chuck Lever






Re: [PATCH v1 2/8] svcrdma: Define maximum number of backchannel requests

2015-11-23 Thread Chuck Lever

> On Nov 23, 2015, at 7:39 PM, Tom Talpey  wrote:
> 
> On 11/23/2015 5:20 PM, Chuck Lever wrote:
>> Extra resources for handling backchannel requests have to be
>> pre-allocated when a transport instance is created. Set a limit.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>>  include/linux/sunrpc/svc_rdma.h  |5 +
>>  net/sunrpc/xprtrdma/svc_rdma_transport.c |6 +-
>>  2 files changed, 10 insertions(+), 1 deletion(-)
>> 
>> diff --git a/include/linux/sunrpc/svc_rdma.h 
>> b/include/linux/sunrpc/svc_rdma.h
>> index f869807..478aa30 100644
>> --- a/include/linux/sunrpc/svc_rdma.h
>> +++ b/include/linux/sunrpc/svc_rdma.h
>> @@ -178,6 +178,11 @@ struct svcxprt_rdma {
>>  #define RPCRDMA_SQ_DEPTH_MULT   8
>>  #define RPCRDMA_MAX_REQUESTS32
>>  #define RPCRDMA_MAX_REQ_SIZE4096
>> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
> 
> Why is this a config option? Why wouldn't you always want
> this? It's needed for any post-1990 NFS dialect.

I think some distros want to be able to compile out NFSv4.x
on small systems, and take all the backchannel cruft with it.


>> +#define RPCRDMA_MAX_BC_REQUESTS 8
> 
> Why a constant 8? The forward channel value is apparently
> configurable, just a few lines down.

The client side backward direction credit limit, now
in 4.4, is already a constant.

The client side ULP uses a constant for the slot table
size: NFS4_MAX_BACK_CHANNEL_OPS. I’m not 100% sure but
the server seems to just echo that number back to the
client.

I’d rather not add an admin knob for this. Why would it
be necessary?


>> +#else
>> +#define RPCRDMA_MAX_BC_REQUESTS 0
>> +#endif
>> 
>>  #define RPCSVC_MAXPAYLOAD_RDMA  RPCSVC_MAXPAYLOAD
>> 
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
>> b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> index b348b4a..01c7b36 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> @@ -923,8 +923,10 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt 
>> *xprt)
>>(size_t)RPCSVC_MAXPAGES);
>>  newxprt->sc_max_sge_rd = min_t(size_t, devattr.max_sge_rd,
>> RPCSVC_MAXPAGES);
>> +/* XXX: what if HCA can't support enough WRs for bc operation? */
>>  newxprt->sc_max_requests = min((size_t)devattr.max_qp_wr,
>> -   (size_t)svcrdma_max_requests);
>> +   (size_t)(svcrdma_max_requests +
>> +   RPCRDMA_MAX_BC_REQUESTS));
>>  newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_max_requests;
>> 
>>  /*
>> @@ -964,7 +966,9 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt 
>> *xprt)
>>  qp_attr.event_handler = qp_event_handler;
>>  qp_attr.qp_context = &newxprt->sc_xprt;
>>  qp_attr.cap.max_send_wr = newxprt->sc_sq_depth;
>> +qp_attr.cap.max_send_wr += RPCRDMA_MAX_BC_REQUESTS;
>>  qp_attr.cap.max_recv_wr = newxprt->sc_max_requests;
>> +qp_attr.cap.max_recv_wr += RPCRDMA_MAX_BC_REQUESTS;
>>  qp_attr.cap.max_send_sge = newxprt->sc_max_sge;
>>  qp_attr.cap.max_recv_sge = newxprt->sc_max_sge;
>>  qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
>> 

--
Chuck Lever






Re: [PATCH v1 8/8] svcrdma: Remove svc_rdma_fastreg_mr::access_flags field

2015-11-23 Thread Chuck Lever

> On Nov 23, 2015, at 7:52 PM, Tom Talpey  wrote:
> 
> On 11/23/2015 5:21 PM, Chuck Lever wrote:
>> Clean up: The access_flags field is not used outside of
>> rdma_read_chunk_frmr() and is always set to the same value.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>>  include/linux/sunrpc/svc_rdma.h |1 -
>>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |3 +--
>>  2 files changed, 1 insertion(+), 3 deletions(-)
>> 
>> diff --git a/include/linux/sunrpc/svc_rdma.h 
>> b/include/linux/sunrpc/svc_rdma.h
>> index 243edf4..eee2a0d 100644
>> --- a/include/linux/sunrpc/svc_rdma.h
>> +++ b/include/linux/sunrpc/svc_rdma.h
>> @@ -107,7 +107,6 @@ struct svc_rdma_fastreg_mr {
>>  struct ib_mr *mr;
>>  struct scatterlist *sg;
>>  int sg_nents;
>> -unsigned long access_flags;
>>  enum dma_data_direction direction;
>>  struct list_head frmr_list;
>>  };
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
>> b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> index 9480043..8ab1ab5 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
>> @@ -240,7 +240,6 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
>>  read = min_t(int, (nents << PAGE_SHIFT) - *page_offset, rs_length);
>> 
>>  frmr->direction = DMA_FROM_DEVICE;
>> -frmr->access_flags = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE);
>>  frmr->sg_nents = nents;
>> 
>>  for (pno = 0; pno < nents; pno++) {
>> @@ -308,7 +307,7 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
>>  reg_wr.wr.num_sge = 0;
>>  reg_wr.mr = frmr->mr;
>>  reg_wr.key = frmr->mr->lkey;
>> -reg_wr.access = frmr->access_flags;
>> +reg_wr.access = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE);
> 
> Wait, the REMOTE_WRITE is there to support iWARP, but it isn't
> needed for IB or RoCE. Shouldn't this be updated to peek at those
> new attributes to decide, instead of remaining unconditional?

That’s coming in another patch from Christoph.


> 
> 
>>  reg_wr.wr.next = &read_wr.wr;
>> 
>>  /* Prepare RDMA_READ */
>> 

--
Chuck Lever






[PATCH v1 7/8] svcrdma: No need to count WRs in svc_rdma_send()

2015-11-23 Thread Chuck Lever
Minor optimization: Instead of counting WRs in a chain, have callers
pass in the number of WRs they've prepared.
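
As a stand-alone sketch of the difference (a mocked-up WR type, not
struct ib_send_wr): the old path walked the chain to count WRs on every
post, while the callers already know how many WRs they built.

#include <stddef.h>
#include <stdio.h>

struct fake_send_wr {
    struct fake_send_wr *next;
};

/* Old style: count by walking the chain each time it is posted. */
static int count_wrs(const struct fake_send_wr *wr)
{
    int n = 0;

    for (; wr; wr = wr->next)
        n++;
    return n;
}

int main(void)
{
    struct fake_send_wr reg_wr = { NULL }, read_wr = { NULL }, inv_wr = { NULL };
    int num_wrs = 1;            /* reg_wr */

    reg_wr.next = &read_wr;
    num_wrs++;                  /* read_wr */
    read_wr.next = &inv_wr;
    num_wrs++;                  /* optional inv_wr */

    /* New style: the caller simply passes num_wrs to the send routine. */
    printf("walked chain: %d WRs, caller-tracked: %d WRs\n",
           count_wrs(&reg_wr), num_wrs);
    return 0;
}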

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h  |2 +-
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |9 ++---
 net/sunrpc/xprtrdma/svc_rdma_sendto.c|6 +++---
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   17 ++---
 4 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 28d4e46..243edf4 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -218,7 +218,7 @@ extern int svc_rdma_bc_post_send(struct svcxprt_rdma *,
 struct svc_rdma_op_ctxt *, struct xdr_buf *);
 
 /* svc_rdma_transport.c */
-extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *);
+extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *, int);
 extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
enum rpcrdma_errcode);
 extern int svc_rdma_post_recv(struct svcxprt_rdma *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 2b762b5..9480043 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -190,7 +190,7 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
read_wr.wr.sg_list = ctxt->sge;
read_wr.wr.num_sge = pages_needed;
 
-   ret = svc_rdma_send(xprt, &read_wr.wr);
+   ret = svc_rdma_send(xprt, &read_wr.wr, 1);
if (ret) {
pr_err("svcrdma: Error %d posting RDMA_READ\n", ret);
set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
@@ -227,7 +227,7 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
int nents = PAGE_ALIGN(*page_offset + rs_length) >> PAGE_SHIFT;
struct svc_rdma_op_ctxt *ctxt = svc_rdma_get_context(xprt);
struct svc_rdma_fastreg_mr *frmr = svc_rdma_get_frmr(xprt);
-   int ret, read, pno, dma_nents, n;
+   int ret, read, pno, num_wrs, dma_nents, n;
u32 pg_off = *page_offset;
u32 pg_no = *page_no;
 
@@ -299,6 +299,8 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
ctxt->count = 1;
ctxt->read_hdr = head;
 
+   num_wrs = 2;
+
/* Prepare REG WR */
reg_wr.wr.opcode = IB_WR_REG_MR;
reg_wr.wr.wr_id = 0;
@@ -329,11 +331,12 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
inv_wr.opcode = IB_WR_LOCAL_INV;
inv_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_FENCE;
inv_wr.ex.invalidate_rkey = frmr->mr->lkey;
+   num_wrs++;
}
ctxt->wr_op = read_wr.wr.opcode;
 
/* Post the chain */
-   ret = svc_rdma_send(xprt, &reg_wr.wr);
+   ret = svc_rdma_send(xprt, &reg_wr.wr, num_wrs);
if (ret) {
pr_err("svcrdma: Error %d posting RDMA_READ\n", ret);
set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 4fe11ea..97f18b5 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -292,7 +292,7 @@ static int send_write(struct svcxprt_rdma *xprt, struct 
svc_rqst *rqstp,
 
/* Post It */
atomic_inc(&rdma_stat_write);
-   if (svc_rdma_send(xprt, &write_wr.wr))
+   if (svc_rdma_send(xprt, &write_wr.wr, 1))
goto err;
return write_len - bc;
  err:
@@ -557,7 +557,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
send_wr.opcode = IB_WR_SEND;
send_wr.send_flags =  IB_SEND_SIGNALED;
 
-   ret = svc_rdma_send(rdma, &send_wr);
+   ret = svc_rdma_send(rdma, &send_wr, 1);
if (ret)
goto err;
 
@@ -699,7 +699,7 @@ int svc_rdma_bc_post_send(struct svcxprt_rdma *rdma,
send_wr.opcode = IB_WR_SEND;
send_wr.send_flags = IB_SEND_SIGNALED;
 
-   ret = svc_rdma_send(rdma, &send_wr);
+   ret = svc_rdma_send(rdma, &send_wr, 1);
if (ret) {
svc_rdma_unmap_dma(ctxt);
ret = -EIO;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 3768a7f..40c9c84 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1284,20 +1284,15 @@ static int svc_rdma_secure_port(struct svc_rqst *rqstp)
return 1;
 }
 
-int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr)
+int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr,
+ int wr_count)
 {
-   struct ib_send_wr *bad_wr, *n_wr;
-   int wr_count;
-   int i;
-   int ret;
+   struct ib_send_wr *bad_wr;
+   int i, ret;
 
if (test

[PATCH v1 8/8] svcrdma: Remove svc_rdma_fastreg_mr::access_flags field

2015-11-23 Thread Chuck Lever
Clean up: The access_flags field is not used outside of
rdma_read_chunk_frmr() and is always set to the same value.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h |1 -
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |3 +--
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 243edf4..eee2a0d 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -107,7 +107,6 @@ struct svc_rdma_fastreg_mr {
struct ib_mr *mr;
struct scatterlist *sg;
int sg_nents;
-   unsigned long access_flags;
enum dma_data_direction direction;
struct list_head frmr_list;
 };
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 9480043..8ab1ab5 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -240,7 +240,6 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
read = min_t(int, (nents << PAGE_SHIFT) - *page_offset, rs_length);
 
frmr->direction = DMA_FROM_DEVICE;
-   frmr->access_flags = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE);
frmr->sg_nents = nents;
 
for (pno = 0; pno < nents; pno++) {
@@ -308,7 +307,7 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
reg_wr.wr.num_sge = 0;
reg_wr.mr = frmr->mr;
reg_wr.key = frmr->mr->lkey;
-   reg_wr.access = frmr->access_flags;
+   reg_wr.access = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE);
reg_wr.wr.next = &read_wr.wr;
 
/* Prepare RDMA_READ */



[PATCH v1 5/8] svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies

2015-11-23 Thread Chuck Lever
To support the NFSv4.1 backchannel on RDMA connections, add a
capability for receiving an RPC/RDMA reply on a connection
established by a client.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |   76 +++
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   60 
 net/sunrpc/xprtrdma/xprt_rdma.h |4 ++
 3 files changed, 140 insertions(+)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c10d969..fef0623 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -946,3 +946,79 @@ repost:
if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, &r_xprt->rx_ep, rep))
rpcrdma_recv_buffer_put(rep);
 }
+
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+
+int
+rpcrdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
+   struct xdr_buf *rcvbuf)
+{
+   struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+   struct kvec *dst, *src = &rcvbuf->head[0];
+   struct rpc_rqst *req;
+   unsigned long cwnd;
+   u32 credits;
+   size_t len;
+   __be32 xid;
+   __be32 *p;
+   int ret;
+
+   p = (__be32 *)src->iov_base;
+   len = src->iov_len;
+   xid = rmsgp->rm_xid;
+
+   pr_info("%s: xid=%08x, length=%zu\n",
+   __func__, be32_to_cpu(xid), len);
+   pr_info("%s: RPC/RDMA: %*ph\n",
+   __func__, (int)RPCRDMA_HDRLEN_MIN, rmsgp);
+   pr_info("%s:  RPC: %*ph\n",
+   __func__, (int)len, p);
+
+   ret = -EAGAIN;
+   if (src->iov_len < 24)
+   goto out_shortreply;
+
+   spin_lock_bh(&xprt->transport_lock);
+   req = xprt_lookup_rqst(xprt, xid);
+   if (!req)
+   goto out_notfound;
+
+   dst = &req->rq_private_buf.head[0];
+   memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf));
+   if (dst->iov_len < len)
+   goto out_unlock;
+   memcpy(dst->iov_base, p, len);
+
+   credits = be32_to_cpu(rmsgp->rm_credit);
+   if (credits == 0)
+   credits = 1;/* don't deadlock */
+   else if (credits > r_xprt->rx_buf.rb_bc_max_requests)
+   credits = r_xprt->rx_buf.rb_bc_max_requests;
+
+   cwnd = xprt->cwnd;
+   xprt->cwnd = credits << RPC_CWNDSHIFT;
+   if (xprt->cwnd > cwnd)
+   xprt_release_rqst_cong(req->rq_task);
+
+   ret = 0;
+   xprt_complete_rqst(req->rq_task, rcvbuf->len);
+   rcvbuf->len = 0;
+
+out_unlock:
+   spin_unlock_bh(&xprt->transport_lock);
+out:
+   return ret;
+
+out_shortreply:
+   pr_info("svcrdma: short bc reply: xprt=%p, len=%zu\n",
+   xprt, src->iov_len);
+   goto out;
+
+out_notfound:
+   pr_info("svcrdma: unrecognized bc reply: xprt=%p, xid=%08x\n",
+   xprt, be32_to_cpu(xid));
+
+   goto out_unlock;
+}
+
+#endif /* CONFIG_SUNRPC_BACKCHANNEL */
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index ff4f01e..2b762b5 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -47,6 +47,7 @@
 #include 
 #include 
 #include 
+#include "xprt_rdma.h"
 
 #define RPCDBG_FACILITYRPCDBG_SVCXPRT
 
@@ -567,6 +568,42 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
return ret;
 }
 
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+
+/* By convention, backchannel calls arrive via rdma_msg type
+ * messages, and never populate the chunk lists. This makes
+ * the RPC/RDMA header small and fixed in size, so it is
+ * straightforward to check the RPC header's direction field.
+ */
+static bool
+svc_rdma_is_backchannel_reply(struct svc_xprt *xprt, struct rpcrdma_msg *rmsgp)
+{
+   __be32 *p = (__be32 *)rmsgp;
+
+   if (!xprt->xpt_bc_xprt)
+   return false;
+
+   if (rmsgp->rm_type != rdma_msg)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[0] != xdr_zero)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[1] != xdr_zero)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[2] != xdr_zero)
+   return false;
+
+   /* sanity */
+   if (p[7] != rmsgp->rm_xid)
+   return false;
+   /* call direction */
+   if (p[8] == cpu_to_be32(RPC_CALL))
+   return false;
+
+   return true;
+}
+
+#endif /* CONFIG_SUNRPC_BACKCHANNEL */
+
 /*
  * Set up the rqstp thread context to point to the RQ buffer. If
  * necessary, pull additional data from the client with an RDMA_READ
@@ -632,6 +669,17 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
goto close_out;
}
 
+#if defined(CONF

[PATCH v1 6/8] xprtrdma: Add class for RDMA backwards direction transport

2015-11-23 Thread Chuck Lever
To support the server-side of an NFSv4.1 backchannel on RDMA
connections, add a transport class for backwards direction
operation.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/xprt.h  |1 
 net/sunrpc/xprt.c|1 
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   14 +-
 net/sunrpc/xprtrdma/transport.c  |  243 ++
 net/sunrpc/xprtrdma/xprt_rdma.h  |2 
 5 files changed, 256 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 69ef5b3..7637ccd 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -85,6 +85,7 @@ struct rpc_rqst {
__u32 * rq_buffer;  /* XDR encode buffer */
size_t  rq_callsize,
rq_rcvsize;
+   void *  rq_privdata; /* xprt-specific per-rqst data */
size_t  rq_xmit_bytes_sent; /* total bytes sent */
size_t  rq_reply_bytes_recvd;   /* total reply bytes */
/* received */
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 2e98f4a..37edea6 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1425,3 +1425,4 @@ void xprt_put(struct rpc_xprt *xprt)
if (atomic_dec_and_test(&xprt->count))
xprt_destroy(xprt);
 }
+EXPORT_SYMBOL_GPL(xprt_put);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 58ec362..3768a7f 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1182,12 +1182,14 @@ static void __svc_rdma_free(struct work_struct *work)
 {
struct svcxprt_rdma *rdma =
container_of(work, struct svcxprt_rdma, sc_work);
-   dprintk("svcrdma: svc_rdma_free(%p)\n", rdma);
+   struct svc_xprt *xprt = &rdma->sc_xprt;
+
+   dprintk("svcrdma: %s(%p)\n", __func__, rdma);
 
/* We should only be called from kref_put */
-   if (atomic_read(&rdma->sc_xprt.xpt_ref.refcount) != 0)
+   if (atomic_read(&xprt->xpt_ref.refcount) != 0)
pr_err("svcrdma: sc_xprt still in use? (%d)\n",
-  atomic_read(&rdma->sc_xprt.xpt_ref.refcount));
+  atomic_read(&xprt->xpt_ref.refcount));
 
/*
 * Destroy queued, but not processed read completions. Note
@@ -1222,6 +1224,12 @@ static void __svc_rdma_free(struct work_struct *work)
pr_err("svcrdma: dma still in use? (%d)\n",
   atomic_read(&rdma->sc_dma_used));
 
+   /* Final put of backchannel client transport */
+   if (xprt->xpt_bc_xprt) {
+   xprt_put(xprt->xpt_bc_xprt);
+   xprt->xpt_bc_xprt = NULL;
+   }
+
/* De-allocate fastreg mr */
rdma_dealloc_frmr_q(rdma);
 
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 8c545f7..fda7488 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -51,6 +51,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "xprt_rdma.h"
 
@@ -148,7 +149,10 @@ static struct ctl_table sunrpc_table[] = {
 #define RPCRDMA_MAX_REEST_TO   (30U * HZ)
 #define RPCRDMA_IDLE_DISC_TO   (5U * 60 * HZ)
 
-static struct rpc_xprt_ops xprt_rdma_procs;/* forward reference */
+static struct rpc_xprt_ops xprt_rdma_procs;
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+static struct rpc_xprt_ops xprt_rdma_bc_procs;
+#endif
 
 static void
 xprt_rdma_format_addresses4(struct rpc_xprt *xprt, struct sockaddr *sap)
@@ -499,7 +503,7 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
if (req == NULL)
return NULL;
 
-   flags = GFP_NOIO | __GFP_NOWARN;
+   flags = RPCRDMA_DEF_GFP;
if (RPC_IS_SWAPPER(task))
flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
 
@@ -684,6 +688,199 @@ xprt_rdma_disable_swap(struct rpc_xprt *xprt)
 {
 }
 
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+
+/* Server-side transport endpoint wants a whole page for its send
+ * buffer. The client RPC code constructs the RPC header in this
+ * buffer before it invokes ->send_request.
+ */
+static void *
+xprt_rdma_bc_allocate(struct rpc_task *task, size_t size)
+{
+   struct rpc_rqst *rqst = task->tk_rqstp;
+   struct svc_rdma_op_ctxt *ctxt;
+   struct svcxprt_rdma *rdma;
+   struct svc_xprt *sxprt;
+   struct page *page;
+
+   if (size > PAGE_SIZE) {
+   WARN_ONCE(1, "failed to handle buffer allocation (size %zu)\n",
+ size);
+   return NULL;
+   }
+
+   page = alloc_page(RPCRDMA_DEF_GFP);
+   if (!page)
+   return NULL;
+
+   sxprt = rqst-&g

[PATCH v1 3/8] svcrdma: Add svc_rdma_get_context() API that is allowed to fail

2015-11-23 Thread Chuck Lever
To support backward direction calls, I'm going to add an
svc_rdma_get_context() call in the client RDMA transport.

Called from ->buf_alloc(), we can't sleep waiting for memory.
So add an API that can get a server op_ctxt but won't sleep.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h  |2 ++
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   28 +++-
 2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 478aa30..0355067 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -222,6 +222,8 @@ extern void svc_rdma_send_error(struct svcxprt_rdma *, 
struct rpcrdma_msg *,
 extern int svc_rdma_post_recv(struct svcxprt_rdma *);
 extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *);
 extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
+extern struct svc_rdma_op_ctxt *svc_rdma_get_context_gfp(struct svcxprt_rdma *,
+gfp_t);
 extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
 extern void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt);
 extern struct svc_rdma_req_map *svc_rdma_get_req_map(void);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 01c7b36..58ec362 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -153,17 +153,35 @@ static void svc_rdma_bc_free(struct svc_xprt *xprt)
 }
 #endif /* CONFIG_SUNRPC_BACKCHANNEL */
 
-struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
+static void svc_rdma_init_context(struct svcxprt_rdma *xprt,
+ struct svc_rdma_op_ctxt *ctxt)
 {
-   struct svc_rdma_op_ctxt *ctxt;
-
-   ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep,
-   GFP_KERNEL | __GFP_NOFAIL);
ctxt->xprt = xprt;
INIT_LIST_HEAD(&ctxt->dto_q);
ctxt->count = 0;
ctxt->frmr = NULL;
atomic_inc(&xprt->sc_ctxt_used);
+}
+
+struct svc_rdma_op_ctxt *svc_rdma_get_context_gfp(struct svcxprt_rdma *xprt,
+ gfp_t flags)
+{
+   struct svc_rdma_op_ctxt *ctxt;
+
+   ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep, flags);
+   if (!ctxt)
+   return NULL;
+   svc_rdma_init_context(xprt, ctxt);
+   return ctxt;
+}
+
+struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
+{
+   struct svc_rdma_op_ctxt *ctxt;
+
+   ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep,
+   GFP_KERNEL | __GFP_NOFAIL);
+   svc_rdma_init_context(xprt, ctxt);
return ctxt;
 }
 
