Re: [PATCH v1 03/18] xprtrdma: Remove completion polling budgets

2015-09-17 Thread Devesh Sharma
On Fri, Sep 18, 2015 at 2:14 AM, Chuck Lever  wrote:
>
> Commit 8301a2c047cc ("xprtrdma: Limit work done by completion
> handler") was supposed to prevent xprtrdma's upcall handlers from
> starving other softIRQ work by letting them return to the provider
> before all CQEs have been polled.
>
> The logic assumes the provider will call the upcall handler again
> immediately if the CQ is re-armed while there are still queued CQEs.
>
> This assumption is invalid. The IBTA spec says that after a CQ is
> armed, the hardware must interrupt only when a new CQE is inserted.
> xprtrdma can't rely on the provider calling again, even though some
> providers do.
>
> Therefore, leaving CQEs on queue makes sense only when there is
> another mechanism that ensures all remaining CQEs are consumed in a
> timely fashion. xprtrdma does not have such a mechanism. If a CQE
> remains queued, the transport can wait forever to send the next RPC.
>
> Finally, move the wcs array back onto the stack to ensure that the
> poll array is always local to the CPU where the completion upcall is
> running.
>
> Fixes: 8301a2c047cc ("xprtrdma: Limit work done by completion ...")
> Signed-off-by: Chuck Lever 
> ---
>  net/sunrpc/xprtrdma/verbs.c |  100 ++-
>  net/sunrpc/xprtrdma/xprt_rdma.h |5 --
>  2 files changed, 45 insertions(+), 60 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 8a477e2..f2e3863 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -158,34 +158,37 @@ rpcrdma_sendcq_process_wc(struct ib_wc *wc)
> }
>  }
>
> -static int
> +/* The wc array is on stack: automatic memory is always CPU-local.
> + *
> + * The common case is a single completion is ready. By asking
> + * for two entries, a return code of 1 means there is exactly
> + * one completion and no more. We don't have to poll again to
> + * know that the CQ is now empty.
> + */
> +static void
>  rpcrdma_sendcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
>  {
> -   struct ib_wc *wcs;
> -   int budget, count, rc;
> +   struct ib_wc *pos, wcs[2];
> +   int count, rc;
>
> -   budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
> do {
> -   wcs = ep->rep_send_wcs;
> +   pos = wcs;
>
> -   rc = ib_poll_cq(cq, RPCRDMA_POLLSIZE, wcs);
> -   if (rc <= 0)
> -   return rc;
> +   rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), pos);
> +   if (rc < 0)
> +   goto out_warn;
>
> count = rc;
> while (count-- > 0)
> -   rpcrdma_sendcq_process_wc(wcs++);
> -   } while (rc == RPCRDMA_POLLSIZE && --budget);
> -   return 0;
> +   rpcrdma_sendcq_process_wc(pos++);
> +   } while (rc == ARRAY_SIZE(wcs));

I think I have missed something; I'm not able to understand the reason
for polling 2 CQEs in one poll. It is possible that in a given poll_cq
call you end up getting only 1 completion, with the other completion
delayed for some reason. Would it be better to poll for 1 in every
poll call, or otherwise have this:
while (rc <= ARRAY_SIZE(wcs) && rc);
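
(The rationale in the patch comment can be seen with a small stand-alone
sketch. This is illustrative user-space C only; fake_cq and fake_poll_cq are
made-up stand-ins for ib_cq and ib_poll_cq. Asking for one entry more than
the common case means a partial return proves the CQ is empty, so no extra
confirming poll is needed.)

/* Illustrative sketch, not kernel code: fake_cq/fake_poll_cq stand in for
 * ib_cq/ib_poll_cq. Polling for 2 entries means a return of 1 says "one
 * completion, and the CQ is now empty" in a single call.
 */
#include <stdio.h>

struct fake_cq { int queued; };

/* Dequeue up to num_entries completions; return how many were consumed. */
static int fake_poll_cq(struct fake_cq *cq, int num_entries)
{
	int n = cq->queued < num_entries ? cq->queued : num_entries;

	cq->queued -= n;
	return n;
}

int main(void)
{
	struct fake_cq cq = { .queued = 5 };
	int polls = 0, rc;

	do {
		rc = fake_poll_cq(&cq, 2);
		polls++;
	} while (rc == 2);	/* rc < 2 means the CQ is drained */

	printf("drained in %d polls\n", polls);
	return 0;
}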

> +   return;
> +
> +out_warn:
> +   pr_warn("RPC:   %s: ib_poll_cq() failed %i\n", __func__, rc);
>  }
>
> -/*
> - * Handle send, fast_reg_mr, and local_inv completions.
> - *
> - * Send events are typically suppressed and thus do not result
> - * in an upcall. Occasionally one is signaled, however. This
> - * prevents the provider's completion queue from wrapping and
> - * losing a completion.
> +/* Handle provider send completion upcalls.
>   */
>  static void
>  rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
> @@ -193,12 +196,7 @@ rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
> struct rpcrdma_ep *ep = (struct rpcrdma_ep *)cq_context;
> int rc;
>
> -   rc = rpcrdma_sendcq_poll(cq, ep);
> -   if (rc) {
> -   dprintk("RPC:   %s: ib_poll_cq failed: %i\n",
> -   __func__, rc);
> -   return;
> -   }
> +   rpcrdma_sendcq_poll(cq, ep);
>
> rc = ib_req_notify_cq(cq,
> IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
> @@ -247,44 +245,41 @@ out_fail:
> goto out_schedule;
>  }
>
> -static int
> +/* The wc array is on stack: automatic memory is always CPU-local.
> + *
> + * struct ib_wc is 64 bytes, making the poll array potentially
> + * large. But this is at the bottom of the call chain. Further
> + * substantial work is done in another thread.
> + */
> +static void
>  rpcrdma_recvcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
>  {
> -   struct list_head sched_list;
> -   struct ib_wc *wcs;
> -   int budget, count, rc;
> +   struct ib_wc *pos, wcs[4];
> +   LIST_HEAD(sched_list);
> +   int count, rc;
>
> -   INIT_LIST_HEAD(&sched_list);
> - 

Re: [PATCH 3/3] IB/mlx4: Report checksum offload cap when query device

2015-09-17 Thread Doug Ledford
On 09/16/2015 11:56 AM, Bodong Wang wrote:
> Signed-off-by: Bodong Wang 
> ---
>  drivers/infiniband/hw/mlx4/main.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
> index 8be6db8..a70ca6a 100644
> --- a/drivers/infiniband/hw/mlx4/main.c
> +++ b/drivers/infiniband/hw/mlx4/main.c
> @@ -217,6 +217,9 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
>   props->device_cap_flags |= IB_DEVICE_MANAGED_FLOW_STEERING;
>   }
>  
> + props->csum_cap.eth_csum_cap |= IB_CSUM_SUPPORT_RAW;
> + props->csum_cap.ib_csum_cap |= IB_CSUM_SUPPORT_UD;
> +
>   props->vendor_id   = be32_to_cpup((__be32 *) (out_mad->data + 
> 36)) &
>   0xff;
>   props->vendor_part_id  = dev->dev->persist->pdev->device;
> 

This patch highlights something I didn't think about on the previous
patch.  Why separate eth/ib if you have per-QP flags?  The QP type denotes
the ib/eth relationship without the need to separate it into two
different caps.  In other words, you can never have an IB QP type on eth
because the only eth QP types we support other than RAW are all RDMA and
not IP.  Really, there are enough spare bits in ib_device_cap_flags that
you could do away with the new caps entirely.  Right now, we support UD
(which we already have a flag for); we can add two flags (for RAW and
RC) and that should cover all of the foreseeable options, as that would
allow us to extend IP CSUM support to cover connected mode and cover all
of the current options.  I don't see us doing IP traffic in any other
situation, so I think that should suffice.  Bits 25 and 26 could be used
for the two new bits.  Then you just need to extend the bits to user space.
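
A rough sketch of what that alternative could look like follows. The flag
names and the reuse of bits 25/26 are illustrative only, not an accepted
kernel API; only the existing UD checksum flag is real today.

/* Hypothetical sketch of the suggestion above: reuse spare device_cap_flags
 * bits rather than adding a per-link-layer capability struct. Names and
 * bit positions are illustrative.
 */
enum sketch_device_cap_flags {
	SKETCH_DEVICE_UD_IP_CSUM	= (1 << 18),	/* existing UD csum flag */
	SKETCH_DEVICE_RAW_IP_CSUM	= (1 << 25),	/* proposed: raw packet QPs */
	SKETCH_DEVICE_RC_IP_CSUM	= (1 << 26),	/* proposed: connected mode */
};

static inline int sketch_dev_supports_raw_csum(unsigned int device_cap_flags)
{
	return (device_cap_flags & SKETCH_DEVICE_RAW_IP_CSUM) != 0;
}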

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD






Re: [PATCH 1/3] IB/core: Add support of checksum capability reporting in ib verbs

2015-09-17 Thread Doug Ledford
On 09/16/2015 11:56 AM, Bodong Wang wrote:
> A new field, csum_cap, is added to ib_query_device. It contains two members:
> eth_csum_cap and ib_csum_cap, indicating the checksum capability of the
> Ethernet and InfiniBand link layers respectively for different QP types.
> 
> Current checksum caps use the following enum members:
> - IB_CSUM_SUPPORT_UD: device supports validation/calculation of csum for UD QP.
> - IB_CSUM_SUPPORT_RAW: device supports validation/calculation of csum for raw QP.
> 
> Signed-off-by: Bodong Wang 
> ---
>  include/rdma/ib_verbs.h | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index b0f898e..94dbaee 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -183,6 +183,11 @@ struct ib_cq_init_attr {
>   u32 flags;
>  };
>  
> +struct ib_csum_cap_per_link {
> + uint32_t  eth_csum_cap;
> + uint32_t  ib_csum_cap;
> +};
> +

I generally don't like to waste this many bits on this little
information.  64 bits total for what only uses 4 bits right now, and
even on the high side would probably only ever use 8 or 12 bits, is
excessive.

That said, it's cleaner and easier to read than something like a double
shift where ib is the lower 16 bits and eth is the upper 16 bits, so I won't
request you change it, just register my eyebrow raise over the number of
bits used to record so little information.  (In fairness, I thought
about making you shrink it down, but the area of the struct you are adding
this to is currently 64bit aligned and it is reasonably likely that the
next item will need 64bit alignment, so saving bits only to possibly
lose them to alignment is an exercise in futility.)
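
For reference, the packed encoding mentioned above might look like the
following. This is a purely illustrative sketch (the helper names are made
up); the posted patch keeps the two-member struct.

#include <stdint.h>

/* Illustrative only: one 32-bit word, IB csum caps in the low 16 bits,
 * Ethernet csum caps in the high 16 bits.
 */
#define CSUM_CAP_ETH_SHIFT	16

static inline uint32_t csum_cap_pack(uint16_t ib_caps, uint16_t eth_caps)
{
	return ((uint32_t)eth_caps << CSUM_CAP_ETH_SHIFT) | ib_caps;
}

static inline uint16_t csum_cap_ib(uint32_t packed)
{
	return packed & 0xffff;
}

static inline uint16_t csum_cap_eth(uint32_t packed)
{
	return packed >> CSUM_CAP_ETH_SHIFT;
}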

>  struct ib_device_attr {
>   u64 fw_ver;
>   __be64  sys_image_guid;
> @@ -229,6 +234,7 @@ struct ib_device_attr {
>   struct ib_odp_caps  odp_caps;
>   uint64_ttimestamp_mask;
>   uint64_thca_core_clock; /* in KHZ */
> + struct ib_csum_cap_per_link csum_cap;
>  };
>  
>  enum ib_mtu {
> @@ -868,6 +874,10 @@ enum ib_qp_create_flags {
>   IB_QP_CREATE_RESERVED_END   = 1 << 31,
>  };
>  
> +enum ib_csum_cap_flags {
> + IB_CSUM_SUPPORT_UD  = (1 << IB_QPT_UD),
> + IB_CSUM_SUPPORT_RAW = (1 << IB_QPT_RAW_PACKET),
> +};
>  
>  /*
>   * Note: users may not call ib_close_qp or ib_destroy_qp from the 
> event_handler
> 


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD






Re: [PATCH] IB/ucma: check workqueue allocation before usage

2015-09-17 Thread Jason Gunthorpe
On Thu, Sep 17, 2015 at 04:04:19PM -0400, Sasha Levin wrote:
> Allocating a workqueue might fail, which wasn't checked so far and would
> lead to NULL ptr derefs when an attempt to use it was made.

Indeed.

Yishai, please check this and check the other patches you've sent to
see if they have a similar error.

Jason
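
The fix pattern being asked for is the usual one. A hedged sketch (not the
actual ucma patch; names are illustrative): fail initialization if the
workqueue cannot be allocated instead of dereferencing a NULL pointer later.

#include <linux/workqueue.h>

static struct workqueue_struct *sketch_wq;

static int sketch_init(void)
{
	sketch_wq = alloc_ordered_workqueue("sketch_wq", WQ_MEM_RECLAIM);
	if (!sketch_wq)
		return -ENOMEM;	/* never use a workqueue that failed to allocate */
	return 0;
}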


[PATCH v1 17/18] svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies

2015-09-17 Thread Chuck Lever
To support the NFSv4.1 backchannel on RDMA connections, add a
capability for receiving an RPC/RDMA reply on a connection
established by a client.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |   76 +++
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   60 
 net/sunrpc/xprtrdma/xprt_rdma.h |4 ++
 3 files changed, 140 insertions(+)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 3830250..b728f6f 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -946,3 +946,79 @@ repost:
if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, &r_xprt->rx_ep, rep))
rpcrdma_recv_buffer_put(rep);
 }
+
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+
+int
+rpcrdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
+   struct xdr_buf *rcvbuf)
+{
+   struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+   struct kvec *dst, *src = &rcvbuf->head[0];
+   struct rpc_rqst *req;
+   unsigned long cwnd;
+   u32 credits;
+   size_t len;
+   __be32 xid;
+   __be32 *p;
+   int ret;
+
+   p = (__be32 *)src->iov_base;
+   len = src->iov_len;
+   xid = rmsgp->rm_xid;
+
+   pr_info("%s: xid=%08x, length=%zu\n",
+   __func__, be32_to_cpu(xid), len);
+   pr_info("%s: RPC/RDMA: %*ph\n",
+   __func__, (int)RPCRDMA_HDRLEN_MIN, rmsgp);
+   pr_info("%s:  RPC: %*ph\n",
+   __func__, (int)len, p);
+
+   ret = -EAGAIN;
+   if (src->iov_len < 24)
+   goto out_shortreply;
+
+   spin_lock_bh(&xprt->transport_lock);
+   req = xprt_lookup_rqst(xprt, xid);
+   if (!req)
+   goto out_notfound;
+
+   dst = &req->rq_private_buf.head[0];
+   memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf));
+   if (dst->iov_len < len)
+   goto out_unlock;
+   memcpy(dst->iov_base, p, len);
+
+   credits = be32_to_cpu(rmsgp->rm_credit);
+   if (credits == 0)
+   credits = 1;/* don't deadlock */
+   else if (credits > r_xprt->rx_buf.rb_bc_max_requests)
+   credits = r_xprt->rx_buf.rb_bc_max_requests;
+
+   cwnd = xprt->cwnd;
+   xprt->cwnd = credits << RPC_CWNDSHIFT;
+   if (xprt->cwnd > cwnd)
+   xprt_release_rqst_cong(req->rq_task);
+
+   ret = 0;
+   xprt_complete_rqst(req->rq_task, rcvbuf->len);
+   rcvbuf->len = 0;
+
+out_unlock:
+   spin_unlock_bh(&xprt->transport_lock);
+out:
+   return ret;
+
+out_shortreply:
+   pr_info("svcrdma: short bc reply: xprt=%p, len=%zu\n",
+   xprt, src->iov_len);
+   goto out;
+
+out_notfound:
+   pr_info("svcrdma: unrecognized bc reply: xprt=%p, xid=%08x\n",
+   xprt, be32_to_cpu(xid));
+
+   goto out_unlock;
+}
+
+#endif /* CONFIG_SUNRPC_BACKCHANNEL */
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 5f6ca47..be75abba 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -47,6 +47,7 @@
 #include 
 #include 
 #include 
+#include "xprt_rdma.h"
 
 #define RPCDBG_FACILITYRPCDBG_SVCXPRT
 
@@ -560,6 +561,42 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
return ret;
 }
 
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+
+/* By convention, backchannel calls arrive via rdma_msg type
+ * messages, and never populate the chunk lists. This makes
+ * the RPC/RDMA header small and fixed in size, so it is
+ * straightforward to check the RPC header's direction field.
+ */
+static bool
+svc_rdma_is_backchannel_reply(struct svc_xprt *xprt, struct rpcrdma_msg *rmsgp)
+{
+   __be32 *p = (__be32 *)rmsgp;
+
+   if (!xprt->xpt_bc_xprt)
+   return false;
+
+   if (rmsgp->rm_type != rdma_msg)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[0] != xdr_zero)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[1] != xdr_zero)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[2] != xdr_zero)
+   return false;
+
+   /* sanity */
+   if (p[7] != rmsgp->rm_xid)
+   return false;
+   /* call direction */
+   if (p[8] == cpu_to_be32(RPC_CALL))
+   return false;
+
+   return true;
+}
+
+#endif /* CONFIG_SUNRPC_BACKCHANNEL */
+
 /*
  * Set up the rqstp thread context to point to the RQ buffer. If
  * necessary, pull additional data from the client with an RDMA_READ
@@ -625,6 +662,17 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
goto close_out;
}
 
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+   if (svc_rdma_is_backchannel_reply(xprt, rmsgp)) {
+   ret = rpcrdma_handle_bc_reply(xprt->xpt_bc_xprt, rmsgp,
+ &rqstp->rq_arg);
+   svc_rdma_put_con

[PATCH v1 18/18] xprtrdma: Add class for RDMA backwards direction transport

2015-09-17 Thread Chuck Lever
To support the server-side of an NFSv4.1 backchannel on RDMA
connections, add a transport class for backwards direction
operation.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/xprt.h  |1 
 net/sunrpc/xprt.c|1 
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   14 +-
 net/sunrpc/xprtrdma/transport.c  |  243 ++
 net/sunrpc/xprtrdma/xprt_rdma.h  |2 
 5 files changed, 256 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 6156491..4f1b0b6 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -83,6 +83,7 @@ struct rpc_rqst {
__u32 * rq_buffer;  /* XDR encode buffer */
size_t  rq_callsize,
rq_rcvsize;
+   void *  rq_privdata; /* xprt-specific per-rqst data */
size_t  rq_xmit_bytes_sent; /* total bytes sent */
size_t  rq_reply_bytes_recvd;   /* total reply bytes */
/* received */
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index ab5dd62..9480354 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1419,3 +1419,4 @@ void xprt_put(struct rpc_xprt *xprt)
if (atomic_dec_and_test(&xprt->count))
xprt_destroy(xprt);
 }
+EXPORT_SYMBOL_GPL(xprt_put);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index c4083a3..6bd4c1e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1194,12 +1194,14 @@ static void __svc_rdma_free(struct work_struct *work)
 {
struct svcxprt_rdma *rdma =
container_of(work, struct svcxprt_rdma, sc_work);
-   dprintk("svcrdma: svc_rdma_free(%p)\n", rdma);
+   struct svc_xprt *xprt = &rdma->sc_xprt;
+
+   dprintk("svcrdma: %s(%p)\n", __func__, rdma);
 
/* We should only be called from kref_put */
-   if (atomic_read(&rdma->sc_xprt.xpt_ref.refcount) != 0)
+   if (atomic_read(&xprt->xpt_ref.refcount) != 0)
pr_err("svcrdma: sc_xprt still in use? (%d)\n",
-  atomic_read(&rdma->sc_xprt.xpt_ref.refcount));
+  atomic_read(&xprt->xpt_ref.refcount));
 
/*
 * Destroy queued, but not processed read completions. Note
@@ -1234,6 +1236,12 @@ static void __svc_rdma_free(struct work_struct *work)
pr_err("svcrdma: dma still in use? (%d)\n",
   atomic_read(&rdma->sc_dma_used));
 
+   /* Final put of backchannel client transport */
+   if (xprt->xpt_bc_xprt) {
+   xprt_put(xprt->xpt_bc_xprt);
+   xprt->xpt_bc_xprt = NULL;
+   }
+
/* De-allocate fastreg mr */
rdma_dealloc_frmr_q(rdma);
 
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 7d6c06f..1030425 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -51,6 +51,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "xprt_rdma.h"
 
@@ -148,7 +149,10 @@ static struct ctl_table sunrpc_table[] = {
 #define RPCRDMA_MAX_REEST_TO   (30U * HZ)
 #define RPCRDMA_IDLE_DISC_TO   (5U * 60 * HZ)
 
-static struct rpc_xprt_ops xprt_rdma_procs;/* forward reference */
+static struct rpc_xprt_ops xprt_rdma_procs;
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+static struct rpc_xprt_ops xprt_rdma_bc_procs;
+#endif
 
 static void
 xprt_rdma_format_addresses4(struct rpc_xprt *xprt, struct sockaddr *sap)
@@ -500,7 +504,7 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
if (req == NULL)
return NULL;
 
-   flags = GFP_NOIO | __GFP_NOWARN;
+   flags = RPCRDMA_DEF_GFP;
if (RPC_IS_SWAPPER(task))
flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
 
@@ -685,6 +689,199 @@ xprt_rdma_disable_swap(struct rpc_xprt *xprt)
 {
 }
 
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+
+/* Server-side transport endpoint wants a whole page for its send
+ * buffer. The client RPC code constructs the RPC header in this
+ * buffer before it invokes ->send_request.
+ */
+static void *
+xprt_rdma_bc_allocate(struct rpc_task *task, size_t size)
+{
+   struct rpc_rqst *rqst = task->tk_rqstp;
+   struct svc_rdma_op_ctxt *ctxt;
+   struct svcxprt_rdma *rdma;
+   struct svc_xprt *sxprt;
+   struct page *page;
+
+   if (size > PAGE_SIZE) {
+   WARN_ONCE(1, "failed to handle buffer allocation (size %zu)\n",
+ size);
+   return NULL;
+   }
+
+   page = alloc_page(RPCRDMA_DEF_GFP);
+   if (!page)
+   return NULL;
+
+   sxprt = rqst->rq_xprt->bc_xprt;
+   rdma = container_of(sxprt, struct svcxprt_rdma, sc_xprt);
+   ctxt = svc_rdma_get_context_gfp(rdma, RPCRDMA_DEF_GF

[PATCH v1 16/18] svcrdma: Add infrastructure to send backwards direction RPC/RDMA calls

2015-09-17 Thread Chuck Lever
To support the NFSv4.1 backchannel on RDMA connections, add a
mechanism for sending a backwards-direction RPC/RDMA call on a
connection established by a client.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h   |2 +
 net/sunrpc/xprtrdma/svc_rdma_sendto.c |   63 +
 2 files changed, 65 insertions(+)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 2500dd1..42262dd 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -216,6 +216,8 @@ extern int rdma_read_chunk_frmr(struct svcxprt_rdma *, 
struct svc_rqst *,
 extern int svc_rdma_sendto(struct svc_rqst *);
 extern struct rpcrdma_read_chunk *
svc_rdma_get_read_chunk(struct rpcrdma_msg *);
+extern int svc_rdma_bc_post_send(struct svcxprt_rdma *,
+struct svc_rdma_op_ctxt *, struct xdr_buf *);
 
 /* svc_rdma_transport.c */
 extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 1dfae83..0bda3a5 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -641,3 +641,66 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
svc_rdma_put_context(ctxt, 0);
return ret;
 }
+
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+/* Send a backwards direction RPC call.
+ *
+ * Caller holds the connection's mutex and has already marshaled the
+ * RPC/RDMA request. Before sending the request, this API also posts
+ * an extra receive buffer to catch the bc reply for this request.
+ */
+int svc_rdma_bc_post_send(struct svcxprt_rdma *rdma,
+ struct svc_rdma_op_ctxt *ctxt, struct xdr_buf *sndbuf)
+{
+   struct svc_rdma_req_map *vec;
+   struct ib_send_wr send_wr;
+   int ret;
+
+   vec = svc_rdma_get_req_map();
+   ret = map_xdr(rdma, sndbuf, vec);
+   if (ret)
+   goto out;
+
+   /* Post a recv buffer to handle reply for this request */
+   ret = svc_rdma_post_recv(rdma);
+   if (ret) {
+   pr_err("svcrdma: Failed to post bc receive buffer, err=%d. "
+  "Closing transport %p.\n", ret, rdma);
+   set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
+   ret = -ENOTCONN;
+   goto out;
+   }
+
+   ctxt->wr_op = IB_WR_SEND;
+   ctxt->direction = DMA_TO_DEVICE;
+   ctxt->sge[0].lkey = rdma->sc_dma_lkey;
+   ctxt->sge[0].length = sndbuf->len;
+   ctxt->sge[0].addr =
+   ib_dma_map_page(rdma->sc_cm_id->device, ctxt->pages[0], 0,
+   sndbuf->len, DMA_TO_DEVICE);
+   if (ib_dma_mapping_error(rdma->sc_cm_id->device, ctxt->sge[0].addr)) {
+   svc_rdma_unmap_dma(ctxt);
+   ret = -EIO;
+   goto out;
+   }
+   atomic_inc(&rdma->sc_dma_used);
+
+   memset(&send_wr, 0, sizeof send_wr);
+   send_wr.wr_id = (unsigned long)ctxt;
+   send_wr.sg_list = ctxt->sge;
+   send_wr.num_sge = 1;
+   send_wr.opcode = IB_WR_SEND;
+   send_wr.send_flags = IB_SEND_SIGNALED;
+
+   ret = svc_rdma_send(rdma, &send_wr);
+   if (ret) {
+   svc_rdma_unmap_dma(ctxt);
+   ret = -EIO;
+   goto out;
+   }
+out:
+   svc_rdma_put_req_map(vec);
+   pr_info("svcrdma: %s returns %d\n", __func__, ret);
+   return ret;
+}
+#endif /* CONFIG_SUNRPC_BACKCHANNEL */



[PATCH v1 13/18] NFS: Enable client side NFSv4.1 backchannel to use other transports

2015-09-17 Thread Chuck Lever
Pass the correct backchannel transport class to svc_create_xprt()
when setting up an NFSv4.1 backchannel transport.

Signed-off-by: Chuck Lever 
---
 fs/nfs/callback.c   |   33 +
 include/linux/sunrpc/xprt.h |1 +
 net/sunrpc/xprtrdma/transport.c |1 +
 net/sunrpc/xprtsock.c   |1 +
 4 files changed, 24 insertions(+), 12 deletions(-)

diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
index 75f7c0a..46ed2c5 100644
--- a/fs/nfs/callback.c
+++ b/fs/nfs/callback.c
@@ -99,15 +99,22 @@ nfs4_callback_up(struct svc_serv *serv)
 }
 
 #if defined(CONFIG_NFS_V4_1)
-static int nfs41_callback_up_net(struct svc_serv *serv, struct net *net)
+/*
+ * Create an svc_sock for the back channel service that shares the
+ * fore channel connection.
+ * Returns the input port (0) and sets the svc_serv bc_xprt on success
+ */
+static int nfs41_callback_up_net(struct svc_serv *serv, struct net *net,
+struct rpc_xprt *xprt)
 {
-   /*
-* Create an svc_sock for the back channel service that shares the
-* fore channel connection.
-* Returns the input port (0) and sets the svc_serv bc_xprt on success
-*/
-   return svc_create_xprt(serv, "tcp-bc", net, PF_INET, 0,
- SVC_SOCK_ANONYMOUS);
+   int ret = -EPROTONOSUPPORT;
+
+   if (xprt->bc_name)
+   ret = svc_create_xprt(serv, xprt->bc_name, net, PF_INET, 0,
+ SVC_SOCK_ANONYMOUS);
+   dprintk("NFS: svc_create_xprt(%s) returned %d\n",
+   xprt->bc_name, ret);
+   return ret;
 }
 
 /*
@@ -184,7 +191,8 @@ static inline void nfs_callback_bc_serv(u32 minorversion, 
struct rpc_xprt *xprt,
xprt->bc_serv = serv;
 }
 #else
-static int nfs41_callback_up_net(struct svc_serv *serv, struct net *net)
+static int nfs41_callback_up_net(struct svc_serv *serv, struct net *net,
+struct rpc_xprt *xprt)
 {
return 0;
 }
@@ -259,7 +267,8 @@ static void nfs_callback_down_net(u32 minorversion, struct 
svc_serv *serv, struc
svc_shutdown_net(serv, net);
 }
 
-static int nfs_callback_up_net(int minorversion, struct svc_serv *serv, struct 
net *net)
+static int nfs_callback_up_net(int minorversion, struct svc_serv *serv,
+  struct net *net, struct rpc_xprt *xprt)
 {
struct nfs_net *nn = net_generic(net, nfs_net_id);
int ret;
@@ -281,7 +290,7 @@ static int nfs_callback_up_net(int minorversion, struct 
svc_serv *serv, struct n
break;
case 1:
case 2:
-   ret = nfs41_callback_up_net(serv, net);
+   ret = nfs41_callback_up_net(serv, net, xprt);
break;
default:
printk(KERN_ERR "NFS: unknown callback version: %d\n",
@@ -364,7 +373,7 @@ int nfs_callback_up(u32 minorversion, struct rpc_xprt *xprt)
goto err_create;
}
 
-   ret = nfs_callback_up_net(minorversion, serv, net);
+   ret = nfs_callback_up_net(minorversion, serv, net, xprt);
if (ret < 0)
goto err_net;
 
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 025198d..6156491 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -168,6 +168,7 @@ struct rpc_xprt {
struct sockaddr_storage addr;   /* server address */
size_t  addrlen;/* size of server address */
int prot;   /* IP protocol */
+   char*bc_name;   /* backchannel transport */
 
unsigned long   cong;   /* current congestion */
unsigned long   cwnd;   /* congestion window */
diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index e3871a6..7d6c06f 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -337,6 +337,7 @@ xprt_setup_rdma(struct xprt_create *args)
/* Ensure xprt->addr holds valid server TCP (not RDMA)
 * address, for any side protocols which peek at it */
xprt->prot = IPPROTO_TCP;
+   xprt->bc_name = "rdma-bc";
xprt->addrlen = args->addrlen;
memcpy(&xprt->addr, sap, xprt->addrlen);
 
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index d2ad732..3ff123d 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2851,6 +2851,7 @@ static struct rpc_xprt *xs_setup_tcp(struct xprt_create 
*args)
transport = container_of(xprt, struct sock_xprt, xprt);
 
xprt->prot = IPPROTO_TCP;
+   xprt->bc_name = "tcp-bc";
xprt->tsh_size = sizeof(rpc_fraghdr) / sizeof(u32);
xprt->max_payload = RPC_MAX_FRAGMENT_SIZE;
 


[PATCH v1 07/18] xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers

2015-09-17 Thread Chuck Lever
xprtrdma's backward direction send and receive buffers are the same
size as the forechannel's inline threshold, and must be pre-
registered.

The consumer has no control over which receive buffer the adapter
chooses to catch an incoming backwards-direction call. Any receive
buffer can be used for either a forward reply or a backward call.
Thus both types of RPC message must be the same size.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/Makefile  |1 
 net/sunrpc/xprtrdma/backchannel.c |  204 +
 net/sunrpc/xprtrdma/transport.c   |7 +
 net/sunrpc/xprtrdma/verbs.c   |   92 ++---
 net/sunrpc/xprtrdma/xprt_rdma.h   |   20 
 5 files changed, 309 insertions(+), 15 deletions(-)
 create mode 100644 net/sunrpc/xprtrdma/backchannel.c

diff --git a/net/sunrpc/xprtrdma/Makefile b/net/sunrpc/xprtrdma/Makefile
index 48913de..33f99d3 100644
--- a/net/sunrpc/xprtrdma/Makefile
+++ b/net/sunrpc/xprtrdma/Makefile
@@ -5,3 +5,4 @@ rpcrdma-y := transport.o rpc_rdma.o verbs.o \
svc_rdma.o svc_rdma_transport.o \
svc_rdma_marshal.o svc_rdma_sendto.o svc_rdma_recvfrom.o \
module.o
+rpcrdma-$(CONFIG_SUNRPC_BACKCHANNEL) += backchannel.o
diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
new file mode 100644
index 000..c0a42ad
--- /dev/null
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -0,0 +1,204 @@
+/*
+ * Copyright (c) 2015 Oracle.  All rights reserved.
+ *
+ * Support for backward direction RPCs on RPC/RDMA.
+ */
+
+#include 
+
+#include "xprt_rdma.h"
+
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+# define RPCDBG_FACILITY   RPCDBG_TRANS
+#endif
+
+static void rpcrdma_bc_free_rqst(struct rpcrdma_xprt *r_xprt,
+struct rpc_rqst *rqst)
+{
+   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
+   struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
+
+   spin_lock(&buf->rb_reqslock);
+   list_del(&req->rl_all);
+   spin_unlock(&buf->rb_reqslock);
+
+   rpcrdma_destroy_req(&r_xprt->rx_ia, req);
+
+   kfree(rqst);
+}
+
+static int rpcrdma_bc_setup_rqst(struct rpcrdma_xprt *r_xprt,
+struct rpc_rqst *rqst)
+{
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_regbuf *rb;
+   struct rpcrdma_req *req;
+   struct xdr_buf *buf;
+   size_t size;
+
+   req = rpcrdma_create_req(r_xprt);
+   if (!req)
+   return -ENOMEM;
+   req->rl_backchannel = true;
+
+   size = RPCRDMA_INLINE_WRITE_THRESHOLD(rqst);
+   rb = rpcrdma_alloc_regbuf(ia, size, GFP_KERNEL);
+   if (IS_ERR(rb))
+   goto out_fail;
+   req->rl_rdmabuf = rb;
+
+   size += RPCRDMA_INLINE_READ_THRESHOLD(rqst);
+   rb = rpcrdma_alloc_regbuf(ia, size, GFP_KERNEL);
+   if (IS_ERR(rb))
+   goto out_fail;
+   rb->rg_owner = req;
+   req->rl_sendbuf = rb;
+   /* so that rpcr_to_rdmar works when receiving a request */
+   rqst->rq_buffer = (void *)req->rl_sendbuf->rg_base;
+
+   buf = &rqst->rq_snd_buf;
+   buf->head[0].iov_base = rqst->rq_buffer;
+   buf->head[0].iov_len = 0;
+   buf->tail[0].iov_base = NULL;
+   buf->tail[0].iov_len = 0;
+   buf->page_len = 0;
+   buf->len = 0;
+   buf->buflen = size;
+
+   return 0;
+
+out_fail:
+   rpcrdma_bc_free_rqst(r_xprt, rqst);
+   return -ENOMEM;
+}
+
+/* Allocate and add receive buffers to the rpcrdma_buffer's existing
+ * list of rep's. These are released when the transport is destroyed. */
+static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
+unsigned int count)
+{
+   struct rpcrdma_buffer *buffers = &r_xprt->rx_buf;
+   struct rpcrdma_rep *rep;
+   unsigned long flags;
+   int rc = 0;
+
+   while (count--) {
+   rep = rpcrdma_create_rep(r_xprt);
+   if (IS_ERR(rep)) {
+   pr_err("RPC:   %s: reply buffer alloc failed\n",
+  __func__);
+   rc = PTR_ERR(rep);
+   break;
+   }
+
+   spin_lock_irqsave(&buffers->rb_lock, flags);
+   list_add(&rep->rr_list, &buffers->rb_recv_bufs);
+   spin_unlock_irqrestore(&buffers->rb_lock, flags);
+   }
+
+   return rc;
+}
+
+/**
+ * xprt_rdma_bc_setup - Pre-allocate resources for handling backchannel 
requests
+ * @xprt: transport associated with these backchannel resources
+ * @reqs: number of concurrent incoming requests to expect
+ *
+ * Returns 0 on success; otherwise a negative errno
+ */
+int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int reqs)
+{
+   struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+   struct rpcrdma_buffer *buffer = &r_xprt->rx_buf;
+   struct rpc_rqst *rqst;
+   unsigned int i;
+   int rc;
+
+   /* The backchannel reply path returns each rpc_rqst to t

[PATCH v1 12/18] SUNRPC: Remove the TCP-only restriction in bc_svc_process()

2015-09-17 Thread Chuck Lever
Allow the use of other transport classes when handling a backward
direction RPC call.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/svc.c |5 -
 1 file changed, 5 deletions(-)

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index a8f579d..bc5b7b5 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -1367,11 +1367,6 @@ bc_svc_process(struct svc_serv *serv, struct rpc_rqst 
*req,
/* reset result send buffer "put" position */
resv->iov_len = 0;
 
-   if (rqstp->rq_prot != IPPROTO_TCP) {
-   printk(KERN_ERR "No support for Non-TCP transports!\n");
-   BUG();
-   }
-
/*
 * Skip the next two words because they've already been
 * processed in the transport



[PATCH v1 09/18] xprtrdma: Add support for sending backward direction RPC replies

2015-09-17 Thread Chuck Lever
Backward direction RPC replies are sent via the client transport's
send_request method, the same way forward direction RPC calls are
sent.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/backchannel.c |   45 +
 net/sunrpc/xprtrdma/rpc_rdma.c|5 
 net/sunrpc/xprtrdma/xprt_rdma.h   |1 +
 3 files changed, 51 insertions(+)

diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index f5c7122..cc9c762 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -168,6 +168,51 @@ out_err:
 }
 
 /**
+ * rpcrdma_bc_marshal_reply - Send backwards direction reply
+ * @rqst: buffer containing RPC reply data
+ *
+ * Returns zero on success.
+ */
+int rpcrdma_bc_marshal_reply(struct rpc_rqst *rqst)
+{
+   struct rpc_xprt *xprt = rqst->rq_xprt;
+   struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+   struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
+   struct rpcrdma_msg *headerp;
+   size_t rpclen;
+
+   headerp = rdmab_to_msg(req->rl_rdmabuf);
+   headerp->rm_xid = rqst->rq_xid;
+   headerp->rm_vers = rpcrdma_version;
+   headerp->rm_credit =
+   cpu_to_be32(r_xprt->rx_buf.rb_bc_srv_max_requests);
+   headerp->rm_type = rdma_msg;
+   headerp->rm_body.rm_chunks[0] = xdr_zero;
+   headerp->rm_body.rm_chunks[1] = xdr_zero;
+   headerp->rm_body.rm_chunks[2] = xdr_zero;
+
+   rpclen = rqst->rq_svec[0].iov_len;
+
+   pr_info("RPC:   %s: rpclen %zd headerp 0x%p lkey 0x%x\n",
+   __func__, rpclen, headerp, rdmab_lkey(req->rl_rdmabuf));
+   pr_info("RPC:   %s: RPC/RDMA: %*ph\n",
+   __func__, (int)RPCRDMA_HDRLEN_MIN, headerp);
+   pr_info("RPC:   %s:  RPC: %*ph\n",
+   __func__, (int)rpclen, rqst->rq_svec[0].iov_base);
+
+   req->rl_send_iov[0].addr = rdmab_addr(req->rl_rdmabuf);
+   req->rl_send_iov[0].length = RPCRDMA_HDRLEN_MIN;
+   req->rl_send_iov[0].lkey = rdmab_lkey(req->rl_rdmabuf);
+
+   req->rl_send_iov[1].addr = rdmab_addr(req->rl_sendbuf);
+   req->rl_send_iov[1].length = rpclen;
+   req->rl_send_iov[1].lkey = rdmab_lkey(req->rl_sendbuf);
+
+   req->rl_niovs = 2;
+   return 0;
+}
+
+/**
  * xprt_rdma_bc_destroy - Release resources for handling backchannel requests
  * @xprt: transport associated with these backchannel resources
  * @reqs: number of incoming requests to destroy; ignored
diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index 287c874..d0dbbf7 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -441,6 +441,11 @@ rpcrdma_marshal_req(struct rpc_rqst *rqst)
enum rpcrdma_chunktype rtype, wtype;
struct rpcrdma_msg *headerp;
 
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+   if (test_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state))
+   return rpcrdma_bc_marshal_reply(rqst);
+#endif
+
/*
 * rpclen gets amount of data in first buffer, which is the
 * pre-registered buffer.
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 37d0d7f..a59ce18 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -516,6 +516,7 @@ void xprt_rdma_cleanup(void);
 #if defined(CONFIG_SUNRPC_BACKCHANNEL)
 int xprt_rdma_bc_setup(struct rpc_xprt *, unsigned int);
 int rpcrdma_bc_post_recv(struct rpcrdma_xprt *, unsigned int);
+int rpcrdma_bc_marshal_reply(struct rpc_rqst *);
 void xprt_rdma_bc_free_rqst(struct rpc_rqst *);
 void xprt_rdma_bc_destroy(struct rpc_xprt *, unsigned int);
 #endif /* CONFIG_SUNRPC_BACKCHANNEL */



[PATCH v1 11/18] svcrdma: Add backward direction service for RPC/RDMA transport

2015-09-17 Thread Chuck Lever
On NFSv4.1 mount points, the Linux NFS client uses this transport
endpoint to receive backward direction calls and route replies back
to the NFSv4.1 server.

Signed-off-by: Chuck Lever 
Acked-by: "J. Bruce Fields" 
---
 include/linux/sunrpc/svc_rdma.h  |6 +++
 include/linux/sunrpc/xprt.h  |1 +
 net/sunrpc/xprtrdma/svc_rdma.c   |6 +++
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   58 ++
 4 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 7ccc961..fb4013e 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -228,9 +228,13 @@ extern void svc_rdma_put_frmr(struct svcxprt_rdma *,
  struct svc_rdma_fastreg_mr *);
 extern void svc_sq_reap(struct svcxprt_rdma *);
 extern void svc_rq_reap(struct svcxprt_rdma *);
-extern struct svc_xprt_class svc_rdma_class;
 extern void svc_rdma_prep_reply_hdr(struct svc_rqst *);
 
+extern struct svc_xprt_class svc_rdma_class;
+#ifdef CONFIG_SUNRPC_BACKCHANNEL
+extern struct svc_xprt_class svc_rdma_bc_class;
+#endif
+
 /* svc_rdma.c */
 extern int svc_rdma_init(void);
 extern void svc_rdma_cleanup(void);
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 81e3433..025198d 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -156,6 +156,7 @@ enum xprt_transports {
XPRT_TRANSPORT_TCP  = IPPROTO_TCP,
XPRT_TRANSPORT_BC_TCP   = IPPROTO_TCP | XPRT_TRANSPORT_BC,
XPRT_TRANSPORT_RDMA = 256,
+   XPRT_TRANSPORT_BC_RDMA  = XPRT_TRANSPORT_RDMA | XPRT_TRANSPORT_BC,
XPRT_TRANSPORT_LOCAL= 257,
 };
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
index 2cd252f..1b7051b 100644
--- a/net/sunrpc/xprtrdma/svc_rdma.c
+++ b/net/sunrpc/xprtrdma/svc_rdma.c
@@ -239,6 +239,9 @@ void svc_rdma_cleanup(void)
unregister_sysctl_table(svcrdma_table_header);
svcrdma_table_header = NULL;
}
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+   svc_unreg_xprt_class(&svc_rdma_bc_class);
+#endif
svc_unreg_xprt_class(&svc_rdma_class);
kmem_cache_destroy(svc_rdma_map_cachep);
kmem_cache_destroy(svc_rdma_ctxt_cachep);
@@ -286,6 +289,9 @@ int svc_rdma_init(void)
 
/* Register RDMA with the SVC transport switch */
svc_reg_xprt_class(&svc_rdma_class);
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+   svc_reg_xprt_class(&svc_rdma_bc_class);
+#endif
return 0;
  err1:
kmem_cache_destroy(svc_rdma_map_cachep);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index fcc3eb8..a133b1e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -56,6 +56,7 @@
 
 #define RPCDBG_FACILITYRPCDBG_SVCXPRT
 
+static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *, int);
 static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
struct net *net,
struct sockaddr *sa, int salen,
@@ -95,6 +96,63 @@ struct svc_xprt_class svc_rdma_class = {
.xcl_ident = XPRT_TRANSPORT_RDMA,
 };
 
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+static struct svc_xprt *svc_rdma_bc_create(struct svc_serv *, struct net *,
+  struct sockaddr *, int, int);
+static void svc_rdma_bc_detach(struct svc_xprt *);
+static void svc_rdma_bc_free(struct svc_xprt *);
+
+static struct svc_xprt_ops svc_rdma_bc_ops = {
+   .xpo_create = svc_rdma_bc_create,
+   .xpo_detach = svc_rdma_bc_detach,
+   .xpo_free = svc_rdma_bc_free,
+   .xpo_prep_reply_hdr = svc_rdma_prep_reply_hdr,
+   .xpo_secure_port = svc_rdma_secure_port,
+};
+
+struct svc_xprt_class svc_rdma_bc_class = {
+   .xcl_name = "rdma-bc",
+   .xcl_owner = THIS_MODULE,
+   .xcl_ops = &svc_rdma_bc_ops,
+   .xcl_max_payload = (1024 - RPCRDMA_HDRLEN_MIN)
+};
+
+static struct svc_xprt *svc_rdma_bc_create(struct svc_serv *serv,
+  struct net *net,
+  struct sockaddr *sa, int salen,
+  int flags)
+{
+   struct svcxprt_rdma *cma_xprt;
+   struct svc_xprt *xprt;
+
+   cma_xprt = rdma_create_xprt(serv, 0);
+   if (!cma_xprt)
+   return ERR_PTR(-ENOMEM);
+   xprt = &cma_xprt->sc_xprt;
+
+   svc_xprt_init(net, &svc_rdma_bc_class, xprt, serv);
+   serv->sv_bc_xprt = xprt;
+
+   dprintk("svcrdma: %s(%p)\n", __func__, xprt);
+   return xprt;
+}
+
+static void svc_rdma_bc_detach(struct svc_xprt *xprt)
+{
+   dprintk("svcrdma: %s(%p)\n", __func__, xprt);
+}
+
+static void svc_rdma_bc_free(struct svc_xprt *xprt)
+{
+   struct svcxprt_rdma *rdma =
+   container_of(xprt, stru

[PATCH v1 10/18] xprtrdma: Handle incoming backward direction RPC calls

2015-09-17 Thread Chuck Lever
Introduce a code path in the rpcrdma_reply_handler() to catch
incoming backward direction RPC calls and route them to the ULP's
backchannel server.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/backchannel.c |  115 +
 net/sunrpc/xprtrdma/rpc_rdma.c|   41 +
 net/sunrpc/xprtrdma/xprt_rdma.h   |2 +
 3 files changed, 158 insertions(+)

diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index cc9c762..2eee18a 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -5,6 +5,8 @@
  */
 
 #include 
+#include 
+#include 
 
 #include "xprt_rdma.h"
 
@@ -12,6 +14,8 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
+#define RPCRDMA_BACKCHANNEL_DEBUG
+
 static void rpcrdma_bc_free_rqst(struct rpcrdma_xprt *r_xprt,
 struct rpc_rqst *rqst)
 {
@@ -251,3 +255,114 @@ void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
list_add_tail(&rqst->rq_bc_pa_list, &xprt->bc_pa_list);
spin_unlock_bh(&xprt->bc_pa_lock);
 }
+
+/**
+ * rpcrdma_bc_receive_call - Handle a backward direction call
+ * @xprt: transport receiving the call
+ * @rep: receive buffer containing the call
+ *
+ * Called in the RPC reply handler, which runs in a tasklet.
+ * Be quick about it.
+ *
+ * Operational assumptions:
+ *o Backchannel credits are ignored, just as the NFS server
+ *  forechannel currently does
+ *o The ULP manages a replay cache (eg, NFSv4.1 sessions).
+ *  No replay detection is done at the transport level
+ */
+void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt,
+struct rpcrdma_rep *rep)
+{
+   struct rpc_xprt *xprt = &r_xprt->rx_xprt;
+   struct rpcrdma_msg *headerp;
+   struct svc_serv *bc_serv;
+   struct rpcrdma_req *req;
+   struct rpc_rqst *rqst;
+   struct xdr_buf *buf;
+   size_t size;
+   __be32 *p;
+
+   headerp = rdmab_to_msg(rep->rr_rdmabuf);
+#ifdef RPCRDMA_BACKCHANNEL_DEBUG
+   pr_info("RPC:   %s: callback XID %08x, length=%u\n",
+   __func__, be32_to_cpu(headerp->rm_xid), rep->rr_len);
+   pr_info("RPC:   %s: %*ph\n", __func__, rep->rr_len, headerp);
+#endif
+
+   /* Sanity check:
+* Need at least enough bytes for RPC/RDMA header, as code
+* here references the header fields by array offset. Also,
+* backward calls are always inline, so ensure there
+* are some bytes beyond the RPC/RDMA header.
+*/
+   if (rep->rr_len < RPCRDMA_HDRLEN_MIN + 24)
+   goto out_short;
+   p = (__be32 *)((unsigned char *)headerp + RPCRDMA_HDRLEN_MIN);
+   size = rep->rr_len - RPCRDMA_HDRLEN_MIN;
+
+   /* Grab a free bc rqst */
+   spin_lock(&xprt->bc_pa_lock);
+   if (list_empty(&xprt->bc_pa_list)) {
+   spin_unlock(&xprt->bc_pa_lock);
+   goto out_overflow;
+   }
+   rqst = list_first_entry(&xprt->bc_pa_list,
+   struct rpc_rqst, rq_bc_pa_list);
+   list_del(&rqst->rq_bc_pa_list);
+   spin_unlock(&xprt->bc_pa_lock);
+#ifdef RPCRDMA_BACKCHANNEL_DEBUG
+   pr_info("RPC:   %s: using rqst %p\n", __func__, rqst);
+#endif
+
+   /* Prepare rqst */
+   rqst->rq_reply_bytes_recvd = 0;
+   rqst->rq_bytes_sent = 0;
+   rqst->rq_xid = headerp->rm_xid;
+   set_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state);
+
+   buf = &rqst->rq_rcv_buf;
+   memset(buf, 0, sizeof(*buf));
+   buf->head[0].iov_base = p;
+   buf->head[0].iov_len = size;
+   buf->len = size;
+
+   /* The receive buffer has to be hooked to the rpcrdma_req so that
+* it can be reposted after the server is done parsing it but just
+* before sending the backward direction reply. */
+   req = rpcr_to_rdmar(rqst);
+#ifdef RPCRDMA_BACKCHANNEL_DEBUG
+   pr_info("RPC:   %s: attaching rep %p to req %p\n",
+   __func__, rep, req);
+#endif
+   req->rl_reply = rep;
+
+   /* Defeat the retransmit detection logic in send_request */
+   req->rl_connect_cookie = 0;
+
+   /* Queue rqst for ULP's callback service */
+   bc_serv = xprt->bc_serv;
+   spin_lock(&bc_serv->sv_cb_lock);
+   list_add(&rqst->rq_bc_list, &bc_serv->sv_cb_list);
+   spin_unlock(&bc_serv->sv_cb_lock);
+
+   wake_up(&bc_serv->sv_cb_waitq);
+
+   r_xprt->rx_stats.bcall_count++;
+   return;
+
+out_overflow:
+   pr_warn("RPC/RDMA backchannel overflow\n");
+   xprt_disconnect_done(xprt);
+   /* This receive buffer gets reposted automatically
+* when the connection is re-established. */
+   return;
+
+out_short:
+   pr_warn("RPC/RDMA short backward direction call\n");
+
+   if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, &r_xprt->rx_ep, rep))
+   xprt_disconnect_done(xprt);
+   else
+   pr_warn("RPC:   %s: reposting rep %p\n",

[PATCH v1 14/18] svcrdma: Define maximum number of backchannel requests

2015-09-17 Thread Chuck Lever
Extra resources for handling backchannel requests have to be
pre-allocated when a transport instance is created. Set a limit.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h  |5 +
 net/sunrpc/xprtrdma/svc_rdma_transport.c |6 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index fb4013e..6ce7495 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -180,6 +180,11 @@ struct svcxprt_rdma {
 #define RPCRDMA_SQ_DEPTH_MULT   8
 #define RPCRDMA_MAX_REQUESTS32
 #define RPCRDMA_MAX_REQ_SIZE4096
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+#define RPCRDMA_MAX_BC_REQUESTS8
+#else
+#define RPCRDMA_MAX_BC_REQUESTS0
+#endif
 
 #define RPCSVC_MAXPAYLOAD_RDMA RPCSVC_MAXPAYLOAD
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index a133b1e..23aba30 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -935,8 +935,10 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt 
*xprt)
  (size_t)RPCSVC_MAXPAGES);
newxprt->sc_max_sge_rd = min_t(size_t, devattr.max_sge_rd,
   RPCSVC_MAXPAGES);
+   /* XXX: what if HCA can't support enough WRs for bc operation? */
newxprt->sc_max_requests = min((size_t)devattr.max_qp_wr,
-  (size_t)svcrdma_max_requests);
+  (size_t)(svcrdma_max_requests +
+  RPCRDMA_MAX_BC_REQUESTS));
newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_max_requests;
 
/*
@@ -976,7 +978,9 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt 
*xprt)
qp_attr.event_handler = qp_event_handler;
qp_attr.qp_context = &newxprt->sc_xprt;
qp_attr.cap.max_send_wr = newxprt->sc_sq_depth;
+   qp_attr.cap.max_send_wr += RPCRDMA_MAX_BC_REQUESTS;
qp_attr.cap.max_recv_wr = newxprt->sc_max_requests;
+   qp_attr.cap.max_recv_wr += RPCRDMA_MAX_BC_REQUESTS;
qp_attr.cap.max_send_sge = newxprt->sc_max_sge;
qp_attr.cap.max_recv_sge = newxprt->sc_max_sge;
qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;



[PATCH v1 15/18] svcrdma: Add svc_rdma_get_context() API that is allowed to fail

2015-09-17 Thread Chuck Lever
To support backward direction calls, I'm going to add an
svc_rdma_get_context() call in the client RDMA transport.

Because it is called from ->buf_alloc(), it can't sleep waiting for memory.
So add an API that can get a server op_ctxt but won't sleep.
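
A hypothetical caller-side sketch of the contract this implies (the helper
name and gfp flags below are illustrative; the actual caller added in patch
18 passes RPCRDMA_DEF_GFP): the caller must now handle a NULL return instead
of relying on __GFP_NOFAIL.

/* Hypothetical caller sketch: a context that must not sleep asks for an
 * op_ctxt and backs off on failure. GFP_NOWAIT | __GFP_NOWARN is just one
 * plausible flag choice for such a path.
 */
static struct svc_rdma_op_ctxt *
get_ctxt_nosleep(struct svcxprt_rdma *rdma)
{
	struct svc_rdma_op_ctxt *ctxt;

	ctxt = svc_rdma_get_context_gfp(rdma, GFP_NOWAIT | __GFP_NOWARN);
	if (!ctxt)
		return NULL;	/* caller retries or fails the allocation */
	return ctxt;
}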

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/svc_rdma.h  |2 ++
 net/sunrpc/xprtrdma/svc_rdma_transport.c |   28 +++-
 2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 6ce7495..2500dd1 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -224,6 +224,8 @@ extern void svc_rdma_send_error(struct svcxprt_rdma *, 
struct rpcrdma_msg *,
 extern int svc_rdma_post_recv(struct svcxprt_rdma *);
 extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *);
 extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
+extern struct svc_rdma_op_ctxt *svc_rdma_get_context_gfp(struct svcxprt_rdma *,
+gfp_t);
 extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
 extern void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt);
 extern struct svc_rdma_req_map *svc_rdma_get_req_map(void);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 23aba30..c4083a3 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -153,17 +153,35 @@ static void svc_rdma_bc_free(struct svc_xprt *xprt)
 }
 #endif /* CONFIG_SUNRPC_BACKCHANNEL */
 
-struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
+static void svc_rdma_init_context(struct svcxprt_rdma *xprt,
+ struct svc_rdma_op_ctxt *ctxt)
 {
-   struct svc_rdma_op_ctxt *ctxt;
-
-   ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep,
-   GFP_KERNEL | __GFP_NOFAIL);
ctxt->xprt = xprt;
INIT_LIST_HEAD(&ctxt->dto_q);
ctxt->count = 0;
ctxt->frmr = NULL;
atomic_inc(&xprt->sc_ctxt_used);
+}
+
+struct svc_rdma_op_ctxt *svc_rdma_get_context_gfp(struct svcxprt_rdma *xprt,
+ gfp_t flags)
+{
+   struct svc_rdma_op_ctxt *ctxt;
+
+   ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep, flags);
+   if (!ctxt)
+   return NULL;
+   svc_rdma_init_context(xprt, ctxt);
+   return ctxt;
+}
+
+struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
+{
+   struct svc_rdma_op_ctxt *ctxt;
+
+   ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep,
+   GFP_KERNEL | __GFP_NOFAIL);
+   svc_rdma_init_context(xprt, ctxt);
return ctxt;
 }
 



[PATCH v1 08/18] xprtrdma: Pre-allocate Work Requests for backchannel

2015-09-17 Thread Chuck Lever
Pre-allocate extra send and receive Work Requests needed to handle
backchannel receives and sends.

The transport doesn't know how many extra WRs to pre-allocate until
the xprt_setup_backchannel() call, but that's long after the WRs are
allocated during forechannel setup.

So, use a fixed value for now.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/backchannel.c |4 
 net/sunrpc/xprtrdma/verbs.c   |   14 --
 net/sunrpc/xprtrdma/xprt_rdma.h   |   10 ++
 3 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index c0a42ad..f5c7122 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -123,6 +123,9 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int 
reqs)
 * Twice as many rpc_rqsts are prepared to ensure there is
 * always an rpc_rqst available as soon as a reply is sent.
 */
+   if (reqs > RPCRDMA_BACKWARD_WRS >> 1)
+   goto out_err;
+
for (i = 0; i < (reqs << 1); i++) {
rqst = kzalloc(sizeof(*rqst), GFP_KERNEL);
if (!rqst) {
@@ -159,6 +162,7 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int 
reqs)
 out_free:
xprt_rdma_bc_destroy(xprt, reqs);
 
+out_err:
pr_err("RPC:   %s: setup backchannel transport failed\n", __func__);
return -ENOMEM;
 }
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 1e4a948..133c720 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -614,6 +614,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia 
*ia,
struct ib_device_attr *devattr = &ia->ri_devattr;
struct ib_cq *sendcq, *recvcq;
struct ib_cq_init_attr cq_attr = {};
+   unsigned int max_qp_wr;
int rc, err;
 
if (devattr->max_sge < RPCRDMA_MAX_IOVS) {
@@ -622,18 +623,27 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct 
rpcrdma_ia *ia,
return -ENOMEM;
}
 
+   if (devattr->max_qp_wr <= RPCRDMA_BACKWARD_WRS) {
+   dprintk("RPC:   %s: insufficient wqe's available\n",
+   __func__);
+   return -ENOMEM;
+   }
+   max_qp_wr = devattr->max_qp_wr - RPCRDMA_BACKWARD_WRS;
+
/* check provider's send/recv wr limits */
-   if (cdata->max_requests > devattr->max_qp_wr)
-   cdata->max_requests = devattr->max_qp_wr;
+   if (cdata->max_requests > max_qp_wr)
+   cdata->max_requests = max_qp_wr;
 
ep->rep_attr.event_handler = rpcrdma_qp_async_error_upcall;
ep->rep_attr.qp_context = ep;
ep->rep_attr.srq = NULL;
ep->rep_attr.cap.max_send_wr = cdata->max_requests;
+   ep->rep_attr.cap.max_send_wr += RPCRDMA_BACKWARD_WRS;
rc = ia->ri_ops->ro_open(ia, ep, cdata);
if (rc)
return rc;
ep->rep_attr.cap.max_recv_wr = cdata->max_requests;
+   ep->rep_attr.cap.max_recv_wr += RPCRDMA_BACKWARD_WRS;
ep->rep_attr.cap.max_send_sge = RPCRDMA_MAX_IOVS;
ep->rep_attr.cap.max_recv_sge = 1;
ep->rep_attr.cap.max_inline_data = 0;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 2ca0567..37d0d7f 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -101,6 +101,16 @@ struct rpcrdma_ep {
  */
 #define RPCRDMA_IGNORE_COMPLETION  (0ULL)
 
+/* Pre-allocate extra Work Requests for handling backward receives
+ * and sends. This is a fixed value because the Work Queues are
+ * allocated when the forward channel is set up.
+ */
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+#define RPCRDMA_BACKWARD_WRS   (8)
+#else
+#define RPCRDMA_BACKWARD_WRS   (0)
+#endif
+
 /* Registered buffer -- registered kmalloc'd memory for RDMA SEND/RECV
  *
  * The below structure appears at the front of a large region of kmalloc'd



[PATCH v1 06/18] SUNRPC: Abstract backchannel operations

2015-09-17 Thread Chuck Lever
xprt_{setup,destroy}_backchannel() won't be adequate for RPC/RDMA
bi-direction. In particular, receive buffers have to be pre-
registered and posted in order to receive incoming backchannel
requests.

Add a virtual function call to allow the insertion of appropriate
backchannel setup and destruction methods for each transport.

In addition, freeing a backchannel request is a little different
for RPC/RDMA. Introduce an rpc_xprt_op to handle the difference.

Signed-off-by: Chuck Lever 
---
 include/linux/sunrpc/bc_xprt.h |5 +
 include/linux/sunrpc/xprt.h|3 +++
 net/sunrpc/backchannel_rqst.c  |   24 ++--
 net/sunrpc/xprtsock.c  |   15 +++
 4 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/include/linux/sunrpc/bc_xprt.h b/include/linux/sunrpc/bc_xprt.h
index 8df43c9f..4397a48 100644
--- a/include/linux/sunrpc/bc_xprt.h
+++ b/include/linux/sunrpc/bc_xprt.h
@@ -38,6 +38,11 @@ void xprt_free_bc_request(struct rpc_rqst *req);
 int xprt_setup_backchannel(struct rpc_xprt *, unsigned int min_reqs);
 void xprt_destroy_backchannel(struct rpc_xprt *, unsigned int max_reqs);
 
+/* Socket backchannel transport methods */
+int xprt_setup_bc(struct rpc_xprt *xprt, unsigned int min_reqs);
+void xprt_destroy_bc(struct rpc_xprt *xprt, unsigned int max_reqs);
+void xprt_free_bc_rqst(struct rpc_rqst *req);
+
 /*
  * Determine if a shared backchannel is in use
  */
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 0fb9acb..81e3433 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -136,6 +136,9 @@ struct rpc_xprt_ops {
int (*enable_swap)(struct rpc_xprt *xprt);
void(*disable_swap)(struct rpc_xprt *xprt);
void(*inject_disconnect)(struct rpc_xprt *xprt);
+   int (*bc_setup)(struct rpc_xprt *xprt, unsigned int 
min_reqs);
+   void(*bc_free_rqst)(struct rpc_rqst *rqst);
+   void(*bc_destroy)(struct rpc_xprt *xprt, unsigned int 
max_reqs);
 };
 
 /*
diff --git a/net/sunrpc/backchannel_rqst.c b/net/sunrpc/backchannel_rqst.c
index 6255d14..c823be6 100644
--- a/net/sunrpc/backchannel_rqst.c
+++ b/net/sunrpc/backchannel_rqst.c
@@ -138,6 +138,14 @@ out_free:
  */
 int xprt_setup_backchannel(struct rpc_xprt *xprt, unsigned int min_reqs)
 {
+   if (!xprt->ops->bc_setup)
+   return -ENOSYS;
+   return xprt->ops->bc_setup(xprt, min_reqs);
+}
+EXPORT_SYMBOL_GPL(xprt_setup_backchannel);
+
+int xprt_setup_bc(struct rpc_xprt *xprt, unsigned int min_reqs)
+{
struct rpc_rqst *req;
struct list_head tmp_list;
int i;
@@ -192,7 +200,6 @@ out_free:
dprintk("RPC:   setup backchannel transport failed\n");
return -ENOMEM;
 }
-EXPORT_SYMBOL_GPL(xprt_setup_backchannel);
 
 /**
  * xprt_destroy_backchannel - Destroys the backchannel preallocated structures.
@@ -205,6 +212,13 @@ EXPORT_SYMBOL_GPL(xprt_setup_backchannel);
  */
 void xprt_destroy_backchannel(struct rpc_xprt *xprt, unsigned int max_reqs)
 {
+   if (xprt->ops->bc_destroy)
+   xprt->ops->bc_destroy(xprt, max_reqs);
+}
+EXPORT_SYMBOL_GPL(xprt_destroy_backchannel);
+
+void xprt_destroy_bc(struct rpc_xprt *xprt, unsigned int max_reqs)
+{
struct rpc_rqst *req = NULL, *tmp = NULL;
 
dprintk("RPC:destroy backchannel transport\n");
@@ -227,7 +241,6 @@ out:
dprintk("RPC:backchannel list empty= %s\n",
list_empty(&xprt->bc_pa_list) ? "true" : "false");
 }
-EXPORT_SYMBOL_GPL(xprt_destroy_backchannel);
 
 static struct rpc_rqst *xprt_alloc_bc_request(struct rpc_xprt *xprt, __be32 
xid)
 {
@@ -264,6 +277,13 @@ void xprt_free_bc_request(struct rpc_rqst *req)
 {
struct rpc_xprt *xprt = req->rq_xprt;
 
+   xprt->ops->bc_free_rqst(req);
+}
+
+void xprt_free_bc_rqst(struct rpc_rqst *req)
+{
+   struct rpc_xprt *xprt = req->rq_xprt;
+
dprintk("RPC:   free backchannel req=%p\n", req);
 
req->rq_connect_cookie = xprt->connect_cookie - 1;
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 7be90bc..d2ad732 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2532,6 +2532,11 @@ static struct rpc_xprt_ops xs_local_ops = {
.print_stats= xs_local_print_stats,
.enable_swap= xs_enable_swap,
.disable_swap   = xs_disable_swap,
+#ifdef CONFIG_SUNRPC_BACKCHANNEL
+   .bc_setup   = xprt_setup_bc,
+   .bc_free_rqst   = xprt_free_bc_rqst,
+   .bc_destroy = xprt_destroy_bc,
+#endif
 };
 
 static struct rpc_xprt_ops xs_udp_ops = {
@@ -2554,6 +2559,11 @@ static struct rpc_xprt_ops xs_udp_ops = {
.enable_swap= xs_enable_swap,
.disable_swap   = xs_disable_swap,
.inject_disconnect  = xs_inject_disconnect,
+#ifdef CONFIG_SUNRPC_BACKCHANNEL
+   .bc_setup   = 

[PATCH v1 05/18] xprtrdma: Replace send and receive arrays

2015-09-17 Thread Chuck Lever
The rb_send_bufs and rb_recv_bufs arrays are used to implement a
pair of stacks for keeping track of free rpcrdma_req and rpcrdma_rep
structs. Replace those arrays with free lists.

To allow more than 512 RPCs in-flight at once, each of these arrays
would be larger than a page (assuming 8-byte addresses and 4KB
pages). Allowing up to 64K in-flight RPCs (as TCP now does), each
buffer array would have to be 128 pages. That's an order-7
allocation. (Not that we're going there.)

A list is easier to expand dynamically. Instead of allocating a
larger array of pointers and copying the existing pointers to the
new array, simply append more buffers to each list.

This also makes it simpler to manage receive buffers that might
catch backwards-direction calls, or to post receive buffers in
bulk to amortize the overhead of ib_post_recv.
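
As a rough sketch of the list-based pool (the helper names here are made
up for illustration; the caller is assumed to hold rb_lock, and handling
of an empty list is left to the real rpcrdma_buffer_get()/_put() code):

struct rpcrdma_req *
rpcrdma_get_req_locked(struct rpcrdma_buffer *buf)
{
	struct rpcrdma_req *req;

	if (list_empty(&buf->rb_send_bufs))
		return NULL;
	req = list_first_entry(&buf->rb_send_bufs,
			       struct rpcrdma_req, rl_free);
	list_del(&req->rl_free);			/* "pop" */
	return req;
}

void
rpcrdma_put_req_locked(struct rpcrdma_buffer *buf, struct rpcrdma_req *req)
{
	list_add(&req->rl_free, &buf->rb_send_bufs);	/* "push" */
}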

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |  141 +--
 net/sunrpc/xprtrdma/xprt_rdma.h |9 +-
 2 files changed, 66 insertions(+), 84 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index ac1345b..8d99214 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -962,44 +962,18 @@ rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt)
 {
struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
-   struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
-   char *p;
-   size_t len;
int i, rc;
 
-   buf->rb_max_requests = cdata->max_requests;
+   buf->rb_max_requests = r_xprt->rx_data.max_requests;
spin_lock_init(&buf->rb_lock);
 
-   /* Need to allocate:
-*   1.  arrays for send and recv pointers
-*   2.  arrays of struct rpcrdma_req to fill in pointers
-*   3.  array of struct rpcrdma_rep for replies
-* Send/recv buffers in req/rep need to be registered
-*/
-   len = buf->rb_max_requests *
-   (sizeof(struct rpcrdma_req *) + sizeof(struct rpcrdma_rep *));
-
-   p = kzalloc(len, GFP_KERNEL);
-   if (p == NULL) {
-   dprintk("RPC:   %s: req_t/rep_t/pad kzalloc(%zd) failed\n",
-   __func__, len);
-   rc = -ENOMEM;
-   goto out;
-   }
-   buf->rb_pool = p;   /* for freeing it later */
-
-   buf->rb_send_bufs = (struct rpcrdma_req **) p;
-   p = (char *) &buf->rb_send_bufs[buf->rb_max_requests];
-   buf->rb_recv_bufs = (struct rpcrdma_rep **) p;
-   p = (char *) &buf->rb_recv_bufs[buf->rb_max_requests];
-
rc = ia->ri_ops->ro_init(r_xprt);
if (rc)
goto out;
 
+   INIT_LIST_HEAD(&buf->rb_send_bufs);
for (i = 0; i < buf->rb_max_requests; i++) {
struct rpcrdma_req *req;
-   struct rpcrdma_rep *rep;
 
req = rpcrdma_create_req(r_xprt);
if (IS_ERR(req)) {
@@ -1008,7 +982,12 @@ rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt)
rc = PTR_ERR(req);
goto out;
}
-   buf->rb_send_bufs[i] = req;
+   list_add(&req->rl_free, &buf->rb_send_bufs);
+   }
+
+   INIT_LIST_HEAD(&buf->rb_recv_bufs);
+   for (i = 0; i < buf->rb_max_requests + 2; i++) {
+   struct rpcrdma_rep *rep;
 
rep = rpcrdma_create_rep(r_xprt);
if (IS_ERR(rep)) {
@@ -1017,7 +996,7 @@ rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt)
rc = PTR_ERR(rep);
goto out;
}
-   buf->rb_recv_bufs[i] = rep;
+   list_add(&rep->rr_list, &buf->rb_recv_bufs);
}
 
return 0;
@@ -1051,25 +1030,26 @@ void
 rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
 {
struct rpcrdma_ia *ia = rdmab_to_ia(buf);
-   int i;
 
-   /* clean up in reverse order from create
-*   1.  recv mr memory (mr free, then kfree)
-*   2.  send mr memory (mr free, then kfree)
-*   3.  MWs
-*/
-   dprintk("RPC:   %s: entering\n", __func__);
+   while (!list_empty(&buf->rb_recv_bufs)) {
+   struct rpcrdma_rep *rep = list_entry(buf->rb_recv_bufs.next,
+struct rpcrdma_rep,
+rr_list);
 
-   for (i = 0; i < buf->rb_max_requests; i++) {
-   if (buf->rb_recv_bufs)
-   rpcrdma_destroy_rep(ia, buf->rb_recv_bufs[i]);
-   if (buf->rb_send_bufs)
-   rpcrdma_destroy_req(ia, buf->rb_send_bufs[i]);
+   list_del(&rep->rr_list);
+   rpcrdma_destroy_rep(ia, rep);
}
 
-   ia->ri_ops->ro_destroy(buf);
+   while (!list_empty(&buf->rb_send_bufs)) {
+   struct rpcrdma_req *req = list_entry(buf->rb_send_bufs.next,
+

[PATCH v1 03/18] xprtrdma: Remove completion polling budgets

2015-09-17 Thread Chuck Lever
Commit 8301a2c047cc ("xprtrdma: Limit work done by completion
handler") was supposed to prevent xprtrdma's upcall handlers from
starving other softIRQ work by letting them return to the provider
before all CQEs have been polled.

The logic assumes the provider will call the upcall handler again
immediately if the CQ is re-armed while there are still queued CQEs.

This assumption is invalid. The IBTA spec says that after a CQ is
armed, the hardware must interrupt only when a new CQE is inserted.
xprtrdma can't rely on the provider calling again, even though some
providers do.

Therefore, leaving CQEs on queue makes sense only when there is
another mechanism that ensures all remaining CQEs are consumed in a
timely fashion. xprtrdma does not have such a mechanism. If a CQE
remains queued, the transport can wait forever to send the next RPC.

Finally, move the wcs array back onto the stack to ensure that the
poll array is always local to the CPU where the completion upcall is
running.

Fixes: 8301a2c047cc ("xprtrdma: Limit work done by completion ...")
Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/verbs.c |  100 ++-
 net/sunrpc/xprtrdma/xprt_rdma.h |5 --
 2 files changed, 45 insertions(+), 60 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 8a477e2..f2e3863 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -158,34 +158,37 @@ rpcrdma_sendcq_process_wc(struct ib_wc *wc)
}
 }
 
-static int
+/* The wc array is on stack: automatic memory is always CPU-local.
+ *
+ * The common case is a single completion is ready. By asking
+ * for two entries, a return code of 1 means there is exactly
+ * one completion and no more. We don't have to poll again to
+ * know that the CQ is now empty.
+ */
+static void
 rpcrdma_sendcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
 {
-   struct ib_wc *wcs;
-   int budget, count, rc;
+   struct ib_wc *pos, wcs[2];
+   int count, rc;
 
-   budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
do {
-   wcs = ep->rep_send_wcs;
+   pos = wcs;
 
-   rc = ib_poll_cq(cq, RPCRDMA_POLLSIZE, wcs);
-   if (rc <= 0)
-   return rc;
+   rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), pos);
+   if (rc < 0)
+   goto out_warn;
 
count = rc;
while (count-- > 0)
-   rpcrdma_sendcq_process_wc(wcs++);
-   } while (rc == RPCRDMA_POLLSIZE && --budget);
-   return 0;
+   rpcrdma_sendcq_process_wc(pos++);
+   } while (rc == ARRAY_SIZE(wcs));
+   return;
+
+out_warn:
+   pr_warn("RPC:   %s: ib_poll_cq() failed %i\n", __func__, rc);
 }
 
-/*
- * Handle send, fast_reg_mr, and local_inv completions.
- *
- * Send events are typically suppressed and thus do not result
- * in an upcall. Occasionally one is signaled, however. This
- * prevents the provider's completion queue from wrapping and
- * losing a completion.
+/* Handle provider send completion upcalls.
  */
 static void
 rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
@@ -193,12 +196,7 @@ rpcrdma_sendcq_upcall(struct ib_cq *cq, void *cq_context)
struct rpcrdma_ep *ep = (struct rpcrdma_ep *)cq_context;
int rc;
 
-   rc = rpcrdma_sendcq_poll(cq, ep);
-   if (rc) {
-   dprintk("RPC:   %s: ib_poll_cq failed: %i\n",
-   __func__, rc);
-   return;
-   }
+   rpcrdma_sendcq_poll(cq, ep);
 
rc = ib_req_notify_cq(cq,
IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
@@ -247,44 +245,41 @@ out_fail:
goto out_schedule;
 }
 
-static int
+/* The wc array is on stack: automatic memory is always CPU-local.
+ *
+ * struct ib_wc is 64 bytes, making the poll array potentially
+ * large. But this is at the bottom of the call chain. Further
+ * substantial work is done in another thread.
+ */
+static void
 rpcrdma_recvcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
 {
-   struct list_head sched_list;
-   struct ib_wc *wcs;
-   int budget, count, rc;
+   struct ib_wc *pos, wcs[4];
+   LIST_HEAD(sched_list);
+   int count, rc;
 
-   INIT_LIST_HEAD(&sched_list);
-   budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
do {
-   wcs = ep->rep_recv_wcs;
+   pos = wcs;
 
-   rc = ib_poll_cq(cq, RPCRDMA_POLLSIZE, wcs);
-   if (rc <= 0)
-   goto out_schedule;
+   rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), pos);
+   if (rc < 0)
+   goto out_warn;
 
count = rc;
while (count-- > 0)
-   rpcrdma_recvcq_process_wc(wcs++, &sched_list);
-   } while (rc == RPCRDMA_POLLSIZE && --budget);
-   rc = 0;
+   rpcrdma_recvcq_

[PATCH v1 02/18] xprtrdma: Replace global lkey with lkey local to PD

2015-09-17 Thread Chuck Lever
The core API has changed so that devices that do not have a global
DMA lkey automatically create an mr, per-PD, and make that lkey
available. The global DMA lkey interface is going away in favor of
the per-PD DMA lkey.

The per-PD DMA lkey is always available. Convert xprtrdma to use the
device's per-PD DMA lkey for regbufs, no matter which memory
registration scheme is in use.
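
In practice the conversion means a regbuf's SGE takes its lkey directly
from the PD; a minimal sketch (buf and len are placeholders, error
handling elided):

	struct ib_sge sge;

	sge.addr   = ib_dma_map_single(ia->ri_device, buf, len,
				       DMA_TO_DEVICE);
	sge.length = len;
	sge.lkey   = ia->ri_pd->local_dma_lkey;	/* always available, per-PD */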

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/fmr_ops.c  |   19 ---
 net/sunrpc/xprtrdma/frwr_ops.c |5 -
 net/sunrpc/xprtrdma/physical_ops.c |   10 +-
 net/sunrpc/xprtrdma/verbs.c|2 +-
 net/sunrpc/xprtrdma/xprt_rdma.h|1 -
 5 files changed, 2 insertions(+), 35 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index cb25c89..f1e8daf 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -39,25 +39,6 @@ static int
 fmr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
struct rpcrdma_create_data_internal *cdata)
 {
-   struct ib_device_attr *devattr = &ia->ri_devattr;
-   struct ib_mr *mr;
-
-   /* Obtain an lkey to use for the regbufs, which are
-* protected from remote access.
-*/
-   if (devattr->device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY) {
-   ia->ri_dma_lkey = ia->ri_device->local_dma_lkey;
-   } else {
-   mr = ib_get_dma_mr(ia->ri_pd, IB_ACCESS_LOCAL_WRITE);
-   if (IS_ERR(mr)) {
-   pr_err("%s: ib_get_dma_mr for failed with %lX\n",
-  __func__, PTR_ERR(mr));
-   return -ENOMEM;
-   }
-   ia->ri_dma_lkey = ia->ri_dma_mr->lkey;
-   ia->ri_dma_mr = mr;
-   }
-
return 0;
 }
 
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 21b3efb..004f1ad 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -189,11 +189,6 @@ frwr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
struct ib_device_attr *devattr = &ia->ri_devattr;
int depth, delta;
 
-   /* Obtain an lkey to use for the regbufs, which are
-* protected from remote access.
-*/
-   ia->ri_dma_lkey = ia->ri_device->local_dma_lkey;
-
ia->ri_max_frmr_depth =
min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
  devattr->max_fast_reg_page_list_len);
diff --git a/net/sunrpc/xprtrdma/physical_ops.c 
b/net/sunrpc/xprtrdma/physical_ops.c
index 72cf8b1..617b76f 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -23,7 +23,6 @@ static int
 physical_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
 struct rpcrdma_create_data_internal *cdata)
 {
-   struct ib_device_attr *devattr = &ia->ri_devattr;
struct ib_mr *mr;
 
/* Obtain an rkey to use for RPC data payloads.
@@ -37,15 +36,8 @@ physical_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep 
*ep,
   __func__, PTR_ERR(mr));
return -ENOMEM;
}
-   ia->ri_dma_mr = mr;
-
-   /* Obtain an lkey to use for regbufs.
-*/
-   if (devattr->device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
-   ia->ri_dma_lkey = ia->ri_device->local_dma_lkey;
-   else
-   ia->ri_dma_lkey = ia->ri_dma_mr->lkey;
 
+   ia->ri_dma_mr = mr;
return 0;
 }
 
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 01a314a..8a477e2 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1255,7 +1255,7 @@ rpcrdma_alloc_regbuf(struct rpcrdma_ia *ia, size_t size, 
gfp_t flags)
goto out_free;
 
iov->length = size;
-   iov->lkey = ia->ri_dma_lkey;
+   iov->lkey = ia->ri_pd->local_dma_lkey;
rb->rg_size = size;
rb->rg_owner = NULL;
return rb;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 0251222..c09414e 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -65,7 +65,6 @@ struct rpcrdma_ia {
struct rdma_cm_id   *ri_id;
struct ib_pd*ri_pd;
struct ib_mr*ri_dma_mr;
-   u32 ri_dma_lkey;
struct completion   ri_done;
int ri_async_rc;
unsigned intri_max_frmr_depth;



[PATCH v1 04/18] xprtrdma: Refactor reply handler error handling

2015-09-17 Thread Chuck Lever
Clean up: The error cases in rpcrdma_reply_handler() almost never
execute. Ensure the compiler places them out of the hot path.

No behavior change expected.
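
The shape of the change is the usual goto-to-the-bottom pattern, sketched
generically here (the struct and label names are placeholders, not the
xprtrdma types):

static void handle_reply(struct reply *rep)
{
	if (unlikely(rep->len < REPLY_HDR_MIN))
		goto out_shortreply;

	/* hot path: parse the header and complete the RPC ... */
	return;

out_shortreply:
	/* cold path: placed after the hot path's return, so the compiler
	 * emits it out of line */
	pr_debug("dropping short reply\n");
}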

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/rpc_rdma.c  |   90 ++-
 net/sunrpc/xprtrdma/verbs.c |2 -
 net/sunrpc/xprtrdma/xprt_rdma.h |2 +
 3 files changed, 54 insertions(+), 40 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index bc8bd65..287c874 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -741,52 +741,27 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
unsigned long cwnd;
u32 credits;
 
-   /* Check status. If bad, signal disconnect and return rep to pool */
-   if (rep->rr_len == ~0U) {
-   rpcrdma_recv_buffer_put(rep);
-   if (r_xprt->rx_ep.rep_connected == 1) {
-   r_xprt->rx_ep.rep_connected = -EIO;
-   rpcrdma_conn_func(&r_xprt->rx_ep);
-   }
-   return;
-   }
-   if (rep->rr_len < RPCRDMA_HDRLEN_MIN) {
-   dprintk("RPC:   %s: short/invalid reply\n", __func__);
-   goto repost;
-   }
+   dprintk("RPC:   %s: incoming rep %p\n", __func__, rep);
+
+   if (rep->rr_len == RPCRDMA_BAD_LEN)
+   goto out_badstatus;
+   if (rep->rr_len < RPCRDMA_HDRLEN_MIN)
+   goto out_shortreply;
+
headerp = rdmab_to_msg(rep->rr_rdmabuf);
-   if (headerp->rm_vers != rpcrdma_version) {
-   dprintk("RPC:   %s: invalid version %d\n",
-   __func__, be32_to_cpu(headerp->rm_vers));
-   goto repost;
-   }
+   if (headerp->rm_vers != rpcrdma_version)
+   goto out_badversion;
 
/* Get XID and try for a match. */
spin_lock(&xprt->transport_lock);
rqst = xprt_lookup_rqst(xprt, headerp->rm_xid);
-   if (rqst == NULL) {
-   spin_unlock(&xprt->transport_lock);
-   dprintk("RPC:   %s: reply 0x%p failed "
-   "to match any request xid 0x%08x len %d\n",
-   __func__, rep, be32_to_cpu(headerp->rm_xid),
-   rep->rr_len);
-repost:
-   r_xprt->rx_stats.bad_reply_count++;
-   if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, &r_xprt->rx_ep, rep))
-   rpcrdma_recv_buffer_put(rep);
-
-   return;
-   }
+   if (!rqst)
+   goto out_nomatch;
 
/* get request object */
req = rpcr_to_rdmar(rqst);
-   if (req->rl_reply) {
-   spin_unlock(&xprt->transport_lock);
-   dprintk("RPC:   %s: duplicate reply 0x%p to RPC "
-   "request 0x%p: xid 0x%08x\n", __func__, rep, req,
-   be32_to_cpu(headerp->rm_xid));
-   goto repost;
-   }
+   if (req->rl_reply)
+   goto out_duplicate;
 
dprintk("RPC:   %s: reply 0x%p completes request 0x%p\n"
"   RPC request 0x%p xid 0x%08x\n",
@@ -883,8 +858,45 @@ badheader:
if (xprt->cwnd > cwnd)
xprt_release_rqst_cong(rqst->rq_task);
 
+   xprt_complete_rqst(rqst->rq_task, status);
+   spin_unlock(&xprt->transport_lock);
dprintk("RPC:   %s: xprt_complete_rqst(0x%p, 0x%p, %d)\n",
__func__, xprt, rqst, status);
-   xprt_complete_rqst(rqst->rq_task, status);
+   return;
+
+out_badstatus:
+   rpcrdma_recv_buffer_put(rep);
+   if (r_xprt->rx_ep.rep_connected == 1) {
+   r_xprt->rx_ep.rep_connected = -EIO;
+   rpcrdma_conn_func(&r_xprt->rx_ep);
+   }
+   return;
+
+out_shortreply:
+   dprintk("RPC:   %s: short/invalid reply\n", __func__);
+   goto repost;
+
+out_badversion:
+   dprintk("RPC:   %s: invalid version %d\n",
+   __func__, be32_to_cpu(headerp->rm_vers));
+   goto repost;
+
+out_nomatch:
+   spin_unlock(&xprt->transport_lock);
+   dprintk("RPC:   %s: reply 0x%p failed "
+   "to match any request xid 0x%08x len %d\n",
+   __func__, rep, be32_to_cpu(headerp->rm_xid),
+   rep->rr_len);
+   goto repost;
+
+out_duplicate:
spin_unlock(&xprt->transport_lock);
+   dprintk("RPC:   %s: duplicate reply 0x%p to RPC "
+   "request 0x%p: xid 0x%08x\n", __func__, rep, req,
+   be32_to_cpu(headerp->rm_xid));
+
+repost:
+   r_xprt->rx_stats.bad_reply_count++;
+   if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, &r_xprt->rx_ep, rep))
+   rpcrdma_recv_buffer_put(rep);
 }
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index f2e3863..ac1345b 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -241,7 +241,7 @@ out_fail:
if (wc->status != I

[PATCH v1 00/18] RFC NFS/RDMA patches for merging into v4.4

2015-09-17 Thread Chuck Lever
This series begins with the usual fixes, then introduces patches
that add support for bi-directional RPC/RDMA. Bi-directional
RPC/RDMA is a pre-requisite for NFSv4.1 on RDMA transports. It
includes both client and server side support, though the server side
is not as far along as I had hoped, and could be postponed to 4.5.

This v1 is an initial request for review, not a "these suckers are
ready to be merged."

Also available in the "nfs-rdma-for-4.4" topic branch of this git repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git

Or for browsing:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.4

---

Chuck Lever (18):
  xprtrdma: Enable swap-on-NFS/RDMA
  xprtrdma: Replace global lkey with lkey local to PD
  xprtrdma: Remove completion polling budgets
  xprtrdma: Refactor reply handler error handling
  xprtrdma: Replace send and receive arrays
  SUNRPC: Abstract backchannel operations
  xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers
  xprtrdma: Pre-allocate Work Requests for backchannel
  xprtrdma: Add support for sending backward direction RPC replies
  xprtrdma: Handle incoming backward direction RPC calls
  svcrdma: Add backward direction service for RPC/RDMA transport
  SUNRPC: Remove the TCP-only restriction in bc_svc_process()
  NFS: Enable client side NFSv4.1 backchannel to use other transports
  svcrdma: Define maximum number of backchannel requests
  svcrdma: Add svc_rdma_get_context() API that is allowed to fail
  svcrdma: Add infrastructure to send backwards direction RPC/RDMA calls
  svcrdma: Add infrastructure to receive backwards direction RPC/RDMA 
replies
  xprtrdma: Add class for RDMA backwards direction transport


 fs/nfs/callback.c|   33 ++-
 include/linux/sunrpc/bc_xprt.h   |5 
 include/linux/sunrpc/svc_rdma.h  |   15 +
 include/linux/sunrpc/xprt.h  |6 
 net/sunrpc/backchannel_rqst.c|   24 ++
 net/sunrpc/svc.c |5 
 net/sunrpc/xprt.c|1 
 net/sunrpc/xprtrdma/Makefile |1 
 net/sunrpc/xprtrdma/backchannel.c|  368 ++
 net/sunrpc/xprtrdma/fmr_ops.c|   19 --
 net/sunrpc/xprtrdma/frwr_ops.c   |5 
 net/sunrpc/xprtrdma/physical_ops.c   |   10 -
 net/sunrpc/xprtrdma/rpc_rdma.c   |  212 ++---
 net/sunrpc/xprtrdma/svc_rdma.c   |6 
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |   60 +
 net/sunrpc/xprtrdma/svc_rdma_sendto.c|   63 +
 net/sunrpc/xprtrdma/svc_rdma_transport.c |  104 
 net/sunrpc/xprtrdma/transport.c  |  253 -
 net/sunrpc/xprtrdma/verbs.c  |  341 
 net/sunrpc/xprtrdma/xprt_rdma.h  |   56 -
 net/sunrpc/xprtsock.c|   16 +
 21 files changed, 1341 insertions(+), 262 deletions(-)
 create mode 100644 net/sunrpc/xprtrdma/backchannel.c

--
Chuck Lever


[PATCH v1 01/18] xprtrdma: Enable swap-on-NFS/RDMA

2015-09-17 Thread Chuck Lever
After adding a swapfile on an NFS/RDMA mount and removing the
normal swap partition, I was able to push the NFS client well
into swap without any issue.

I forgot to swapoff the NFS file before rebooting. This pinned
the NFS mount and the IB core and provider, causing shutdown to
hang. I think this is expected and safe behavior. Probably
shutdown scripts should "swapoff -a" before unmounting any
filesystems.

Signed-off-by: Chuck Lever 
---
 net/sunrpc/xprtrdma/transport.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 41e452b..e9e5ed7 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -676,7 +676,7 @@ static void xprt_rdma_print_stats(struct rpc_xprt *xprt, 
struct seq_file *seq)
 static int
 xprt_rdma_enable_swap(struct rpc_xprt *xprt)
 {
-   return -EINVAL;
+   return 0;
 }
 
 static void



Re: [PATCH] IB/ucma: check workqueue allocation before usage

2015-09-17 Thread Sasha Levin
On 09/17/2015 04:10 PM, Hefty, Sean wrote:
> What kernel is this patch against?

Patch is against linux-next.


Thanks,
Sasha

>> Allocating a workqueue might fail, which wasn't checked so far and would
>> lead to NULL ptr derefs when an attempt to use it was made.
>>
>> Signed-off-by: Sasha Levin 
>> ---
>>  drivers/infiniband/core/ucma.c |7 ++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/infiniband/core/ucma.c
>> b/drivers/infiniband/core/ucma.c
>> index a53fc9b..30467d1 100644
>> --- a/drivers/infiniband/core/ucma.c
>> +++ b/drivers/infiniband/core/ucma.c
>> @@ -1624,11 +1624,16 @@ static int ucma_open(struct inode *inode, struct
>> file *filp)
>>  if (!file)
>>  return -ENOMEM;
>>
>> +file->close_wq = create_singlethread_workqueue("ucma_close_id");
>> +if (!file->close_wq) {
>> +kfree(file);
>> +return -ENOMEM;
>> +}
>> +
>>  INIT_LIST_HEAD(&file->event_list);
>>  INIT_LIST_HEAD(&file->ctx_list);
>>  init_waitqueue_head(&file->poll_wait);
>>  mutex_init(&file->mut);
>> -file->close_wq = create_singlethread_workqueue("ucma_close_id");
>>
>>  filp->private_data = file;
>>  file->filp = filp;
>> --
>> 1.7.10.4
> 



RE: [PATCH] IB/ucma: check workqueue allocation before usage

2015-09-17 Thread Hefty, Sean
What kernel is this patch against?

> Allocating a workqueue might fail, which wasn't checked so far and would
> lead to NULL ptr derefs when an attempt to use it was made.
> 
> Signed-off-by: Sasha Levin 
> ---
>  drivers/infiniband/core/ucma.c |7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/core/ucma.c
> b/drivers/infiniband/core/ucma.c
> index a53fc9b..30467d1 100644
> --- a/drivers/infiniband/core/ucma.c
> +++ b/drivers/infiniband/core/ucma.c
> @@ -1624,11 +1624,16 @@ static int ucma_open(struct inode *inode, struct
> file *filp)
>   if (!file)
>   return -ENOMEM;
> 
> + file->close_wq = create_singlethread_workqueue("ucma_close_id");
> + if (!file->close_wq) {
> + kfree(file);
> + return -ENOMEM;
> + }
> +
>   INIT_LIST_HEAD(&file->event_list);
>   INIT_LIST_HEAD(&file->ctx_list);
>   init_waitqueue_head(&file->poll_wait);
>   mutex_init(&file->mut);
> - file->close_wq = create_singlethread_workqueue("ucma_close_id");
> 
>   filp->private_data = file;
>   file->filp = filp;
> --
> 1.7.10.4



[PATCH] IB/ucma: check workqueue allocation before usage

2015-09-17 Thread Sasha Levin
Allocating a workqueue might fail, which wasn't checked so far and would
lead to NULL ptr derefs when an attempt to use it was made.

Signed-off-by: Sasha Levin 
---
 drivers/infiniband/core/ucma.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index a53fc9b..30467d1 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -1624,11 +1624,16 @@ static int ucma_open(struct inode *inode, struct file 
*filp)
if (!file)
return -ENOMEM;
 
+   file->close_wq = create_singlethread_workqueue("ucma_close_id");
+   if (!file->close_wq) {
+   kfree(file);
+   return -ENOMEM;
+   }
+
INIT_LIST_HEAD(&file->event_list);
INIT_LIST_HEAD(&file->ctx_list);
init_waitqueue_head(&file->poll_wait);
mutex_init(&file->mut);
-   file->close_wq = create_singlethread_workqueue("ucma_close_id");
 
filp->private_data = file;
file->filp = filp;
-- 
1.7.10.4




Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications

2015-09-17 Thread Or Gerlitz
On Thu, Sep 17, 2015 at 5:42 PM, Christoph Lameter  wrote:
> Could we simplify it a bit. [...] but avoids all the
> generalizations and workqueues. Had to export two new functions from
> ipoib_multicast.c though.

Do you find some over-complexity in Erez's implementation? What,
specifically? As I said, he's pretty busy, but I hope he can get to
review your proposal early next week.

> This compiles

and... it shouldn't be too hard to test it out, e.g. with ping -b or an
iperf mcast sender, etc.


Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications

2015-09-17 Thread Or Gerlitz
On Thu, Sep 17, 2015 at 5:42 PM, Christoph Lameter  wrote:
> Could we simplify it a bit. This compiles but avoids all the
> generalizations and workqueues.

Do you find some over-complexity in Erez's implementation? What,
specifically? As I said, he's pretty busy, but I hope he can get to
review your proposal.


Or.


[PATCH V2] IB/hfi: Properly set permissions for user device files

2015-09-17 Thread ira . weiny
From: Ira Weiny 

Some of the device files are required to be user accessible for PSM while
most should remain accessible only by root.

Add a parameter to hfi1_cdev_init which controls whether the user should
have access to this device; devices that should be user accessible are
placed in a different class with the appropriate devnode callback.

In addition, set the devnode callback for the existing class to make
those permissions a bit more explicit.

Finally, remove the unnecessary NULL check before class_destroy().
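
A sketch of a call site after this change (the variable names are
illustrative): user-visible nodes pass true and land in the 0666
"hfi1_user" class, everything else stays in the root-only 0600 class.

	ret = hfi1_cdev_init(minor, name, fops,
			     &user_cdev, &user_device,
			     true /* user accessible */);
	if (ret)
		return ret;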

Tested-by: Donald Dutile 
Signed-off-by: Haralanov, Mitko (mitko.harala...@intel.com)
Signed-off-by: Ira Weiny 

---

Changes from V1:
Fixed typo in error message
added missing goto on error
Remove null checks for class
set user_class to NULL on error

 drivers/staging/rdma/hfi1/device.c   | 54 
 drivers/staging/rdma/hfi1/device.h   |  3 +-
 drivers/staging/rdma/hfi1/diag.c |  5 ++--
 drivers/staging/rdma/hfi1/file_ops.c |  9 --
 4 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/device.c 
b/drivers/staging/rdma/hfi1/device.c
index 07c87a87775f..bc26a5392712 100644
--- a/drivers/staging/rdma/hfi1/device.c
+++ b/drivers/staging/rdma/hfi1/device.c
@@ -57,11 +57,13 @@
 #include "device.h"
 
 static struct class *class;
+static struct class *user_class;
 static dev_t hfi1_dev;
 
 int hfi1_cdev_init(int minor, const char *name,
   const struct file_operations *fops,
-  struct cdev *cdev, struct device **devp)
+  struct cdev *cdev, struct device **devp,
+  bool user_accessible)
 {
const dev_t dev = MKDEV(MAJOR(hfi1_dev), minor);
struct device *device = NULL;
@@ -78,7 +80,11 @@ int hfi1_cdev_init(int minor, const char *name,
goto done;
}
 
-   device = device_create(class, NULL, dev, NULL, "%s", name);
+   if (user_accessible)
+   device = device_create(user_class, NULL, dev, NULL, "%s", name);
+   else
+   device = device_create(class, NULL, dev, NULL, "%s", name);
+
if (!IS_ERR(device))
goto done;
ret = PTR_ERR(device);
@@ -110,6 +116,26 @@ const char *class_name(void)
return hfi1_class_name;
 }
 
+static char *hfi1_devnode(struct device *dev, umode_t *mode)
+{
+   if (mode)
+   *mode = 0600;
+   return kasprintf(GFP_KERNEL, "%s", dev_name(dev));
+}
+
+static const char *hfi1_class_name_user = "hfi1_user";
+const char *class_name_user(void)
+{
+   return hfi1_class_name_user;
+}
+
+static char *hfi1_user_devnode(struct device *dev, umode_t *mode)
+{
+   if (mode)
+   *mode = 0666;
+   return kasprintf(GFP_KERNEL, "%s", dev_name(dev));
+}
+
 int __init dev_init(void)
 {
int ret;
@@ -125,7 +151,22 @@ int __init dev_init(void)
ret = PTR_ERR(class);
pr_err("Could not create device class (err %d)\n", -ret);
unregister_chrdev_region(hfi1_dev, HFI1_NMINORS);
+   goto done;
}
+   class->devnode = hfi1_devnode;
+
+   user_class = class_create(THIS_MODULE, class_name_user());
+   if (IS_ERR(user_class)) {
+   ret = PTR_ERR(user_class);
+   pr_err("Could not create device class for user accessible files 
(err %d)\n",
+  -ret);
+   class_destroy(class);
+   class = NULL;
+   user_class = NULL;
+   unregister_chrdev_region(hfi1_dev, HFI1_NMINORS);
+   goto done;
+   }
+   user_class->devnode = hfi1_user_devnode;
 
 done:
return ret;
@@ -133,10 +174,11 @@ done:
 
 void dev_cleanup(void)
 {
-   if (class) {
-   class_destroy(class);
-   class = NULL;
-   }
+   class_destroy(class);
+   class = NULL;
+
+   class_destroy(user_class);
+   user_class = NULL;
 
unregister_chrdev_region(hfi1_dev, HFI1_NMINORS);
 }
diff --git a/drivers/staging/rdma/hfi1/device.h 
b/drivers/staging/rdma/hfi1/device.h
index 98caecd3d807..2850ff739d81 100644
--- a/drivers/staging/rdma/hfi1/device.h
+++ b/drivers/staging/rdma/hfi1/device.h
@@ -52,7 +52,8 @@
 
 int hfi1_cdev_init(int minor, const char *name,
   const struct file_operations *fops,
-  struct cdev *cdev, struct device **devp);
+  struct cdev *cdev, struct device **devp,
+  bool user_accessible);
 void hfi1_cdev_cleanup(struct cdev *cdev, struct device **devp);
 const char *class_name(void);
 int __init dev_init(void);
diff --git a/drivers/staging/rdma/hfi1/diag.c b/drivers/staging/rdma/hfi1/diag.c
index 6777d6b659cf..b87e4e942ae6 100644
--- a/drivers/staging/rdma/hfi1/diag.c
+++ b/drivers/staging/rdma/hfi1/diag.c
@@ -292,7 +292,7 @@ int hfi1_diag_add(struct hfi1_devdata *dd)
if (atomic_inc_return(&diagpkt_count) == 1) {
ret = hfi1_cdev_i

RE: [PATCH] IB/hfi: Properly set permissions for user device files

2015-09-17 Thread Weiny, Ira
> 
> On Thu, Sep 17, 2015 at 06:18:15PM +0200, Michal Schmidt wrote:
> > On 09/16/2015 11:41 PM, ira.we...@intel.com wrote:
> > > @@ -125,7 +151,20 @@ int __init dev_init(void)
> > >   ret = PTR_ERR(class);
> > >   pr_err("Could not create device class (err %d)\n", -ret);
> > >   unregister_chrdev_region(hfi1_dev, HFI1_NMINORS);
> > > + goto done;
> > >   }
> > > + class->devnode = hfi1_devnode;
> > > +
> > > + user_class = class_create(THIS_MODULE, class_name_user());
> > > + if (IS_ERR(user_class)) {
> > > + ret = PTR_ERR(user_class);
> > > + pr_err("Could not create device class for user accisble files
> > > +(err %d)\n",
> >
> >  Typo in error message.
> 
> And what is the deal with all these pr_err's? This is a driver, it needs to 
> use
> dev_err and related always. Does that need to go in the todo list?
> 
> I'm also skeptical we need a print on every error case :|
> 

This is very early in the driver code and we don't have a struct device at this 
point.

The bulk of the driver uses macros which use dev_*.  So no I don't think we 
need to add anything to the todo list.

Ira



Re: [PATCH] IB/hfi: Properly set permissions for user device files

2015-09-17 Thread Jason Gunthorpe
On Thu, Sep 17, 2015 at 06:18:15PM +0200, Michal Schmidt wrote:
> On 09/16/2015 11:41 PM, ira.we...@intel.com wrote:
> > @@ -125,7 +151,20 @@ int __init dev_init(void)
> > ret = PTR_ERR(class);
> > pr_err("Could not create device class (err %d)\n", -ret);
> > unregister_chrdev_region(hfi1_dev, HFI1_NMINORS);
> > +   goto done;
> > }
> > +   class->devnode = hfi1_devnode;
> > +
> > +   user_class = class_create(THIS_MODULE, class_name_user());
> > +   if (IS_ERR(user_class)) {
> > +   ret = PTR_ERR(user_class);
> > +   pr_err("Could not create device class for user accisble files 
> > (err %d)\n",
>
> Typo in error message.

And what is the deal with all these pr_err's? This is a driver, it
needs to use dev_err and related always. Does that need to go in the
todo list?

I'm also skeptical we need a print on every error case :|

Jason


Re: [PATCH] IB/hfi: Properly set permissions for user device files

2015-09-17 Thread Michal Schmidt
On 09/16/2015 11:41 PM, ira.we...@intel.com wrote:
> @@ -125,7 +151,20 @@ int __init dev_init(void)
>   ret = PTR_ERR(class);
>   pr_err("Could not create device class (err %d)\n", -ret);
>   unregister_chrdev_region(hfi1_dev, HFI1_NMINORS);
> + goto done;
>   }
> + class->devnode = hfi1_devnode;
> +
> + user_class = class_create(THIS_MODULE, class_name_user());
> + if (IS_ERR(user_class)) {
> + ret = PTR_ERR(user_class);
> + pr_err("Could not create device class for user accisble files 
> (err %d)\n",
   
Typo in error message.

> +-ret);
> + class_destroy(class);
> + class = NULL;
> + unregister_chrdev_region(hfi1_dev, HFI1_NMINORS);

Missing "goto done"? Otherwise this:

> + }
> + user_class->devnode = hfi1_user_devnode;

... will explode.

>  
>  done:
>   return ret;
> @@ -138,5 +177,10 @@ void dev_cleanup(void)
>   class = NULL;
>   }
>  
> + if (user_class) {
> + class_destroy(user_class);
> + user_class = NULL;
> + }
> +

It's actually harmless to call class_destroy(NULL). No need to check.

Michal


Re: [PATCH] IB/hfi: Properly set permissions for user device files

2015-09-17 Thread Don Dutile

On 09/16/2015 05:41 PM, ira.we...@intel.com wrote:

From: Ira Weiny 

Some of the device files are required to be user accessible for PSM while
most should remain accessible only by root.

Add a parameter to hfi1_cdev_init which controls if the user should have access
to this device which places it in a different class with the appropriate
devnode callback.

In addition set the devnode call back for the existing class to be a bit more
explicit for those permissions.

Signed-off-by: Haralanov, Mitko 
Signed-off-by: Ira Weiny 
---
  drivers/staging/rdma/hfi1/device.c   | 48 ++--
  drivers/staging/rdma/hfi1/device.h   |  3 ++-
  drivers/staging/rdma/hfi1/diag.c |  5 ++--
  drivers/staging/rdma/hfi1/file_ops.c |  9 ---
  4 files changed, 57 insertions(+), 8 deletions(-)


Can add my
Tested-by: Donald Dutile 

Verified that permissions were modified as expected, and now
work with OPA libs that failed due to previous permission settings.





Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications

2015-09-17 Thread Christoph Lameter
Could we simplify it a bit? This compiles but avoids all the
generalizations and workqueues. I had to export two new functions from
ipoib_multicast.c, though.



Subject: ipoib: Expire sendonly multicast joins on neighbor expiration

Add mcast_leave functionality to __ipoib_reap_neighbor.

Based on Erez's work.

Signed-off-by: Christoph Lameter 

Index: linux/drivers/infiniband/ulp/ipoib/ipoib_main.c
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_main.c2015-09-09 
13:14:03.412350354 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_main.c 2015-09-17 
09:34:03.169844055 -0500
@@ -1149,6 +1149,8 @@ static void __ipoib_reap_neigh(struct ip
unsigned long dt;
unsigned long flags;
int i;
+   LIST_HEAD(remove_list);
+   struct ipoib_mcast *mcast, *tmcast;

if (test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
return;
@@ -1176,6 +1178,18 @@ static void __ipoib_reap_neigh(struct ip
  
lockdep_is_held(&priv->lock))) != NULL) {
/* was the neigh idle for two GC periods */
if (time_after(neigh_obsolete, neigh->alive)) {
+
+   /* Is this multicast ? */
+   if (neigh->daddr[4] == 0xff) {
+   mcast = __ipoib_mcast_find(priv->dev, 
neigh->daddr + 4);
+
+   if (mcast && 
test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
+   list_del(&mcast->list);
+   rb_erase(&mcast->rb_node, 
&priv->multicast_tree);
+   list_add_tail(&mcast->list, 
&remove_list);
+   }
+   }
+
rcu_assign_pointer(*np,
   
rcu_dereference_protected(neigh->hnext,
 
lockdep_is_held(&priv->lock)));
@@ -1191,6 +1205,8 @@ static void __ipoib_reap_neigh(struct ip

 out_unlock:
spin_unlock_irqrestore(&priv->lock, flags);
+   list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
+   ipoib_mcast_leave(priv->dev, mcast);
 }

 static void ipoib_reap_neigh(struct work_struct *work)
Index: linux/drivers/infiniband/ulp/ipoib/ipoib.h
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib.h 2015-09-09 
13:14:03.412350354 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib.h  2015-09-17 09:36:17.342455845 
-0500
@@ -548,6 +548,8 @@ void ipoib_path_iter_read(struct ipoib_p

 int ipoib_mcast_attach(struct net_device *dev, u16 mlid,
   union ib_gid *mgid, int set_qkey);
+int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast);
+struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid);

 int ipoib_init_qp(struct net_device *dev);
 int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca);
Index: linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===
--- linux.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c   2015-09-09 
13:14:03.412350354 -0500
+++ linux/drivers/infiniband/ulp/ipoib/ipoib_multicast.c2015-09-17 
09:36:55.305497262 -0500
@@ -153,7 +153,7 @@ static struct ipoib_mcast *ipoib_mcast_a
return mcast;
 }

-static struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void 
*mgid)
+struct ipoib_mcast *__ipoib_mcast_find(struct net_device *dev, void *mgid)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
struct rb_node *n = priv->multicast_tree.rb_node;
@@ -675,7 +675,7 @@ int ipoib_mcast_stop_thread(struct net_d
return 0;
 }

-static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
+int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
int ret = 0;


Re: [PATCH for-next 7/9] IB/cma: Add configfs for rdma_cm

2015-09-17 Thread Matan Barak



On 8/13/2015 7:03 PM, Matan Barak wrote:

Users would like to control the behaviour of rdma_cm.
For example, old applications which don't set the
required RoCE gid type could be executed on RoCE V2
network types. In order to support this configuration,
we implement a configfs for rdma_cm.

In order to use the configfs, one needs to mount it and
mkdir  inside rdma_cm directory.

The patch adds support for a single configuration file,
default_roce_mode. The mode can either be "IB/RoCE v1" or
"RoCE v2".

Signed-off-by: Matan Barak 
---
  drivers/infiniband/Kconfig |   9 +
  drivers/infiniband/core/Makefile   |   2 +
  drivers/infiniband/core/cache.c|  24 +++
  drivers/infiniband/core/cma.c  |  95 -
  drivers/infiniband/core/cma_configfs.c | 353 +
  drivers/infiniband/core/core_priv.h|  24 +++
  6 files changed, 503 insertions(+), 4 deletions(-)
  create mode 100644 drivers/infiniband/core/cma_configfs.c

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index da4c697..9ee82a2 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -54,6 +54,15 @@ config INFINIBAND_ADDR_TRANS
depends on INFINIBAND
default y

+config INFINIBAND_ADDR_TRANS_CONFIGFS
+   bool
+   depends on INFINIBAND_ADDR_TRANS && CONFIGFS_FS
+   default y
+   ---help---
+ ConfigFS support for RDMA communication manager (CM).
+ This allows the user to config the default GID type that the CM
+ uses for each device, when initiaing new connections.
+
  source "drivers/infiniband/hw/mthca/Kconfig"
  source "drivers/infiniband/hw/qib/Kconfig"
  source "drivers/infiniband/hw/ehca/Kconfig"
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index d43a899..7922fa7 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -24,6 +24,8 @@ iw_cm-y :=iwcm.o iwpm_util.o iwpm_msg.o

  rdma_cm-y :=  cma.o

+rdma_cm-$(CONFIG_INFINIBAND_ADDR_TRANS_CONFIGFS) += cma_configfs.o
+
  rdma_ucm-y := ucma.o

  ib_addr-y :=  addr.o
diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index ddd0406..66090ce 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -127,6 +127,30 @@ const char *ib_cache_gid_type_str(enum ib_gid_type 
gid_type)
  }
  EXPORT_SYMBOL(ib_cache_gid_type_str);

+int ib_cache_gid_parse_type_str(const char *buf)
+{
+   unsigned int i;
+   size_t len;
+   int err = -EINVAL;
+
+   len = strlen(buf);
+   if (len == 0)
+   return -EINVAL;
+
+   if (buf[len - 1] == '\n')
+   len--;
+
+   for (i = 0; i < ARRAY_SIZE(gid_type_str); ++i)
+   if (gid_type_str[i] && !strncmp(buf, gid_type_str[i], len) &&
+   len == strlen(gid_type_str[i])) {
+   err = i;
+   break;
+   }
+
+   return err;
+}
+EXPORT_SYMBOL(ib_cache_gid_parse_type_str);
+
  static int write_gid(struct ib_device *ib_dev, u8 port,
 struct ib_gid_table *table, int ix,
 const union ib_gid *gid,
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 22003dd..e4f4d23 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -121,6 +121,7 @@ struct cma_device {
struct completion   comp;
atomic_trefcount;
struct list_headid_list;
+   enum ib_gid_type*default_gid_type;
  };

  struct rdma_bind_list {
@@ -138,6 +139,62 @@ void cma_ref_dev(struct cma_device *cma_dev)
atomic_inc(&cma_dev->refcount);
  }

+struct cma_device *cma_enum_devices_by_ibdev(cma_device_filter filter,
+void   *cookie)
+{
+   struct cma_device *cma_dev;
+   struct cma_device *found_cma_dev = NULL;
+
+   mutex_lock(&lock);
+
+   list_for_each_entry(cma_dev, &dev_list, list)
+   if (filter(cma_dev->device, cookie)) {
+   found_cma_dev = cma_dev;
+   break;
+   }
+
+   if (found_cma_dev)
+   cma_ref_dev(found_cma_dev);
+   mutex_unlock(&lock);
+   return found_cma_dev;
+}
+
+int cma_get_default_gid_type(struct cma_device *cma_dev,
+unsigned int port)
+{
+   if (port < rdma_start_port(cma_dev->device) ||
+   port > rdma_end_port(cma_dev->device))
+   return -EINVAL;
+
+   return cma_dev->default_gid_type[port - 
rdma_start_port(cma_dev->device)];
+}
+
+int cma_set_default_gid_type(struct cma_device *cma_dev,
+unsigned int port,
+enum ib_gid_type default_gid_type)
+{
+   unsigned long supported_gids;
+
+   if (port < rdma_st

RE: [PATCH] IB/hfi: Properly set permissions for user device files

2015-09-17 Thread Marciniszyn, Mike
> Subject: [PATCH] IB/hfi: Properly set permissions for user device files
> 
> From: Ira Weiny 
> Signed-off-by: Haralanov, Mitko 
> Signed-off-by: Ira Weiny 

Acked-by: Mike Marciniszyn 


Re: [PATCH rdma-rc 2/2] IB/ipoib: Add cleanup to sendonly multicast objects

2015-09-17 Thread Or Gerlitz
On Thu, Sep 17, 2015 at 1:38 PM, Or Gerlitz  wrote:
> From: Erez Shitrit 
>
> Sendonly multicast group entries are potentially created by the driver during
> the xmit flow. Their objects remain in the driver memory, plus the related 
> group
> existing in the SM and the fabric till the driver goes down, even if no one
> uses that multicast entry anymore.
>
> Since this is sendonly, they are also not part of the kernel decvice multicast
> list and hence invocation of the set_rx_mode ndo will not cleam them up 
> either.

oops, Doug, I see a few typos here... need to s/decvice/device/ and s/cleam/clean/
-- just in case you are going to pick V0


Re: [PATCH for-4.3] IB/ipoib: add module option for auto-creating mcast groups

2015-09-17 Thread Or Gerlitz

On 9/17/2015 3:48 AM, Christoph Lameter wrote:

On Wed, 16 Sep 2015, Or Gerlitz wrote:


Could you please post here a short (say 2-4 line) summary of what is
still missing or done wrong in 4.3-rc1, and what your suggestion is for
resolving that.

With Doug's patch here, the only thing left to be done is to properly
leave the multicast group. And it seems that Erez's patch does just that.


sent it out now


And then there are the 20 other things that I have pending with Mellanox
but those are different issues that do not belong here. This one is a
critical bug for us.



so 19 left?

Or.


[PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications

2015-09-17 Thread Or Gerlitz
Patches from Erez, to be used for cleaning up send-only objects and multicast 
group SM registrations.

Or.

Erez Shitrit (2):
  IB/ipoib: Add mechanism for ipoib neigh state change notifications
  IB/ipoib: Add cleanup to sendonly multicast objects

 drivers/infiniband/ulp/ipoib/ipoib.h   | 17 +
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |  4 ++
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 87 ++
 3 files changed, 108 insertions(+)

-- 
2.3.7



[PATCH rdma-rc 1/2] IB/ipoib: Add mechanism for ipoib neigh state change notifications

2015-09-17 Thread Or Gerlitz
From: Erez Shitrit 

Add a callback function to the ipoib_neigh struct so that the object
holding the neigh can be informed of the neigh's current state.

Each neigh object is kept by one and only one owner (either a
path_record or a multicast object); that owner can now act on changes
in the neigh's state.

The callback must pay attention to the context it runs in and act
within its limitations; for example, in the neigh reap flow the
callback is invoked with a spinlock held.
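
A minimal sketch of how an owner object could use this (the owner type
and cleanup helper are made-up names; patch 2/2 adds the real multicast
user):

static int owner_neigh_state_cb(struct ipoib_dev_priv *priv,
				enum ipoib_neigh_state state, void *context)
{
	struct my_owner *owner = context;

	/* called from the reaper with priv->lock held; must not sleep */
	if (state == IPOIB_NEIGH_REMOVED)
		queue_owner_cleanup(owner);
	return 0;
}

	/* when the owner creates or adopts the neigh: */
	neigh->state_callback = owner_neigh_state_cb;
	neigh->context = owner;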

Signed-off-by: Erez Shitrit 
Signed-off-by: Or Gerlitz 
---
 drivers/infiniband/ulp/ipoib/ipoib.h  | 11 +++
 drivers/infiniband/ulp/ipoib/ipoib_main.c |  4 
 2 files changed, 15 insertions(+)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index ca28736..5b719e2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -417,6 +417,14 @@ struct ipoib_path {
int   valid;
 };
 
+enum ipoib_neigh_state {
+   IPOIB_NEIGH_CREATED,
+   IPOIB_NEIGH_REMOVED,
+};
+
+typedef int (*state_callback_fn)(struct ipoib_dev_priv *priv,
+enum ipoib_neigh_state state, void *context);
+
 struct ipoib_neigh {
struct ipoib_ah*ah;
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
@@ -432,6 +440,9 @@ struct ipoib_neigh {
struct rcu_head rcu;
atomic_trefcnt;
unsigned long   alive;
+   /* add the ability to notify the objects that hold that neigh */
+   state_callback_fn state_callback;
+   void *context;
 };
 
 #define IPOIB_UD_MTU(ib_mtu)   (ib_mtu - IPOIB_ENCAP_LEN)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 36536ce..6176441 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1179,6 +1179,10 @@ static void __ipoib_reap_neigh(struct ipoib_dev_priv 
*priv)
rcu_assign_pointer(*np,
   
rcu_dereference_protected(neigh->hnext,
 
lockdep_is_held(&priv->lock)));
+   /* inform state if requested */
+   if (neigh->state_callback != NULL)
+   neigh->state_callback(priv, 
IPOIB_NEIGH_REMOVED, neigh->context);
+
/* remove from path/mc list */
list_del(&neigh->list);
call_rcu(&neigh->rcu, ipoib_neigh_reclaim);
-- 
2.3.7



[PATCH rdma-rc 2/2] IB/ipoib: Add cleanup to sendonly multicast objects

2015-09-17 Thread Or Gerlitz
From: Erez Shitrit 

Sendonly multicast group entries are potentially created by the driver during
the xmit flow. Their objects remain in the driver memory, plus the related group
existing in the SM and the fabric till the driver goes down, even if no one
uses that multicast entry anymore.

Since this is sendonly, they are also not part of the kernel decvice multicast
list and hence invocation of the set_rx_mode ndo will not cleam them up either.

Each multicast entry has at least one neigh object, hence we can clean the
sendonly mcast object / leave the group by using the existing neigh notification
mechanism initiated from __ipoib_reap_neigh().
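
The hookup on the send side then amounts to pointing the neigh at the
mcast when the sendonly entry is created; the snippet below is a sketch
only, the exact placement belongs in the xmit path (ipoib_mcast_send()):

	/* wire the neigh created for a sendonly mcast to the handler
	 * below, so reaping the neigh queues ipoib_sendonly_free_work() */
	neigh->state_callback = handle_neigh_state_change;
	neigh->context = mcast;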

Signed-off-by: Erez Shitrit 
Signed-off-by: Or Gerlitz 
---
 drivers/infiniband/ulp/ipoib/ipoib.h   |  6 ++
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 87 ++
 2 files changed, 93 insertions(+)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 5b719e2..7cbd7d1 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -417,6 +417,12 @@ struct ipoib_path {
int   valid;
 };
 
+struct ipoib_free_sendonly_task {
+   struct work_struct work;
+   struct ipoib_mcast *mcast;
+   struct ipoib_dev_priv *priv;
+};
+
 enum ipoib_neigh_state {
IPOIB_NEIGH_CREATED,
IPOIB_NEIGH_REMOVED,
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 09a1748..e3d035e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -702,6 +702,91 @@ static int ipoib_mcast_leave(struct net_device *dev, 
struct ipoib_mcast *mcast)
return 0;
 }
 
+/* leave / free sendonly mcast */
+static void ipoib_sendonly_free_work(struct work_struct *work)
+{
+   unsigned long flags;
+   struct ipoib_mcast *tmcast;
+   bool found = false;
+   struct ipoib_free_sendonly_task *so_work =
+   container_of(work, struct ipoib_free_sendonly_task, work);
+   struct ipoib_mcast *mcast = so_work->mcast;
+   struct ipoib_dev_priv *priv = so_work->priv;
+
+   spin_lock_irqsave(&priv->lock, flags);
+   /*
+* check the mcast is still in the list.
+* make sure we are not racing against ipoib_mcast_dev_flush
+*/
+   list_for_each_entry(tmcast, &priv->multicast_list, list)
+   if (!memcmp(tmcast->mcmember.mgid.raw,
+   mcast->mcmember.mgid.raw,
+   sizeof(union ib_gid)))
+   found = true;
+
+   if (!found) {
+   pr_info("%s mcast: %pI6 already removed\n", __func__,
+   mcast->mcmember.mgid.raw);
+   spin_unlock(&priv->lock);
+   local_irq_restore(flags);
+   goto out;
+   }
+
+   /* delete from multicast_list and rb_tree */
+   rb_erase(&mcast->rb_node, &priv->multicast_tree);
+   list_del(&mcast->list);
+
+   spin_unlock_irqrestore(&priv->lock, flags);
+
+   /*
+* make sure the in-flight joins have finished before we attempt
+* to leave
+*/
+   if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
+   wait_for_completion(&mcast->done);
+
+   ipoib_mcast_leave(mcast->dev, mcast);
+   ipoib_mcast_free(mcast);
+
+out:
+   kfree(so_work);
+}
+
+/* get notification from the neigh that connected to mcast on its state */
+static int handle_neigh_state_change(struct ipoib_dev_priv *priv,
+enum ipoib_neigh_state state, void 
*context)
+{
+   struct ipoib_mcast *mcast = context;
+
+   switch (state) {
+   case IPOIB_NEIGH_REMOVED:
+   /* In sendonly the kernel doesn't clean mcast groups, so we use
+* the gc mechanism of the neigh that connected to that mcast in
+* order to clean them
+*/
+   if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
+   struct ipoib_free_sendonly_task *sendonly_mcast_work;
+
+   sendonly_mcast_work = 
kzalloc(sizeof(*sendonly_mcast_work), GFP_KERNEL);
+   if (!sendonly_mcast_work)
+   return -ENOMEM;
+
+   INIT_WORK(&sendonly_mcast_work->work,
+ ipoib_sendonly_free_work);
+   sendonly_mcast_work->mcast = mcast;
+   sendonly_mcast_work->priv = priv;
+   queue_work(priv->wq, &sendonly_mcast_work->work);
+   }
+   break;
+   default:
+   pr_info("%s doesn't handle state %d for mcast: %pI6\n",
+   __func__, state, mcast->mcmember.mgid.raw);
+   break;
+   }
+
+   return 0;
+}
+
 void ipoib_mcast_send(struct net_device *dev, u8 *d

[PATCH v1 21/24] IB/qib: Remove old FRWR API

2015-09-17 Thread Sagi Grimberg
No ULP uses it anymore, go ahead and remove it.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/qib/qib_keys.c  | 56 ---
 drivers/infiniband/hw/qib/qib_mr.c| 32 +---
 drivers/infiniband/hw/qib/qib_verbs.c |  8 -
 drivers/infiniband/hw/qib/qib_verbs.h |  7 -
 4 files changed, 1 insertion(+), 102 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_keys.c 
b/drivers/infiniband/hw/qib/qib_keys.c
index a5057efc7faf..3f61bda77e0e 100644
--- a/drivers/infiniband/hw/qib/qib_keys.c
+++ b/drivers/infiniband/hw/qib/qib_keys.c
@@ -338,62 +338,6 @@ bail:
 /*
  * Initialize the memory region specified by the work reqeust.
  */
-int qib_fast_reg_mr(struct qib_qp *qp, struct ib_send_wr *send_wr)
-{
-   struct ib_fast_reg_wr *wr = fast_reg_wr(send_wr);
-   struct qib_lkey_table *rkt = &to_idev(qp->ibqp.device)->lk_table;
-   struct qib_pd *pd = to_ipd(qp->ibqp.pd);
-   struct qib_mregion *mr;
-   u32 rkey = wr->rkey;
-   unsigned i, n, m;
-   int ret = -EINVAL;
-   unsigned long flags;
-   u64 *page_list;
-   size_t ps;
-
-   spin_lock_irqsave(&rkt->lock, flags);
-   if (pd->user || rkey == 0)
-   goto bail;
-
-   mr = rcu_dereference_protected(
-   rkt->table[(rkey >> (32 - ib_qib_lkey_table_size))],
-   lockdep_is_held(&rkt->lock));
-   if (unlikely(mr == NULL || qp->ibqp.pd != mr->pd))
-   goto bail;
-
-   if (wr->page_list_len > mr->max_segs)
-   goto bail;
-
-   ps = 1UL << wr->page_shift;
-   if (wr->length > ps * wr->page_list_len)
-   goto bail;
-
-   mr->user_base = wr->iova_start;
-   mr->iova = wr->iova_start;
-   mr->lkey = rkey;
-   mr->length = wr->length;
-   mr->access_flags = wr->access_flags;
-   page_list = wr->page_list->page_list;
-   m = 0;
-   n = 0;
-   for (i = 0; i < wr->page_list_len; i++) {
-   mr->map[m]->segs[n].vaddr = (void *) page_list[i];
-   mr->map[m]->segs[n].length = ps;
-   if (++n == QIB_SEGSZ) {
-   m++;
-   n = 0;
-   }
-   }
-
-   ret = 0;
-bail:
-   spin_unlock_irqrestore(&rkt->lock, flags);
-   return ret;
-}
-
-/*
- * Initialize the memory region specified by the work reqeust.
- */
 int qib_reg_mr(struct qib_qp *qp, struct ib_reg_wr *wr)
 {
struct qib_lkey_table *rkt = &to_idev(qp->ibqp.device)->lk_table;
diff --git a/drivers/infiniband/hw/qib/qib_mr.c 
b/drivers/infiniband/hw/qib/qib_mr.c
index 0fa4b0de8074..73f78c0f9522 100644
--- a/drivers/infiniband/hw/qib/qib_mr.c
+++ b/drivers/infiniband/hw/qib/qib_mr.c
@@ -324,7 +324,7 @@ out:
 
 /*
  * Allocate a memory region usable with the
- * IB_WR_FAST_REG_MR send work request.
+ * IB_WR_REG_MR send work request.
  *
  * Return the memory region on success, otherwise return an errno.
  */
@@ -375,36 +375,6 @@ int qib_map_mr_sg(struct ib_mr *ibmr,
return ib_sg_to_pages(ibmr, sg, sg_nents, qib_set_page);
 }
 
-struct ib_fast_reg_page_list *
-qib_alloc_fast_reg_page_list(struct ib_device *ibdev, int page_list_len)
-{
-   unsigned size = page_list_len * sizeof(u64);
-   struct ib_fast_reg_page_list *pl;
-
-   if (size > PAGE_SIZE)
-   return ERR_PTR(-EINVAL);
-
-   pl = kzalloc(sizeof(*pl), GFP_KERNEL);
-   if (!pl)
-   return ERR_PTR(-ENOMEM);
-
-   pl->page_list = kzalloc(size, GFP_KERNEL);
-   if (!pl->page_list)
-   goto err_free;
-
-   return pl;
-
-err_free:
-   kfree(pl);
-   return ERR_PTR(-ENOMEM);
-}
-
-void qib_free_fast_reg_page_list(struct ib_fast_reg_page_list *pl)
-{
-   kfree(pl->page_list);
-   kfree(pl);
-}
-
 /**
  * qib_alloc_fmr - allocate a fast memory region
  * @pd: the protection domain for this memory region
diff --git a/drivers/infiniband/hw/qib/qib_verbs.c 
b/drivers/infiniband/hw/qib/qib_verbs.c
index a1e53d7b662b..de6cb6fcda8d 100644
--- a/drivers/infiniband/hw/qib/qib_verbs.c
+++ b/drivers/infiniband/hw/qib/qib_verbs.c
@@ -365,9 +365,6 @@ static int qib_post_one_send(struct qib_qp *qp, struct 
ib_send_wr *wr,
if (wr->opcode == IB_WR_REG_MR) {
if (qib_reg_mr(qp, reg_wr(wr)))
goto bail_inval;
-   } else if (wr->opcode == IB_WR_FAST_REG_MR) {
-   if (qib_fast_reg_mr(qp, wr))
-   goto bail_inval;
} else if (qp->ibqp.qp_type == IB_QPT_UC) {
if ((unsigned) wr->opcode >= IB_WR_RDMA_READ)
goto bail_inval;
@@ -407,9 +404,6 @@ static int qib_post_one_send(struct qib_qp *qp, struct 
ib_send_wr *wr,
else if (wr->opcode == IB_WR_REG_MR)
memcpy(&wqe->reg_wr, reg_wr(wr),
sizeof(wqe->reg_wr));
-   else if (wr->opcode == IB_WR_FAST_REG_MR)
-   memcpy(&wqe->fast_reg_wr, fast_reg_wr(wr

[PATCH v1 24/24] IB/core: Remove old fast registration API

2015-09-17 Thread Sagi Grimberg
No callers and no providers left, go ahead and remove it.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/core/verbs.c | 25 
 include/rdma/ib_verbs.h | 52 -
 2 files changed, 77 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index d99f57f1f737..bbbfd597f060 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1253,31 +1253,6 @@ struct ib_mr *ib_alloc_mr(struct ib_pd *pd,
 }
 EXPORT_SYMBOL(ib_alloc_mr);
 
-struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list(struct ib_device 
*device,
- int max_page_list_len)
-{
-   struct ib_fast_reg_page_list *page_list;
-
-   if (!device->alloc_fast_reg_page_list)
-   return ERR_PTR(-ENOSYS);
-
-   page_list = device->alloc_fast_reg_page_list(device, max_page_list_len);
-
-   if (!IS_ERR(page_list)) {
-   page_list->device = device;
-   page_list->max_page_list_len = max_page_list_len;
-   }
-
-   return page_list;
-}
-EXPORT_SYMBOL(ib_alloc_fast_reg_page_list);
-
-void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list)
-{
-   page_list->device->free_fast_reg_page_list(page_list);
-}
-EXPORT_SYMBOL(ib_free_fast_reg_page_list);
-
 /* Memory windows */
 
 struct ib_mw *ib_alloc_mw(struct ib_pd *pd, enum ib_mw_type type)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 97c73359ade8..ed3f181407ff 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1068,12 +1068,6 @@ struct ib_sge {
u32 lkey;
 };
 
-struct ib_fast_reg_page_list {
-   struct ib_device   *device;
-   u64*page_list;
-   unsigned intmax_page_list_len;
-};
-
 /**
  * struct ib_mw_bind_info - Parameters for a memory window bind operation.
  * @mr: A memory region to bind the memory window to.
@@ -1147,22 +1141,6 @@ static inline struct ib_ud_wr *ud_wr(struct ib_send_wr 
*wr)
return container_of(wr, struct ib_ud_wr, wr);
 }
 
-struct ib_fast_reg_wr {
-   struct ib_send_wr   wr;
-   u64 iova_start;
-   struct ib_fast_reg_page_list *page_list;
-   unsigned intpage_shift;
-   unsigned intpage_list_len;
-   u32 length;
-   int access_flags;
-   u32 rkey;
-};
-
-static inline struct ib_fast_reg_wr *fast_reg_wr(struct ib_send_wr *wr)
-{
-   return container_of(wr, struct ib_fast_reg_wr, wr);
-}
-
 struct ib_reg_wr {
struct ib_send_wr   wr;
struct ib_mr*mr;
@@ -1777,9 +1755,6 @@ struct ib_device {
int(*map_mr_sg)(struct ib_mr *mr,
struct scatterlist *sg,
unsigned int sg_nents);
-   struct ib_fast_reg_page_list * (*alloc_fast_reg_page_list)(struct 
ib_device *device,
-  int 
page_list_len);
-   void   (*free_fast_reg_page_list)(struct 
ib_fast_reg_page_list *page_list);
int(*rereg_phys_mr)(struct ib_mr *mr,
int mr_rereg_mask,
struct ib_pd *pd,
@@ -2888,33 +2863,6 @@ struct ib_mr *ib_alloc_mr(struct ib_pd *pd,
  u32 max_num_sg);
 
 /**
- * ib_alloc_fast_reg_page_list - Allocates a page list array
- * @device - ib device pointer.
- * @page_list_len - size of the page list array to be allocated.
- *
- * This allocates and returns a struct ib_fast_reg_page_list * and a
- * page_list array that is at least page_list_len in size.  The actual
- * size is returned in max_page_list_len.  The caller is responsible
- * for initializing the contents of the page_list array before posting
- * a send work request with the IB_WC_FAST_REG_MR opcode.
- *
- * The page_list array entries must be translated using one of the
- * ib_dma_*() functions just like the addresses passed to
- * ib_map_phys_fmr().  Once the ib_post_send() is issued, the struct
- * ib_fast_reg_page_list must not be modified by the caller until the
- * IB_WC_FAST_REG_MR work request completes.
- */
-struct ib_fast_reg_page_list *ib_alloc_fast_reg_page_list(
-   struct ib_device *device, int page_list_len);
-
-/**
- * ib_free_fast_reg_page_list - Deallocates a previously allocated
- *   page list array.
- * @page_list - struct ib_fast_reg_page_list pointer to be deallocated.
- */
-void ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list);
-
-/**
  * ib_update_fast_reg_key - updates the key portion of the fast_reg MR
  *   R_Key and L_Key.
  * @mr - struct ib_mr pointer to be updated.
-- 
1.

[PATCH v1 17/24] IB/mlx4: Remove old FRWR API support

2015-09-17 Thread Sagi Grimberg
No ULP uses it anymore, go ahead and remove it.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/mlx4/cq.c  |  3 +--
 drivers/infiniband/hw/mlx4/main.c|  2 --
 drivers/infiniband/hw/mlx4/mlx4_ib.h | 15 ---
 drivers/infiniband/hw/mlx4/mr.c  | 48 
 drivers/infiniband/hw/mlx4/qp.c  | 31 ---
 5 files changed, 1 insertion(+), 98 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index b62236e24708..84ff03618e31 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -818,8 +818,7 @@ repoll:
wc->opcode= IB_WC_LSO;
break;
case MLX4_OPCODE_FMR:
-   wc->opcode= IB_WC_FAST_REG_MR;
-   /* TODO: wc->opcode= IB_WC_REG_MR; */
+   wc->opcode= IB_WC_REG_MR;
break;
case MLX4_OPCODE_LOCAL_INVAL:
wc->opcode= IB_WC_LOCAL_INV;
diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index bb82f5fa1612..a25048bf9913 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -2248,8 +2248,6 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
ibdev->ib_dev.dereg_mr  = mlx4_ib_dereg_mr;
ibdev->ib_dev.alloc_mr  = mlx4_ib_alloc_mr;
ibdev->ib_dev.map_mr_sg = mlx4_ib_map_mr_sg;
-   ibdev->ib_dev.alloc_fast_reg_page_list = 
mlx4_ib_alloc_fast_reg_page_list;
-   ibdev->ib_dev.free_fast_reg_page_list  = 
mlx4_ib_free_fast_reg_page_list;
ibdev->ib_dev.attach_mcast  = mlx4_ib_mcg_attach;
ibdev->ib_dev.detach_mcast  = mlx4_ib_mcg_detach;
ibdev->ib_dev.process_mad   = mlx4_ib_process_mad;
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h 
b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 07fcf3a49256..de6eab38b024 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -144,12 +144,6 @@ struct mlx4_ib_mw {
struct mlx4_mw  mmw;
 };
 
-struct mlx4_ib_fast_reg_page_list {
-   struct ib_fast_reg_page_listibfrpl;
-   __be64 *mapped_page_list;
-   dma_addr_t  map;
-};
-
 struct mlx4_ib_fmr {
struct ib_fmr   ibfmr;
struct mlx4_fmr mfmr;
@@ -642,11 +636,6 @@ static inline struct mlx4_ib_mw *to_mmw(struct ib_mw *ibmw)
return container_of(ibmw, struct mlx4_ib_mw, ibmw);
 }
 
-static inline struct mlx4_ib_fast_reg_page_list *to_mfrpl(struct 
ib_fast_reg_page_list *ibfrpl)
-{
-   return container_of(ibfrpl, struct mlx4_ib_fast_reg_page_list, ibfrpl);
-}
-
 static inline struct mlx4_ib_fmr *to_mfmr(struct ib_fmr *ibfmr)
 {
return container_of(ibfmr, struct mlx4_ib_fmr, ibfmr);
@@ -713,10 +702,6 @@ struct ib_mr *mlx4_ib_alloc_mr(struct ib_pd *pd,
 int mlx4_ib_map_mr_sg(struct ib_mr *ibmr,
  struct scatterlist *sg,
  unsigned int sg_nents);
-struct ib_fast_reg_page_list *mlx4_ib_alloc_fast_reg_page_list(struct 
ib_device *ibdev,
-  int 
page_list_len);
-void mlx4_ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list);
-
 int mlx4_ib_modify_cq(struct ib_cq *cq, u16 cq_count, u16 cq_period);
 int mlx4_ib_resize_cq(struct ib_cq *ibcq, int entries, struct ib_udata *udata);
 struct ib_cq *mlx4_ib_create_cq(struct ib_device *ibdev,
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 6ed745798ad3..dc255dc4548d 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -425,54 +425,6 @@ err_free:
return ERR_PTR(err);
 }
 
-struct ib_fast_reg_page_list *mlx4_ib_alloc_fast_reg_page_list(struct 
ib_device *ibdev,
-  int 
page_list_len)
-{
-   struct mlx4_ib_dev *dev = to_mdev(ibdev);
-   struct mlx4_ib_fast_reg_page_list *mfrpl;
-   int size = page_list_len * sizeof (u64);
-
-   if (page_list_len > MLX4_MAX_FAST_REG_PAGES)
-   return ERR_PTR(-EINVAL);
-
-   mfrpl = kmalloc(sizeof *mfrpl, GFP_KERNEL);
-   if (!mfrpl)
-   return ERR_PTR(-ENOMEM);
-
-   mfrpl->ibfrpl.page_list = kmalloc(size, GFP_KERNEL);
-   if (!mfrpl->ibfrpl.page_list)
-   goto err_free;
-
-   mfrpl->mapped_page_list = dma_alloc_coherent(&dev->dev->persist->
-pdev->dev,
-size, &mfrpl->map,
-GFP_KERNEL);
-   if (!mfrpl->mapped_page_list)
-   goto err_free;
-
-   WARN_ON(mfrpl->map & 0x3f);
-
-   return &mfrpl->ibfrpl;
-
-err_free:
-   kfree(mfrpl-

[PATCH v1 23/24] IB/hfi1: Remove old fast registration API support

2015-09-17 Thread Sagi Grimberg
It wasn't supported to begin with (post_send
returned -EINVAL for IB_WR_FAST_REG_MR). Perhaps
the new API will be adopted jointly by all IB SW
implementations.

Signed-off-by: Sagi Grimberg 
---
 drivers/staging/hfi1/keys.c  | 55 
 drivers/staging/hfi1/mr.c| 32 +-
 drivers/staging/hfi1/verbs.c |  9 +---
 drivers/staging/hfi1/verbs.h |  8 ---
 4 files changed, 2 insertions(+), 102 deletions(-)

diff --git a/drivers/staging/hfi1/keys.c b/drivers/staging/hfi1/keys.c
index 82c21b1c0263..cb4e6087dfdb 100644
--- a/drivers/staging/hfi1/keys.c
+++ b/drivers/staging/hfi1/keys.c
@@ -354,58 +354,3 @@ bail:
rcu_read_unlock();
return 0;
 }
-
-/*
- * Initialize the memory region specified by the work request.
- */
-int hfi1_fast_reg_mr(struct hfi1_qp *qp, struct ib_fast_reg_wr *wr)
-{
-   struct hfi1_lkey_table *rkt = &to_idev(qp->ibqp.device)->lk_table;
-   struct hfi1_pd *pd = to_ipd(qp->ibqp.pd);
-   struct hfi1_mregion *mr;
-   u32 rkey = wr->rkey;
-   unsigned i, n, m;
-   int ret = -EINVAL;
-   unsigned long flags;
-   u64 *page_list;
-   size_t ps;
-
-   spin_lock_irqsave(&rkt->lock, flags);
-   if (pd->user || rkey == 0)
-   goto bail;
-
-   mr = rcu_dereference_protected(
-   rkt->table[(rkey >> (32 - hfi1_lkey_table_size))],
-   lockdep_is_held(&rkt->lock));
-   if (unlikely(mr == NULL || qp->ibqp.pd != mr->pd))
-   goto bail;
-
-   if (wr->page_list_len > mr->max_segs)
-   goto bail;
-
-   ps = 1UL << wr->page_shift;
-   if (wr->length > ps * wr->page_list_len)
-   goto bail;
-
-   mr->user_base = wr->iova_start;
-   mr->iova = wr->iova_start;
-   mr->lkey = rkey;
-   mr->length = wr->length;
-   mr->access_flags = wr->access_flags;
-   page_list = wr->page_list->page_list;
-   m = 0;
-   n = 0;
-   for (i = 0; i < wr->page_list_len; i++) {
-   mr->map[m]->segs[n].vaddr = (void *) page_list[i];
-   mr->map[m]->segs[n].length = ps;
-   if (++n == HFI1_SEGSZ) {
-   m++;
-   n = 0;
-   }
-   }
-
-   ret = 0;
-bail:
-   spin_unlock_irqrestore(&rkt->lock, flags);
-   return ret;
-}
diff --git a/drivers/staging/hfi1/mr.c b/drivers/staging/hfi1/mr.c
index bd64e4f986f9..3f5623add3df 100644
--- a/drivers/staging/hfi1/mr.c
+++ b/drivers/staging/hfi1/mr.c
@@ -344,7 +344,7 @@ out:
 
 /*
  * Allocate a memory region usable with the
- * IB_WR_FAST_REG_MR send work request.
+ * IB_WR_REG_MR send work request.
  *
  * Return the memory region on success, otherwise return an errno.
  */
@@ -364,36 +364,6 @@ struct ib_mr *hfi1_alloc_mr(struct ib_pd *pd,
return &mr->ibmr;
 }
 
-struct ib_fast_reg_page_list *
-hfi1_alloc_fast_reg_page_list(struct ib_device *ibdev, int page_list_len)
-{
-   unsigned size = page_list_len * sizeof(u64);
-   struct ib_fast_reg_page_list *pl;
-
-   if (size > PAGE_SIZE)
-   return ERR_PTR(-EINVAL);
-
-   pl = kzalloc(sizeof(*pl), GFP_KERNEL);
-   if (!pl)
-   return ERR_PTR(-ENOMEM);
-
-   pl->page_list = kzalloc(size, GFP_KERNEL);
-   if (!pl->page_list)
-   goto err_free;
-
-   return pl;
-
-err_free:
-   kfree(pl);
-   return ERR_PTR(-ENOMEM);
-}
-
-void hfi1_free_fast_reg_page_list(struct ib_fast_reg_page_list *pl)
-{
-   kfree(pl->page_list);
-   kfree(pl);
-}
-
 /**
  * hfi1_alloc_fmr - allocate a fast memory region
  * @pd: the protection domain for this memory region
diff --git a/drivers/staging/hfi1/verbs.c b/drivers/staging/hfi1/verbs.c
index 542ad803bfce..a6c9bf88a4c4 100644
--- a/drivers/staging/hfi1/verbs.c
+++ b/drivers/staging/hfi1/verbs.c
@@ -380,9 +380,7 @@ static int post_one_send(struct hfi1_qp *qp, struct 
ib_send_wr *wr)
 * undefined operations.
 * Make sure buffer is large enough to hold the result for atomics.
 */
-   if (wr->opcode == IB_WR_FAST_REG_MR) {
-   return -EINVAL;
-   } else if (qp->ibqp.qp_type == IB_QPT_UC) {
+   if (qp->ibqp.qp_type == IB_QPT_UC) {
if ((unsigned) wr->opcode >= IB_WR_RDMA_READ)
return -EINVAL;
} else if (qp->ibqp.qp_type != IB_QPT_RC) {
@@ -417,9 +415,6 @@ static int post_one_send(struct hfi1_qp *qp, struct 
ib_send_wr *wr)
if (qp->ibqp.qp_type != IB_QPT_UC &&
qp->ibqp.qp_type != IB_QPT_RC)
memcpy(&wqe->ud_wr, ud_wr(wr), sizeof(wqe->ud_wr));
-   else if (wr->opcode == IB_WR_FAST_REG_MR)
-   memcpy(&wqe->fast_reg_wr, fast_reg_wr(wr),
-   sizeof(wqe->fast_reg_wr));
else if (wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM ||
 wr->opcode == IB_WR_RDMA_WRITE ||
 wr->opcode == IB_WR_RDMA_READ)
@@ -206

[PATCH v1 18/24] RDMA/ocrdma: Remove old FRWR API

2015-09-17 Thread Sagi Grimberg
No ULP uses it anymore, go ahead and remove it.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/ocrdma/ocrdma_main.c  |   2 -
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 102 
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h |   4 --
 3 files changed, 108 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index 874beb4b07a1..9bf430ef8eb6 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -183,8 +183,6 @@ static int ocrdma_register_device(struct ocrdma_dev *dev)
 
dev->ibdev.alloc_mr = ocrdma_alloc_mr;
dev->ibdev.map_mr_sg = ocrdma_map_mr_sg;
-   dev->ibdev.alloc_fast_reg_page_list = ocrdma_alloc_frmr_page_list;
-   dev->ibdev.free_fast_reg_page_list = ocrdma_free_frmr_page_list;
 
/* mandatory to support user space verbs consumer. */
dev->ibdev.alloc_ucontext = ocrdma_alloc_ucontext;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 853746e17d5c..2deaa2ac4a1c 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -2133,41 +2133,6 @@ static void ocrdma_build_read(struct ocrdma_qp *qp, 
struct ocrdma_hdr_wqe *hdr,
ext_rw->len = hdr->total_len;
 }
 
-static void build_frmr_pbes(struct ib_fast_reg_wr *wr,
-   struct ocrdma_pbl *pbl_tbl,
-   struct ocrdma_hw_mr *hwmr)
-{
-   int i;
-   u64 buf_addr = 0;
-   int num_pbes;
-   struct ocrdma_pbe *pbe;
-
-   pbe = (struct ocrdma_pbe *)pbl_tbl->va;
-   num_pbes = 0;
-
-   /* go through the OS phy regions & fill hw pbe entries into pbls. */
-   for (i = 0; i < wr->page_list_len; i++) {
-   /* number of pbes can be more for one OS buf, when
-* buffers are of different sizes.
-* split the ib_buf to one or more pbes.
-*/
-   buf_addr = wr->page_list->page_list[i];
-   pbe->pa_lo = cpu_to_le32((u32) (buf_addr & PAGE_MASK));
-   pbe->pa_hi = cpu_to_le32((u32) upper_32_bits(buf_addr));
-   num_pbes += 1;
-   pbe++;
-
-   /* if the pbl is full storing the pbes,
-* move to next pbl.
-   */
-   if (num_pbes == (hwmr->pbl_size/sizeof(u64))) {
-   pbl_tbl++;
-   pbe = (struct ocrdma_pbe *)pbl_tbl->va;
-   }
-   }
-   return;
-}
-
 static int get_encoded_page_size(int pg_sz)
 {
/* Max size is 256M 4096 << 16 */
@@ -2233,50 +2198,6 @@ static int ocrdma_build_reg(struct ocrdma_qp *qp,
return 0;
 }
 
-static int ocrdma_build_fr(struct ocrdma_qp *qp, struct ocrdma_hdr_wqe *hdr,
-  struct ib_send_wr *send_wr)
-{
-   u64 fbo;
-   struct ib_fast_reg_wr *wr = fast_reg_wr(send_wr);
-   struct ocrdma_ewqe_fr *fast_reg = (struct ocrdma_ewqe_fr *)(hdr + 1);
-   struct ocrdma_mr *mr;
-   struct ocrdma_dev *dev = get_ocrdma_dev(qp->ibqp.device);
-   u32 wqe_size = sizeof(*fast_reg) + sizeof(*hdr);
-
-   wqe_size = roundup(wqe_size, OCRDMA_WQE_ALIGN_BYTES);
-
-   if (wr->page_list_len > dev->attr.max_pages_per_frmr)
-   return -EINVAL;
-
-   hdr->cw |= (OCRDMA_FR_MR << OCRDMA_WQE_OPCODE_SHIFT);
-   hdr->cw |= ((wqe_size / OCRDMA_WQE_STRIDE) << OCRDMA_WQE_SIZE_SHIFT);
-
-   if (wr->page_list_len == 0)
-   BUG();
-   if (wr->access_flags & IB_ACCESS_LOCAL_WRITE)
-   hdr->rsvd_lkey_flags |= OCRDMA_LKEY_FLAG_LOCAL_WR;
-   if (wr->access_flags & IB_ACCESS_REMOTE_WRITE)
-   hdr->rsvd_lkey_flags |= OCRDMA_LKEY_FLAG_REMOTE_WR;
-   if (wr->access_flags & IB_ACCESS_REMOTE_READ)
-   hdr->rsvd_lkey_flags |= OCRDMA_LKEY_FLAG_REMOTE_RD;
-   hdr->lkey = wr->rkey;
-   hdr->total_len = wr->length;
-
-   fbo = wr->iova_start - (wr->page_list->page_list[0] & PAGE_MASK);
-
-   fast_reg->va_hi = upper_32_bits(wr->iova_start);
-   fast_reg->va_lo = (u32) (wr->iova_start & 0x);
-   fast_reg->fbo_hi = upper_32_bits(fbo);
-   fast_reg->fbo_lo = (u32) fbo & 0x;
-   fast_reg->num_sges = wr->page_list_len;
-   fast_reg->size_sge =
-   get_encoded_page_size(1 << wr->page_shift);
-   mr = (struct ocrdma_mr *) (unsigned long)
-   dev->stag_arr[(hdr->lkey >> 8) & (OCRDMA_MAX_STAG - 1)];
-   build_frmr_pbes(wr, mr->hwmr.pbl_table, &mr->hwmr);
-   return 0;
-}
-
 static void ocrdma_ring_sq_db(struct ocrdma_qp *qp)
 {
u32 val = qp->sq.dbid | (1 << OCRDMA_DB_SQ_SHIFT);
@@ -2356,9 +2277,6 @@ int ocrdma_post_send(struct ib_qp *ibqp, struct 
ib_send_wr *wr,
OCRDMA_WQE_STRIDE) << OCRDMA_WQE_SIZE_SHIFT;
hdr->lk

[PATCH v1 15/24] IB/srp: Convert to new memory registration API

2015-09-17 Thread Sagi Grimberg
Since SRP supports both FMRs and FRWR, the new API conversion
includes splitting the sg list mapping routines in srp_map_data into
srp_map_sg_fr, which works with the new memory registration API,
srp_map_sg_fmr, which constructs a page vector and calls
ib_fmr_pool_map_phys, and srp_map_sg_dma, which is used only
if neither FRWR nor FMR is supported (which I'm not sure is a valid
use-case anymore).

The SRP protocol is able to pass multiple descriptors for remote
access to the target, so the code registers multiple sg list partials
until the entire sg list is mapped and registered (each call maps a
prefix of the sg list).

Note that now the per request page vector is allocated only when FMR
mode is used as it is not needed for the new registration API.
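Roughly, the prefix-at-a-time registration described above has the
following shape (illustrative sketch only, not code from this patch;
get_mr() and post_reg_wr() are hypothetical helpers standing in for the
driver's FR descriptor pool and IB_WR_REG_MR posting, and the usual
<rdma/ib_verbs.h> and <linux/scatterlist.h> includes are assumed):

static int register_sg_in_prefixes(struct ib_qp *qp, struct scatterlist *sg,
				   unsigned int sg_nents)
{
	while (sg_nents) {
		struct ib_mr *mr = get_mr();	/* hypothetical MR pool helper */
		int n;

		/* maps the largest aligned prefix of the remaining sg list */
		n = ib_map_mr_sg(mr, sg, sg_nents, PAGE_SIZE);
		if (n <= 0)
			return n < 0 ? n : -EINVAL;

		post_reg_wr(qp, mr);		/* hypothetical: posts IB_WR_REG_MR */

		/* advance past the prefix that was just registered */
		for (sg_nents -= n; n; n--)
			sg = sg_next(sg);
	}
	return 0;
}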

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/ulp/srp/ib_srp.c | 248 +---
 drivers/infiniband/ulp/srp/ib_srp.h |  11 +-
 2 files changed, 156 insertions(+), 103 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index f8b9c18da03d..35cddbb120ea 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -340,8 +340,6 @@ static void srp_destroy_fr_pool(struct srp_fr_pool *pool)
return;
 
for (i = 0, d = &pool->desc[0]; i < pool->size; i++, d++) {
-   if (d->frpl)
-   ib_free_fast_reg_page_list(d->frpl);
if (d->mr)
ib_dereg_mr(d->mr);
}
@@ -362,7 +360,6 @@ static struct srp_fr_pool *srp_create_fr_pool(struct 
ib_device *device,
struct srp_fr_pool *pool;
struct srp_fr_desc *d;
struct ib_mr *mr;
-   struct ib_fast_reg_page_list *frpl;
int i, ret = -EINVAL;
 
if (pool_size <= 0)
@@ -385,12 +382,6 @@ static struct srp_fr_pool *srp_create_fr_pool(struct 
ib_device *device,
goto destroy_pool;
}
d->mr = mr;
-   frpl = ib_alloc_fast_reg_page_list(device, max_page_list_len);
-   if (IS_ERR(frpl)) {
-   ret = PTR_ERR(frpl);
-   goto destroy_pool;
-   }
-   d->frpl = frpl;
list_add_tail(&d->entry, &pool->free_list);
}
 
@@ -887,14 +878,16 @@ static int srp_alloc_req_data(struct srp_rdma_ch *ch)
  GFP_KERNEL);
if (!mr_list)
goto out;
-   if (srp_dev->use_fast_reg)
+   if (srp_dev->use_fast_reg) {
req->fr_list = mr_list;
-   else
+   } else {
req->fmr_list = mr_list;
-   req->map_page = kmalloc(srp_dev->max_pages_per_mr *
-   sizeof(void *), GFP_KERNEL);
-   if (!req->map_page)
-   goto out;
+   req->map_page = kmalloc(srp_dev->max_pages_per_mr *
+   sizeof(void *), GFP_KERNEL);
+   if (!req->map_page)
+   goto out;
+   }
+
req->indirect_desc = kmalloc(target->indirect_size, GFP_KERNEL);
if (!req->indirect_desc)
goto out;
@@ -1283,6 +1276,15 @@ static int srp_map_finish_fmr(struct srp_map_state 
*state,
struct ib_pool_fmr *fmr;
u64 io_addr = 0;
 
+   if (state->npages == 0)
+   return 0;
+
+   if (state->npages == 1 && target->global_mr) {
+   srp_map_desc(state, state->base_dma_addr, state->dma_len,
+target->global_mr->rkey);
+   return 0;
+   }
+
if (WARN_ON_ONCE(state->fmr.next >= state->fmr.end))
return -ENOMEM;
 
@@ -1297,6 +1299,9 @@ static int srp_map_finish_fmr(struct srp_map_state *state,
srp_map_desc(state, state->base_dma_addr & ~dev->mr_page_mask,
 state->dma_len, fmr->fmr->rkey);
 
+   state->npages = 0;
+   state->dma_len = 0;
+
return 0;
 }
 
@@ -1306,9 +1311,17 @@ static int srp_map_finish_fr(struct srp_map_state *state,
struct srp_target_port *target = ch->target;
struct srp_device *dev = target->srp_host->srp_dev;
struct ib_send_wr *bad_wr;
-   struct ib_fast_reg_wr wr;
+   struct ib_reg_wr wr;
struct srp_fr_desc *desc;
u32 rkey;
+   int n, err;
+
+   if (state->sg_nents == 1 && target->global_mr) {
+   srp_map_desc(state, sg_dma_address(state->sg),
+sg_dma_len(state->sg),
+target->global_mr->rkey);
+   return 1;
+   }
 
if (WARN_ON_ONCE(state->fr.next >= state->fr.end))
return -ENOMEM;
@@ -1320,56 +1333,32 @@ static int srp_map_finish_fr(struct srp_map_state 
*state,
rkey = ib_inc_r

[PATCH v1 05/24] RDMA/ocrdma: Support the new memory registration API

2015-09-17 Thread Sagi Grimberg
Support the new memory registration API by allocating a
private page list array in ocrdma_mr and populate it when
ocrdma_map_mr_sg is invoked. Also, support IB_WR_REG_MR
by duplicating IB_WR_FAST_REG_MR, but take the needed
information from different places:
- page_size, iova, length, access flags (ib_mr)
- page array (ocrdma_mr)
- key (ib_reg_wr)

The IB_WR_FAST_REG_MR handlers will be removed later when
all the ULPs will be converted.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/ocrdma/ocrdma.h   |  2 +
 drivers/infiniband/hw/ocrdma/ocrdma_main.c  |  1 +
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 89 +
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h |  3 +
 4 files changed, 95 insertions(+)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma.h 
b/drivers/infiniband/hw/ocrdma/ocrdma.h
index b4091ab48db0..c2f3af5d5194 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma.h
@@ -193,6 +193,8 @@ struct ocrdma_mr {
struct ib_mr ibmr;
struct ib_umem *umem;
struct ocrdma_hw_mr hwmr;
+   u64 *pages;
+   u32 npages;
 };
 
 struct ocrdma_stats {
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_main.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
index 87aa55df7c82..874beb4b07a1 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_main.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_main.c
@@ -182,6 +182,7 @@ static int ocrdma_register_device(struct ocrdma_dev *dev)
dev->ibdev.reg_user_mr = ocrdma_reg_user_mr;
 
dev->ibdev.alloc_mr = ocrdma_alloc_mr;
+   dev->ibdev.map_mr_sg = ocrdma_map_mr_sg;
dev->ibdev.alloc_fast_reg_page_list = ocrdma_alloc_frmr_page_list;
dev->ibdev.free_fast_reg_page_list = ocrdma_free_frmr_page_list;
 
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index eb09e224acb9..853746e17d5c 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -1013,6 +1013,7 @@ int ocrdma_dereg_mr(struct ib_mr *ib_mr)
 
(void) ocrdma_mbx_dealloc_lkey(dev, mr->hwmr.fr_mr, mr->hwmr.lkey);
 
+   kfree(mr->pages);
ocrdma_free_mr_pbl_tbl(dev, &mr->hwmr);
 
/* it could be user registered memory. */
@@ -2177,6 +2178,60 @@ static int get_encoded_page_size(int pg_sz)
return i;
 }
 
+static int ocrdma_build_reg(struct ocrdma_qp *qp,
+   struct ocrdma_hdr_wqe *hdr,
+   struct ib_reg_wr *wr)
+{
+   u64 fbo;
+   struct ocrdma_ewqe_fr *fast_reg = (struct ocrdma_ewqe_fr *)(hdr + 1);
+   struct ocrdma_mr *mr = get_ocrdma_mr(wr->mr);
+   struct ocrdma_pbl *pbl_tbl = mr->hwmr.pbl_table;
+   struct ocrdma_pbe *pbe;
+   u32 wqe_size = sizeof(*fast_reg) + sizeof(*hdr);
+   int num_pbes = 0, i;
+
+   wqe_size = roundup(wqe_size, OCRDMA_WQE_ALIGN_BYTES);
+
+   hdr->cw |= (OCRDMA_FR_MR << OCRDMA_WQE_OPCODE_SHIFT);
+   hdr->cw |= ((wqe_size / OCRDMA_WQE_STRIDE) << OCRDMA_WQE_SIZE_SHIFT);
+
+   if (wr->access & IB_ACCESS_LOCAL_WRITE)
+   hdr->rsvd_lkey_flags |= OCRDMA_LKEY_FLAG_LOCAL_WR;
+   if (wr->access & IB_ACCESS_REMOTE_WRITE)
+   hdr->rsvd_lkey_flags |= OCRDMA_LKEY_FLAG_REMOTE_WR;
+   if (wr->access & IB_ACCESS_REMOTE_READ)
+   hdr->rsvd_lkey_flags |= OCRDMA_LKEY_FLAG_REMOTE_RD;
+   hdr->lkey = wr->key;
+   hdr->total_len = mr->ibmr.length;
+
+   fbo = mr->ibmr.iova - mr->pages[0];
+
+   fast_reg->va_hi = upper_32_bits(mr->ibmr.iova);
+   fast_reg->va_lo = (u32) (mr->ibmr.iova & 0x);
+   fast_reg->fbo_hi = upper_32_bits(fbo);
+   fast_reg->fbo_lo = (u32) fbo & 0x;
+   fast_reg->num_sges = mr->npages;
+   fast_reg->size_sge = get_encoded_page_size(mr->ibmr.page_size);
+
+   pbe = pbl_tbl->va;
+   for (i = 0; i < mr->npages; i++) {
+   u64 buf_addr = mr->pages[i];
+   pbe->pa_lo = cpu_to_le32((u32) (buf_addr & PAGE_MASK));
+   pbe->pa_hi = cpu_to_le32((u32) upper_32_bits(buf_addr));
+   num_pbes += 1;
+   pbe++;
+
+   /* if the pbl is full storing the pbes,
+* move to next pbl.
+   */
+   if (num_pbes == (mr->hwmr.pbl_size/sizeof(u64))) {
+   pbl_tbl++;
+   pbe = (struct ocrdma_pbe *)pbl_tbl->va;
+   }
+   }
+
+   return 0;
+}
 
 static int ocrdma_build_fr(struct ocrdma_qp *qp, struct ocrdma_hdr_wqe *hdr,
   struct ib_send_wr *send_wr)
@@ -2304,6 +2359,9 @@ int ocrdma_post_send(struct ib_qp *ibqp, struct 
ib_send_wr *wr,
case IB_WR_FAST_REG_MR:
status = ocrdma_build_fr(qp, hdr, wr);
break;
+   case IB_WR_REG_MR:
+   status = ocrdma_build_reg(qp, hdr, reg_wr(wr));
+   

[PATCH v1 01/24] IB/core: Introduce new fast registration API

2015-09-17 Thread Sagi Grimberg
The new fast registration verb ib_map_mr_sg receives a scatterlist
and converts it to a page list under the verbs API, thus hiding
the HW-specific mapping details from the consumer.

The provider drivers are provided with a generic helper ib_sg_to_pages
that converts a scatterlist into a vector of page addresses. The
drivers can still perform any HW specific page address setting
by passing a set_page function pointer which will be invoked for
each page address. This allows drivers to avoid keeping shadow
page vectors and converting them to HW-specific translations with
extra copies.
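
A provider-side map_mr_sg handler built on ib_sg_to_pages can then be
as small as the sketch below (illustrative only; my_mr, my_set_page and
my_map_mr_sg are made-up names, and the cxgb3 and nes patches later in
the series follow this exact pattern):

struct my_mr {
	struct ib_mr	ibmr;
	u64		*pages;		/* allocated in alloc_mr */
	u32		npages;
	u32		max_pages;
};

static int my_set_page(struct ib_mr *ibmr, u64 addr)
{
	struct my_mr *mr = container_of(ibmr, struct my_mr, ibmr);

	if (unlikely(mr->npages == mr->max_pages))
		return -ENOMEM;

	mr->pages[mr->npages++] = addr;
	return 0;
}

static int my_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
			unsigned int sg_nents)
{
	struct my_mr *mr = container_of(ibmr, struct my_mr, ibmr);

	mr->npages = 0;
	return ib_sg_to_pages(ibmr, sg, sg_nents, my_set_page);
}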

This API will allow ULPs to remove the duplicated code of constructing
a page vector from a given sg list.

The send work request ib_reg_wr also shrinks, as it carries only the
mr, key and access flags in addition to the standard fields.
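
As a rough consumer-side illustration (not part of this patch),
registering a DMA-mapped sg list with the new verb might look like the
sketch below; the function name, error handling and access flags are
made up for the example, and only verbs introduced or used elsewhere in
this series are assumed (with <rdma/ib_verbs.h> included):

static int my_register_mr(struct ib_qp *qp, struct ib_mr *mr,
			  struct scatterlist *sg, unsigned int sg_nents)
{
	struct ib_reg_wr wr;
	struct ib_send_wr *bad_wr;
	int n;

	n = ib_map_mr_sg(mr, sg, sg_nents, PAGE_SIZE);
	if (n < 0)
		return n;
	if (n < sg_nents)
		return -EINVAL;	/* caller must handle a partial mapping */

	memset(&wr, 0, sizeof(wr));
	wr.wr.opcode = IB_WR_REG_MR;
	wr.wr.send_flags = IB_SEND_SIGNALED;
	wr.mr = mr;
	wr.key = mr->rkey;
	wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ;

	return ib_post_send(qp, &wr.wr, &bad_wr);
}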

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/core/verbs.c | 107 
 include/rdma/ib_verbs.h |  30 +++
 2 files changed, 137 insertions(+)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index e1f2c9887f3f..d99f57f1f737 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1469,3 +1469,110 @@ int ib_check_mr_status(struct ib_mr *mr, u32 check_mask,
mr->device->check_mr_status(mr, check_mask, mr_status) : 
-ENOSYS;
 }
 EXPORT_SYMBOL(ib_check_mr_status);
+
+/**
+ * ib_map_mr_sg() - Map a memory region with the largest prefix of
+ * a dma mapped SG list
+ * @mr:memory region
+ * @sg:dma mapped scatterlist
+ * @sg_nents:  number of entries in sg
+ * @page_size: page vector desired page size
+ *
+ * Constraints:
+ * - The first sg element is allowed to have an offset.
+ * - Each sg element must be aligned to page_size (or physically
+ *   contiguous to the previous element). In case an sg element has a
+ *   non contiguous offset, the mapping prefix will not include it.
+ * - The last sg element is allowed to have length less than page_size.
+ * - If sg_nents total byte length exceeds the mr max_num_sge * page_size
+ *   then only max_num_sg entries will be mapped.
+ *
+ * Returns the number of sg elements that were mapped to the memory region.
+ *
+ * After this completes successfully, the  memory region
+ * is ready for registration.
+ */
+int ib_map_mr_sg(struct ib_mr *mr,
+struct scatterlist *sg,
+unsigned int sg_nents,
+unsigned int page_size)
+{
+   if (unlikely(!mr->device->map_mr_sg))
+   return -ENOSYS;
+
+   mr->page_size = page_size;
+
+   return mr->device->map_mr_sg(mr, sg, sg_nents);
+}
+EXPORT_SYMBOL(ib_map_mr_sg);
+
+/**
+ * ib_sg_to_pages() - Convert the largest prefix of a sg list
+ * to a page vector
+ * @mr:memory region
+ * @sgl:   dma mapped scatterlist
+ * @sg_nents:  number of entries in sg
+ * @set_page:  driver page assignment function pointer
+ *
+ * Core service helper for drivers to convert the largest
+ * prefix of a given sg list to a page vector. The sg list
+ * prefix converted is the prefix that meets the requirements
+ * of ib_map_mr_sg.
+ *
+ * Returns the number of sg elements that were assigned to
+ * a page vector.
+ */
+int ib_sg_to_pages(struct ib_mr *mr,
+  struct scatterlist *sgl,
+  unsigned int sg_nents,
+  int (*set_page)(struct ib_mr *, u64))
+{
+   struct scatterlist *sg;
+   u64 last_end_dma_addr = 0, last_page_addr = 0;
+   unsigned int last_page_off = 0;
+   u64 page_mask = ~((u64)mr->page_size - 1);
+   int i;
+
+   mr->iova = sg_dma_address(&sgl[0]);
+   mr->length = 0;
+
+   for_each_sg(sgl, sg, sg_nents, i) {
+   u64 dma_addr = sg_dma_address(sg);
+   unsigned int dma_len = sg_dma_len(sg);
+   u64 end_dma_addr = dma_addr + dma_len;
+   u64 page_addr = dma_addr & page_mask;
+
+   if (i && page_addr != dma_addr) {
+   if (last_end_dma_addr != dma_addr) {
+   /* gap */
+   goto done;
+
+   } else if (last_page_off + dma_len < mr->page_size) {
+   /* chunk this fragment with the last */
+   last_end_dma_addr += dma_len;
+   last_page_off += dma_len;
+   mr->length += dma_len;
+   continue;
+   } else {
+   /* map starting from the next page */
+   page_addr = last_page_addr + mr->page_size;
+   dma_len -= mr->page_size - last_page_off;
+   }
+   }
+
+   do {
+   if (unlikely(set_page(mr, page_addr)))
+   goto done;
+   

[PATCH v1 16/24] IB/mlx5: Remove old FRWR API support

2015-09-17 Thread Sagi Grimberg
No ULP uses it anymore, go ahead and remove it.
Keep only the local invalidate part of the handlers.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/mlx5/cq.c  |  3 --
 drivers/infiniband/hw/mlx5/main.c|  2 -
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 14 --
 drivers/infiniband/hw/mlx5/mr.c  | 42 
 drivers/infiniband/hw/mlx5/qp.c  | 97 
 5 files changed, 9 insertions(+), 149 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 90daf791d51d..640c54ef5eed 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -111,9 +111,6 @@ static enum ib_wc_opcode get_umr_comp(struct mlx5_ib_wq 
*wq, int idx)
case IB_WR_REG_MR:
return IB_WC_REG_MR;
 
-   case IB_WR_FAST_REG_MR:
-   return IB_WC_FAST_REG_MR;
-
default:
pr_warn("unknown completion status\n");
return 0;
diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 7ebce545daf1..32f20d0fd632 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1433,8 +1433,6 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
dev->ib_dev.process_mad = mlx5_ib_process_mad;
dev->ib_dev.alloc_mr= mlx5_ib_alloc_mr;
dev->ib_dev.map_mr_sg   = mlx5_ib_map_mr_sg;
-   dev->ib_dev.alloc_fast_reg_page_list = mlx5_ib_alloc_fast_reg_page_list;
-   dev->ib_dev.free_fast_reg_page_list  = mlx5_ib_free_fast_reg_page_list;
dev->ib_dev.check_mr_status = mlx5_ib_check_mr_status;
dev->ib_dev.get_port_immutable  = mlx5_port_immutable;
 
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index bc1853f8e67d..91062d648125 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -337,12 +337,6 @@ struct mlx5_ib_mr {
int live;
 };
 
-struct mlx5_ib_fast_reg_page_list {
-   struct ib_fast_reg_page_listibfrpl;
-   __be64 *mapped_page_list;
-   dma_addr_t  map;
-};
-
 struct mlx5_ib_umr_context {
enum ib_wc_status   status;
struct completion   done;
@@ -493,11 +487,6 @@ static inline struct mlx5_ib_mr *to_mmr(struct ib_mr *ibmr)
return container_of(ibmr, struct mlx5_ib_mr, ibmr);
 }
 
-static inline struct mlx5_ib_fast_reg_page_list *to_mfrpl(struct 
ib_fast_reg_page_list *ibfrpl)
-{
-   return container_of(ibfrpl, struct mlx5_ib_fast_reg_page_list, ibfrpl);
-}
-
 struct mlx5_ib_ah {
struct ib_ahibah;
struct mlx5_av  av;
@@ -568,9 +557,6 @@ struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
 int mlx5_ib_map_mr_sg(struct ib_mr *ibmr,
  struct scatterlist *sg,
  unsigned int sg_nents);
-struct ib_fast_reg_page_list *mlx5_ib_alloc_fast_reg_page_list(struct 
ib_device *ibdev,
-  int 
page_list_len);
-void mlx5_ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list);
 int mlx5_ib_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num,
const struct ib_wc *in_wc, const struct ib_grh *in_grh,
const struct ib_mad_hdr *in, size_t in_mad_size,
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 2f3b648719da..9f662d48606d 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1378,48 +1378,6 @@ err_free:
return ERR_PTR(err);
 }
 
-struct ib_fast_reg_page_list *mlx5_ib_alloc_fast_reg_page_list(struct 
ib_device *ibdev,
-  int 
page_list_len)
-{
-   struct mlx5_ib_fast_reg_page_list *mfrpl;
-   int size = page_list_len * sizeof(u64);
-
-   mfrpl = kmalloc(sizeof(*mfrpl), GFP_KERNEL);
-   if (!mfrpl)
-   return ERR_PTR(-ENOMEM);
-
-   mfrpl->ibfrpl.page_list = kmalloc(size, GFP_KERNEL);
-   if (!mfrpl->ibfrpl.page_list)
-   goto err_free;
-
-   mfrpl->mapped_page_list = dma_alloc_coherent(ibdev->dma_device,
-size, &mfrpl->map,
-GFP_KERNEL);
-   if (!mfrpl->mapped_page_list)
-   goto err_free;
-
-   WARN_ON(mfrpl->map & 0x3f);
-
-   return &mfrpl->ibfrpl;
-
-err_free:
-   kfree(mfrpl->ibfrpl.page_list);
-   kfree(mfrpl);
-   return ERR_PTR(-ENOMEM);
-}
-
-void mlx5_ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list)
-{
-   struct mlx5_ib_fast_reg_page_list *mfrpl = to_mfrpl(page_list);
-   struct mlx5_ib_dev *dev = to_mdev(page_list->device);
-   int size = page_list->max_page_list_len * sizeof(u64);

[PATCH v1 06/24] RDMA/cxgb3: Support the new memory registration API

2015-09-17 Thread Sagi Grimberg
Support the new memory registration API by allocating a
private page list array in iwch_mr and populate it when
iwch_map_mr_sg is invoked. Also, support IB_WR_REG_MR
by duplicating build_fastreg, but take the needed information
from different places:
- page_size, iova, length (ib_mr)
- page array (iwch_mr)
- key, access flags (ib_reg_wr)

The IB_WR_FAST_REG_MR handlers will be removed later when
all the ULPs will be converted.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/cxgb3/iwch_provider.c | 33 
 drivers/infiniband/hw/cxgb3/iwch_provider.h |  2 ++
 drivers/infiniband/hw/cxgb3/iwch_qp.c   | 48 +
 3 files changed, 83 insertions(+)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c 
b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index 93308c45f298..ee3d5ca7de6c 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -463,6 +463,7 @@ static int iwch_dereg_mr(struct ib_mr *ib_mr)
return -EINVAL;
 
mhp = to_iwch_mr(ib_mr);
+   kfree(mhp->pages);
rhp = mhp->rhp;
mmid = mhp->attr.stag >> 8;
cxio_dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size,
@@ -821,6 +822,12 @@ static struct ib_mr *iwch_alloc_mr(struct ib_pd *pd,
if (!mhp)
goto err;
 
+   mhp->pages = kcalloc(max_num_sg, sizeof(u64), GFP_KERNEL);
+   if (!mhp->pages) {
+   ret = -ENOMEM;
+   goto pl_err;
+   }
+
mhp->rhp = rhp;
ret = iwch_alloc_pbl(mhp, max_num_sg);
if (ret)
@@ -847,11 +854,36 @@ err3:
 err2:
iwch_free_pbl(mhp);
 err1:
+   kfree(mhp->pages);
+pl_err:
kfree(mhp);
 err:
return ERR_PTR(ret);
 }
 
+static int iwch_set_page(struct ib_mr *ibmr, u64 addr)
+{
+   struct iwch_mr *mhp = to_iwch_mr(ibmr);
+
+   if (unlikely(mhp->npages == mhp->attr.pbl_size))
+   return -ENOMEM;
+
+   mhp->pages[mhp->npages++] = addr;
+
+   return 0;
+}
+
+static int iwch_map_mr_sg(struct ib_mr *ibmr,
+ struct scatterlist *sg,
+ unsigned int sg_nents)
+{
+   struct iwch_mr *mhp = to_iwch_mr(ibmr);
+
+   mhp->npages = 0;
+
+   return ib_sg_to_pages(ibmr, sg, sg_nents, iwch_set_page);
+}
+
 static struct ib_fast_reg_page_list *iwch_alloc_fastreg_pbl(
struct ib_device *device,
int page_list_len)
@@ -1450,6 +1482,7 @@ int iwch_register_device(struct iwch_dev *dev)
dev->ibdev.bind_mw = iwch_bind_mw;
dev->ibdev.dealloc_mw = iwch_dealloc_mw;
dev->ibdev.alloc_mr = iwch_alloc_mr;
+   dev->ibdev.map_mr_sg = iwch_map_mr_sg;
dev->ibdev.alloc_fast_reg_page_list = iwch_alloc_fastreg_pbl;
dev->ibdev.free_fast_reg_page_list = iwch_free_fastreg_pbl;
dev->ibdev.attach_mcast = iwch_multicast_attach;
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.h 
b/drivers/infiniband/hw/cxgb3/iwch_provider.h
index 87c14b0c5ac0..2ac85b86a680 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.h
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.h
@@ -77,6 +77,8 @@ struct iwch_mr {
struct iwch_dev *rhp;
u64 kva;
struct tpt_attributes attr;
+   u64 *pages;
+   u32 npages;
 };
 
 typedef struct iwch_mw iwch_mw_handle;
diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c 
b/drivers/infiniband/hw/cxgb3/iwch_qp.c
index bac0508fedd9..a09ea538e990 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -146,6 +146,49 @@ static int build_rdma_read(union t3_wr *wqe, struct 
ib_send_wr *wr,
return 0;
 }
 
+static int build_memreg(union t3_wr *wqe, struct ib_reg_wr *wr,
+ u8 *flit_cnt, int *wr_cnt, struct t3_wq *wq)
+{
+   struct iwch_mr *mhp = to_iwch_mr(wr->mr);
+   int i;
+   __be64 *p;
+
+   if (mhp->npages > T3_MAX_FASTREG_DEPTH)
+   return -EINVAL;
+   *wr_cnt = 1;
+   wqe->fastreg.stag = cpu_to_be32(wr->key);
+   wqe->fastreg.len = cpu_to_be32(mhp->ibmr.length);
+   wqe->fastreg.va_base_hi = cpu_to_be32(mhp->ibmr.iova >> 32);
+   wqe->fastreg.va_base_lo_fbo =
+   cpu_to_be32(mhp->ibmr.iova & 0x);
+   wqe->fastreg.page_type_perms = cpu_to_be32(
+   V_FR_PAGE_COUNT(mhp->npages) |
+   V_FR_PAGE_SIZE(ilog2(wr->mr->page_size) - 12) |
+   V_FR_TYPE(TPT_VATO) |
+   V_FR_PERMS(iwch_ib_to_tpt_access(wr->access)));
+   p = &wqe->fastreg.pbl_addrs[0];
+   for (i = 0; i < mhp->npages; i++, p++) {
+
+   /* If we need a 2nd WR, then set it up */
+   if (i == T3_MAX_FASTREG_FRAG) {
+   *wr_cnt = 2;
+   wqe = (union t3_wr *)(wq->queue +
+   Q_PTR2IDX((wq->wptr+1), wq->size_l

[PATCH v1 04/24] IB/mlx4: Support the new memory registration API

2015-09-17 Thread Sagi Grimberg
Support the new memory registration API by allocating a
private page list array in mlx4_ib_mr and populate it when
mlx4_ib_map_mr_sg is invoked. Also, support IB_WR_REG_MR
by building the exact same WQE as for IB_WR_FAST_REG_MR, just taking the
needed information from different places:
- page_size, iova, length, access flags (ib_mr)
- page array (mlx4_ib_mr)
- key (ib_reg_wr)

The IB_WR_FAST_REG_MR handlers will be removed later when
all the ULPs will be converted.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/mlx4/cq.c  |  1 +
 drivers/infiniband/hw/mlx4/main.c|  1 +
 drivers/infiniband/hw/mlx4/mlx4_ib.h |  7 
 drivers/infiniband/hw/mlx4/mr.c  | 72 +---
 drivers/infiniband/hw/mlx4/qp.c  | 25 +
 5 files changed, 100 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index 2f4259525bb1..b62236e24708 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -819,6 +819,7 @@ repoll:
break;
case MLX4_OPCODE_FMR:
wc->opcode= IB_WC_FAST_REG_MR;
+   /* TODO: wc->opcode= IB_WC_REG_MR; */
break;
case MLX4_OPCODE_LOCAL_INVAL:
wc->opcode= IB_WC_LOCAL_INV;
diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index efecdf0216d8..bb82f5fa1612 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -2247,6 +2247,7 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
ibdev->ib_dev.rereg_user_mr = mlx4_ib_rereg_user_mr;
ibdev->ib_dev.dereg_mr  = mlx4_ib_dereg_mr;
ibdev->ib_dev.alloc_mr  = mlx4_ib_alloc_mr;
+   ibdev->ib_dev.map_mr_sg = mlx4_ib_map_mr_sg;
ibdev->ib_dev.alloc_fast_reg_page_list = 
mlx4_ib_alloc_fast_reg_page_list;
ibdev->ib_dev.free_fast_reg_page_list  = 
mlx4_ib_free_fast_reg_page_list;
ibdev->ib_dev.attach_mcast  = mlx4_ib_mcg_attach;
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h 
b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 1e7b23bb2eb0..07fcf3a49256 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -131,6 +131,10 @@ struct mlx4_ib_cq {
 
 struct mlx4_ib_mr {
struct ib_mribmr;
+   __be64  *pages;
+   dma_addr_t  page_map;
+   u32 npages;
+   u32 max_pages;
struct mlx4_mr  mmr;
struct ib_umem *umem;
 };
@@ -706,6 +710,9 @@ int mlx4_ib_dealloc_mw(struct ib_mw *mw);
 struct ib_mr *mlx4_ib_alloc_mr(struct ib_pd *pd,
   enum ib_mr_type mr_type,
   u32 max_num_sg);
+int mlx4_ib_map_mr_sg(struct ib_mr *ibmr,
+ struct scatterlist *sg,
+ unsigned int sg_nents);
 struct ib_fast_reg_page_list *mlx4_ib_alloc_fast_reg_page_list(struct 
ib_device *ibdev,
   int 
page_list_len);
 void mlx4_ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list);
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index 5bba176e9dfa..6ed745798ad3 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -59,7 +59,7 @@ struct ib_mr *mlx4_ib_get_dma_mr(struct ib_pd *pd, int acc)
struct mlx4_ib_mr *mr;
int err;
 
-   mr = kmalloc(sizeof *mr, GFP_KERNEL);
+   mr = kzalloc(sizeof *mr, GFP_KERNEL);
if (!mr)
return ERR_PTR(-ENOMEM);
 
@@ -140,7 +140,7 @@ struct ib_mr *mlx4_ib_reg_user_mr(struct ib_pd *pd, u64 
start, u64 length,
int err;
int n;
 
-   mr = kmalloc(sizeof *mr, GFP_KERNEL);
+   mr = kzalloc(sizeof *mr, GFP_KERNEL);
if (!mr)
return ERR_PTR(-ENOMEM);
 
@@ -271,11 +271,41 @@ release_mpt_entry:
return err;
 }
 
+static int
+mlx4_alloc_priv_pages(struct ib_device *device,
+ struct mlx4_ib_mr *mr,
+ int max_pages)
+{
+   int size = max_pages * sizeof(u64);
+
+   mr->pages = dma_alloc_coherent(device->dma_device, size,
+  &mr->page_map, GFP_KERNEL);
+   if (!mr->pages)
+   return -ENOMEM;
+
+   return 0;
+}
+
+static void
+mlx4_free_priv_pages(struct mlx4_ib_mr *mr)
+{
+   struct ib_device *device = mr->ibmr.device;
+   int size = mr->max_pages * sizeof(u64);
+
+   if (mr->pages) {
+   dma_free_coherent(device->dma_device, size,
+ mr->pages, mr->page_map);
+   mr->pages = NULL;
+   }
+}
+
 int mlx4_ib_dereg_mr(struct ib_mr *ibmr)
 {
struct mlx4_ib_mr *mr = to_mmr(ibmr);
int ret;
 
+   mlx4_free

[PATCH v1 12/24] xprtrdma: Port to new memory registration API

2015-09-17 Thread Sagi Grimberg
Instead of maintaining a fastreg page list, keep an sg table
and convert an array of pages to a sg list. Then call ib_map_mr_sg
and construct ib_reg_wr.
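
For illustration only (not code from this patch), the page-array-to-sg
conversion described above boils down to something like the sketch
below; pages, npages and the DMA direction are assumed to come from the
caller, and <rdma/ib_verbs.h> plus <linux/scatterlist.h> are assumed:

static int map_pages_with_mr(struct ib_device *dev, struct ib_mr *mr,
			     struct page **pages, unsigned int npages,
			     struct scatterlist *sg)
{
	unsigned int i;
	int nents, n;

	/* build an sg table from the page array */
	sg_init_table(sg, npages);
	for (i = 0; i < npages; i++)
		sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);

	nents = ib_dma_map_sg(dev, sg, npages, DMA_TO_DEVICE);
	if (!nents)
		return -ENOMEM;

	/* hand the mapped sg list to the new registration verb */
	n = ib_map_mr_sg(mr, sg, nents, PAGE_SIZE);
	if (n < 0)
		return n;
	return n == nents ? 0 : -EINVAL;
}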

Note that the next step would be to have NFS work with sg lists
as it maps well to sk_frags (see comment from hch
http://marc.info/?l=linux-rdma&m=143677002622296&w=2).

Signed-off-by: Sagi Grimberg 
---
 net/sunrpc/xprtrdma/frwr_ops.c  | 112 +++-
 net/sunrpc/xprtrdma/xprt_rdma.h |   3 +-
 2 files changed, 67 insertions(+), 48 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 0d2f46f600b6..4d0221ccb043 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -151,9 +151,13 @@ __frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, struct 
ib_device *device,
f->fr_mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, depth);
if (IS_ERR(f->fr_mr))
goto out_mr_err;
-   f->fr_pgl = ib_alloc_fast_reg_page_list(device, depth);
-   if (IS_ERR(f->fr_pgl))
+
+   f->sg = kcalloc(sizeof(*f->sg), depth, GFP_KERNEL);
+   if (IS_ERR(f->sg))
goto out_list_err;
+
+   sg_init_table(f->sg, depth);
+
return 0;
 
 out_mr_err:
@@ -163,7 +167,7 @@ out_mr_err:
return rc;
 
 out_list_err:
-   rc = PTR_ERR(f->fr_pgl);
+   rc = -ENOMEM;
dprintk("RPC:   %s: ib_alloc_fast_reg_page_list status %i\n",
__func__, rc);
ib_dereg_mr(f->fr_mr);
@@ -179,7 +183,7 @@ __frwr_release(struct rpcrdma_mw *r)
if (rc)
dprintk("RPC:   %s: ib_dereg_mr status %i\n",
__func__, rc);
-   ib_free_fast_reg_page_list(r->r.frmr.fr_pgl);
+   kfree(r->r.frmr.sg);
 }
 
 static int
@@ -312,14 +316,11 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
struct rpcrdma_mw *mw;
struct rpcrdma_frmr *frmr;
struct ib_mr *mr;
-   struct ib_fast_reg_wr fastreg_wr;
+   struct ib_reg_wr reg_wr;
struct ib_send_wr *bad_wr;
+   unsigned int dma_nents;
u8 key;
-   int len, pageoff;
-   int i, rc;
-   int seg_len;
-   u64 pa;
-   int page_no;
+   int i, rc, len, n;
 
mw = seg1->rl_mw;
seg1->rl_mw = NULL;
@@ -332,64 +333,80 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
} while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
frmr = &mw->r.frmr;
frmr->fr_state = FRMR_IS_VALID;
+   mr = frmr->fr_mr;
 
-   pageoff = offset_in_page(seg1->mr_offset);
-   seg1->mr_offset -= pageoff; /* start of page */
-   seg1->mr_len += pageoff;
-   len = -pageoff;
if (nsegs > ia->ri_max_frmr_depth)
nsegs = ia->ri_max_frmr_depth;
 
-   for (page_no = i = 0; i < nsegs;) {
-   rpcrdma_map_one(device, seg, direction);
-   pa = seg->mr_dma;
-   for (seg_len = seg->mr_len; seg_len > 0; seg_len -= PAGE_SIZE) {
-   frmr->fr_pgl->page_list[page_no++] = pa;
-   pa += PAGE_SIZE;
-   }
+   for (len = 0, i = 0; i < nsegs;) {
+   if (seg->mr_page)
+   sg_set_page(&frmr->sg[i],
+   seg->mr_page,
+   seg->mr_len,
+   offset_in_page(seg->mr_offset));
+   else
+   sg_set_buf(&frmr->sg[i], seg->mr_offset,
+  seg->mr_len);
+
len += seg->mr_len;
++seg;
++i;
+
/* Check for holes */
if ((i < nsegs && offset_in_page(seg->mr_offset)) ||
offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
break;
}
+   frmr->sg_nents = i;
+
+   dma_nents = ib_dma_map_sg(device, frmr->sg, frmr->sg_nents, direction);
+   if (!dma_nents) {
+   pr_err("RPC:   %s: failed to dma map sg %p sg_nents %d\n",
+   __func__, frmr->sg, frmr->sg_nents);
+   return -ENOMEM;
+   }
+
+   n = ib_map_mr_sg(mr, frmr->sg, frmr->sg_nents, PAGE_SIZE);
+   if (unlikely(n != frmr->sg_nents)) {
+   pr_err("RPC:   %s: failed to map mr %p (%d/%d)\n",
+   __func__, frmr->fr_mr, n, frmr->sg_nents);
+   rc = n < 0 ? n : -EINVAL;
+   goto out_senderr;
+   }
+
dprintk("RPC:   %s: Using frmr %p to map %d segments (%d bytes)\n",
-   __func__, mw, i, len);
-
-   memset(&fastreg_wr, 0, sizeof(fastreg_wr));
-   fastreg_wr.wr.wr_id = (unsigned long)(void *)mw;
-   fastreg_wr.wr.opcode = IB_WR_FAST_REG_MR;
-   fastreg_wr.iova_start = seg1->mr_dma + pageoff;
-   fastreg_wr.page_list = frmr->fr_pgl;
-   fastreg_wr.page_shift = PAGE_SHIFT;
-   fastreg_wr.page_li

[PATCH v1 13/24] svcrdma: Port to new memory registration API

2015-09-17 Thread Sagi Grimberg
Instead of maintaining a fastreg page list, keep an sg table
and convert an array of pages to a sg list. Then call ib_map_mr_sg
and construct ib_reg_wr.

Signed-off-by: Sagi Grimberg 
---
 include/linux/sunrpc/svc_rdma.h  |  6 +--
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  | 76 ++--
 net/sunrpc/xprtrdma/svc_rdma_transport.c | 34 +-
 3 files changed, 55 insertions(+), 61 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 83211bc9219e..e240d102a911 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -105,11 +105,9 @@ struct svc_rdma_chunk_sge {
 };
 struct svc_rdma_fastreg_mr {
struct ib_mr *mr;
-   void *kva;
-   struct ib_fast_reg_page_list *page_list;
-   int page_list_len;
+   struct scatterlist *sg;
+   unsigned int sg_nents;
unsigned long access_flags;
-   unsigned long map_len;
enum dma_data_direction direction;
struct list_head frmr_list;
 };
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 7be42d0da19e..303f194970f9 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -220,12 +220,12 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
 {
struct ib_rdma_wr read_wr;
struct ib_send_wr inv_wr;
-   struct ib_fast_reg_wr fastreg_wr;
+   struct ib_reg_wr reg_wr;
u8 key;
-   int pages_needed = PAGE_ALIGN(*page_offset + rs_length) >> PAGE_SHIFT;
+   unsigned int nents = PAGE_ALIGN(*page_offset + rs_length) >> PAGE_SHIFT;
struct svc_rdma_op_ctxt *ctxt = svc_rdma_get_context(xprt);
struct svc_rdma_fastreg_mr *frmr = svc_rdma_get_frmr(xprt);
-   int ret, read, pno;
+   int ret, read, pno, dma_nents, n;
u32 pg_off = *page_offset;
u32 pg_no = *page_no;
 
@@ -234,16 +234,14 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
 
ctxt->direction = DMA_FROM_DEVICE;
ctxt->frmr = frmr;
-   pages_needed = min_t(int, pages_needed, xprt->sc_frmr_pg_list_len);
-   read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
+   nents = min_t(unsigned int, nents, xprt->sc_frmr_pg_list_len);
+   read = min_t(int, nents << PAGE_SHIFT, rs_length);
 
-   frmr->kva = page_address(rqstp->rq_arg.pages[pg_no]);
frmr->direction = DMA_FROM_DEVICE;
frmr->access_flags = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE);
-   frmr->map_len = pages_needed << PAGE_SHIFT;
-   frmr->page_list_len = pages_needed;
+   frmr->sg_nents = nents;
 
-   for (pno = 0; pno < pages_needed; pno++) {
+   for (pno = 0; pno < nents; pno++) {
int len = min_t(int, rs_length, PAGE_SIZE - pg_off);
 
head->arg.pages[pg_no] = rqstp->rq_arg.pages[pg_no];
@@ -251,17 +249,12 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
head->arg.len += len;
if (!pg_off)
head->count++;
+
+   sg_set_page(&frmr->sg[pno], rqstp->rq_arg.pages[pg_no],
+   len, pg_off);
+
rqstp->rq_respages = &rqstp->rq_arg.pages[pg_no+1];
rqstp->rq_next_page = rqstp->rq_respages + 1;
-   frmr->page_list->page_list[pno] =
-   ib_dma_map_page(xprt->sc_cm_id->device,
-   head->arg.pages[pg_no], 0,
-   PAGE_SIZE, DMA_FROM_DEVICE);
-   ret = ib_dma_mapping_error(xprt->sc_cm_id->device,
-  frmr->page_list->page_list[pno]);
-   if (ret)
-   goto err;
-   atomic_inc(&xprt->sc_dma_used);
 
/* adjust offset and wrap to next page if needed */
pg_off += len;
@@ -277,28 +270,42 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
else
clear_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
 
+   dma_nents = ib_dma_map_sg(xprt->sc_cm_id->device,
+ frmr->sg, frmr->sg_nents,
+ frmr->direction);
+   if (!dma_nents) {
+   pr_err("svcrdma: failed to dma map sg %p\n",
+  frmr->sg);
+   return -ENOMEM;
+   }
+   atomic_inc(&xprt->sc_dma_used);
+
+   n = ib_map_mr_sg(frmr->mr, frmr->sg, frmr->sg_nents, PAGE_SIZE);
+   if (unlikely(n != frmr->sg_nents)) {
+   pr_err("svcrdma: failed to map mr %p (%d/%d elements)\n",
+  frmr->mr, n, frmr->sg_nents);
+   return n < 0 ? n : -EINVAL;
+   }
+
/* Bump the key */
key = (u8)(frmr->mr->lkey & 0x00FF);
ib_update_fast_reg_key(frmr->mr, ++key);
 
-   ctxt->sge[0].addr = (unsigned long)frmr->kva + *page_offset;
+   ctxt->sge[0].addr = frmr->

[PATCH v1 09/24] RDMA/nes: Support the new memory registration API

2015-09-17 Thread Sagi Grimberg
Support the new memory registration API by allocating a
private page list array in nes_mr and populate it when
nes_map_mr_sg is invoked. Also, support IB_WR_REG_MR
by duplicating IB_WR_FAST_REG_MR handling and take the
needed information from different places:
- page_size, iova, length (ib_mr)
- page array (nes_mr)
- key, access flags (ib_reg_wr)

The IB_WR_FAST_REG_MR handlers will be removed later when
all the ULPs will be converted.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/nes/nes_verbs.c | 115 ++
 drivers/infiniband/hw/nes/nes_verbs.h |   4 ++
 2 files changed, 119 insertions(+)

diff --git a/drivers/infiniband/hw/nes/nes_verbs.c 
b/drivers/infiniband/hw/nes/nes_verbs.c
index f71b37b75f82..ba069ec2ebf9 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -51,6 +51,7 @@ atomic_t qps_created;
 atomic_t sw_qps_destroyed;
 
 static void nes_unregister_ofa_device(struct nes_ib_device *nesibdev);
+static int nes_dereg_mr(struct ib_mr *ib_mr);
 
 /**
  * nes_alloc_mw
@@ -445,7 +446,44 @@ static struct ib_mr *nes_alloc_mr(struct ib_pd *ibpd,
nes_free_resource(nesadapter, nesadapter->allocated_mrs, 
stag_index);
ibmr = ERR_PTR(-ENOMEM);
}
+
+   nesmr->pages = pci_alloc_consistent(nesdev->pcidev,
+   max_num_sg * sizeof(u64),
+   &nesmr->paddr);
+   if (!nesmr->paddr)
+   goto err;
+
+   nesmr->max_pages = max_num_sg;
+
return ibmr;
+
+err:
+   nes_dereg_mr(ibmr);
+
+   return ERR_PTR(-ENOMEM);
+}
+
+static int nes_set_page(struct ib_mr *ibmr, u64 addr)
+{
+   struct nes_mr *nesmr = to_nesmr(ibmr);
+
+   if (unlikely(nesmr->npages == nesmr->max_pages))
+   return -ENOMEM;
+
+   nesmr->pages[nesmr->npages++] = cpu_to_le64(addr);
+
+   return 0;
+}
+
+static int nes_map_mr_sg(struct ib_mr *ibmr,
+struct scatterlist *sg,
+unsigned int sg_nents)
+{
+   struct nes_mr *nesmr = to_nesmr(ibmr);
+
+   nesmr->npages = 0;
+
+   return ib_sg_to_pages(ibmr, sg, sg_nents, nes_set_page);
 }
 
 /*
@@ -2683,6 +2721,13 @@ static int nes_dereg_mr(struct ib_mr *ib_mr)
u16 major_code;
u16 minor_code;
 
+
+   if (nesmr->pages)
+   pci_free_consistent(nesdev->pcidev,
+   nesmr->max_pages * sizeof(u64),
+   nesmr->pages,
+   nesmr->paddr);
+
if (nesmr->region) {
ib_umem_release(nesmr->region);
}
@@ -3513,6 +3558,75 @@ static int nes_post_send(struct ib_qp *ibqp, struct 
ib_send_wr *ib_wr,
  wqe_misc);
break;
}
+   case IB_WR_REG_MR:
+   {
+   struct nes_mr *mr = to_nesmr(reg_wr(ib_wr)->mr);
+   int page_shift = ilog2(reg_wr(ib_wr)->mr->page_size);
+   int flags = reg_wr(ib_wr)->access;
+
+   if (mr->npages > (NES_4K_PBL_CHUNK_SIZE / sizeof(u64))) 
{
+   nes_debug(NES_DBG_IW_TX, "SQ_FMR: bad 
page_list_len\n");
+   err = -EINVAL;
+   break;
+   }
+   wqe_misc = NES_IWARP_SQ_OP_FAST_REG;
+   set_wqe_64bit_value(wqe->wqe_words,
+   NES_IWARP_SQ_FMR_WQE_VA_FBO_LOW_IDX,
+   mr->ibmr.iova);
+   set_wqe_32bit_value(wqe->wqe_words,
+   NES_IWARP_SQ_FMR_WQE_LENGTH_LOW_IDX,
+   mr->ibmr.length);
+   set_wqe_32bit_value(wqe->wqe_words,
+   NES_IWARP_SQ_FMR_WQE_LENGTH_HIGH_IDX, 0);
+   set_wqe_32bit_value(wqe->wqe_words,
+   NES_IWARP_SQ_FMR_WQE_MR_STAG_IDX,
+   reg_wr(ib_wr)->key);
+
+   if (page_shift == 12) {
+   wqe_misc |= NES_IWARP_SQ_FMR_WQE_PAGE_SIZE_4K;
+   } else if (page_shift == 21) {
+   wqe_misc |= NES_IWARP_SQ_FMR_WQE_PAGE_SIZE_2M;
+   } else {
+   nes_debug(NES_DBG_IW_TX, "Invalid page shift,"
+ " ib_wr=%u, max=1\n", ib_wr->num_sge);
+   err = -EINVAL;
+   break;
+   }
+
+   /* Set access_flags */
+   wqe_misc |= NES_IWARP_SQ_FMR_WQE_RIGHTS_ENABLE_LOCAL_READ;
+   if (flags & I

[PATCH v1 11/24] iser-target: Port to new memory registration API

2015-09-17 Thread Sagi Grimberg
Remove fastreg page list allocation as the page vector
is now private to the provider. Instead of constructing
the page list and fast_reg work request, call ib_map_mr_sg
and construct ib_reg_wr.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/ulp/isert/ib_isert.c | 129 +++-
 drivers/infiniband/ulp/isert/ib_isert.h |   2 -
 2 files changed, 27 insertions(+), 104 deletions(-)

diff --git a/drivers/infiniband/ulp/isert/ib_isert.c 
b/drivers/infiniband/ulp/isert/ib_isert.c
index dcb29d166211..67d56c3de3dd 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -475,10 +475,8 @@ isert_conn_free_fastreg_pool(struct isert_conn *isert_conn)
list_for_each_entry_safe(fr_desc, tmp,
 &isert_conn->fr_pool, list) {
list_del(&fr_desc->list);
-   ib_free_fast_reg_page_list(fr_desc->data_frpl);
ib_dereg_mr(fr_desc->data_mr);
if (fr_desc->pi_ctx) {
-   ib_free_fast_reg_page_list(fr_desc->pi_ctx->prot_frpl);
ib_dereg_mr(fr_desc->pi_ctx->prot_mr);
ib_dereg_mr(fr_desc->pi_ctx->sig_mr);
kfree(fr_desc->pi_ctx);
@@ -506,22 +504,13 @@ isert_create_pi_ctx(struct fast_reg_descriptor *desc,
return -ENOMEM;
}
 
-   pi_ctx->prot_frpl = ib_alloc_fast_reg_page_list(device,
-   ISCSI_ISER_SG_TABLESIZE);
-   if (IS_ERR(pi_ctx->prot_frpl)) {
-   isert_err("Failed to allocate prot frpl err=%ld\n",
- PTR_ERR(pi_ctx->prot_frpl));
-   ret = PTR_ERR(pi_ctx->prot_frpl);
-   goto err_pi_ctx;
-   }
-
pi_ctx->prot_mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
  ISCSI_ISER_SG_TABLESIZE);
if (IS_ERR(pi_ctx->prot_mr)) {
isert_err("Failed to allocate prot frmr err=%ld\n",
  PTR_ERR(pi_ctx->prot_mr));
ret = PTR_ERR(pi_ctx->prot_mr);
-   goto err_prot_frpl;
+   goto err_pi_ctx;
}
desc->ind |= ISERT_PROT_KEY_VALID;
 
@@ -541,8 +530,6 @@ isert_create_pi_ctx(struct fast_reg_descriptor *desc,
 
 err_prot_mr:
ib_dereg_mr(pi_ctx->prot_mr);
-err_prot_frpl:
-   ib_free_fast_reg_page_list(pi_ctx->prot_frpl);
 err_pi_ctx:
kfree(pi_ctx);
 
@@ -553,34 +540,18 @@ static int
 isert_create_fr_desc(struct ib_device *ib_device, struct ib_pd *pd,
 struct fast_reg_descriptor *fr_desc)
 {
-   int ret;
-
-   fr_desc->data_frpl = ib_alloc_fast_reg_page_list(ib_device,
-ISCSI_ISER_SG_TABLESIZE);
-   if (IS_ERR(fr_desc->data_frpl)) {
-   isert_err("Failed to allocate data frpl err=%ld\n",
- PTR_ERR(fr_desc->data_frpl));
-   return PTR_ERR(fr_desc->data_frpl);
-   }
-
fr_desc->data_mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG,
   ISCSI_ISER_SG_TABLESIZE);
if (IS_ERR(fr_desc->data_mr)) {
isert_err("Failed to allocate data frmr err=%ld\n",
  PTR_ERR(fr_desc->data_mr));
-   ret = PTR_ERR(fr_desc->data_mr);
-   goto err_data_frpl;
+   return PTR_ERR(fr_desc->data_mr);
}
fr_desc->ind |= ISERT_DATA_KEY_VALID;
 
isert_dbg("Created fr_desc %p\n", fr_desc);
 
return 0;
-
-err_data_frpl:
-   ib_free_fast_reg_page_list(fr_desc->data_frpl);
-
-   return ret;
 }
 
 static int
@@ -2516,45 +2487,6 @@ unmap_cmd:
return ret;
 }
 
-static int
-isert_map_fr_pagelist(struct ib_device *ib_dev,
- struct scatterlist *sg_start, int sg_nents, u64 *fr_pl)
-{
-   u64 start_addr, end_addr, page, chunk_start = 0;
-   struct scatterlist *tmp_sg;
-   int i = 0, new_chunk, last_ent, n_pages;
-
-   n_pages = 0;
-   new_chunk = 1;
-   last_ent = sg_nents - 1;
-   for_each_sg(sg_start, tmp_sg, sg_nents, i) {
-   start_addr = ib_sg_dma_address(ib_dev, tmp_sg);
-   if (new_chunk)
-   chunk_start = start_addr;
-   end_addr = start_addr + ib_sg_dma_len(ib_dev, tmp_sg);
-
-   isert_dbg("SGL[%d] dma_addr: 0x%llx len: %u\n",
- i, (unsigned long long)tmp_sg->dma_address,
- tmp_sg->length);
-
-   if ((end_addr & ~PAGE_MASK) && i < last_ent) {
-   new_chunk = 0;
-   continue;
-   }
-   new_chunk = 1;
-
-   page = chunk_start & PAGE_MASK;
-   do {
-   fr_pl[n_pages++] = page;
-   isert_dbg("Mapped page_list[%d] page_addr: 0x%llx\n",
-  

[PATCH v1 20/24] iw_cxgb4: Remove old FRWR API

2015-09-17 Thread Sagi Grimberg
No ULP uses it anymore, go ahead and remove it.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/cxgb4/cq.c   |  2 +-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h | 18 
 drivers/infiniband/hw/cxgb4/mem.c  | 45 
 drivers/infiniband/hw/cxgb4/provider.c |  2 -
 drivers/infiniband/hw/cxgb4/qp.c   | 77 --
 5 files changed, 1 insertion(+), 143 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index c7aab48f07cd..4f8c3ff3da5e 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -752,7 +752,7 @@ static int c4iw_poll_cq_one(struct c4iw_cq *chp, struct 
ib_wc *wc)
wc->opcode = IB_WC_LOCAL_INV;
break;
case FW_RI_FAST_REGISTER:
-   wc->opcode = IB_WC_FAST_REG_MR;
+   wc->opcode = IB_WC_REG_MR;
break;
default:
printk(KERN_ERR MOD "Unexpected opcode %d "
diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h 
b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index 032f90aa8ac9..699c52b875b1 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -409,20 +409,6 @@ static inline struct c4iw_mw *to_c4iw_mw(struct ib_mw 
*ibmw)
return container_of(ibmw, struct c4iw_mw, ibmw);
 }
 
-struct c4iw_fr_page_list {
-   struct ib_fast_reg_page_list ibpl;
-   DEFINE_DMA_UNMAP_ADDR(mapping);
-   dma_addr_t dma_addr;
-   struct c4iw_dev *dev;
-   int pll_len;
-};
-
-static inline struct c4iw_fr_page_list *to_c4iw_fr_page_list(
-   struct ib_fast_reg_page_list *ibpl)
-{
-   return container_of(ibpl, struct c4iw_fr_page_list, ibpl);
-}
-
 struct c4iw_cq {
struct ib_cq ibcq;
struct c4iw_dev *rhp;
@@ -970,10 +956,6 @@ int c4iw_accept_cr(struct iw_cm_id *cm_id, struct 
iw_cm_conn_param *conn_param);
 int c4iw_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len);
 void c4iw_qp_add_ref(struct ib_qp *qp);
 void c4iw_qp_rem_ref(struct ib_qp *qp);
-void c4iw_free_fastreg_pbl(struct ib_fast_reg_page_list *page_list);
-struct ib_fast_reg_page_list *c4iw_alloc_fastreg_pbl(
-   struct ib_device *device,
-   int page_list_len);
 struct ib_mr *c4iw_alloc_mr(struct ib_pd *pd,
enum ib_mr_type mr_type,
u32 max_num_sg);
diff --git a/drivers/infiniband/hw/cxgb4/mem.c 
b/drivers/infiniband/hw/cxgb4/mem.c
index 86ec65721797..ada42ed425a0 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -945,51 +945,6 @@ int c4iw_map_mr_sg(struct ib_mr *ibmr,
return ib_sg_to_pages(ibmr, sg, sg_nents, c4iw_set_page);
 }
 
-struct ib_fast_reg_page_list *c4iw_alloc_fastreg_pbl(struct ib_device *device,
-int page_list_len)
-{
-   struct c4iw_fr_page_list *c4pl;
-   struct c4iw_dev *dev = to_c4iw_dev(device);
-   dma_addr_t dma_addr;
-   int pll_len = roundup(page_list_len * sizeof(u64), 32);
-
-   c4pl = kmalloc(sizeof(*c4pl), GFP_KERNEL);
-   if (!c4pl)
-   return ERR_PTR(-ENOMEM);
-
-   c4pl->ibpl.page_list = dma_alloc_coherent(&dev->rdev.lldi.pdev->dev,
- pll_len, &dma_addr,
- GFP_KERNEL);
-   if (!c4pl->ibpl.page_list) {
-   kfree(c4pl);
-   return ERR_PTR(-ENOMEM);
-   }
-   dma_unmap_addr_set(c4pl, mapping, dma_addr);
-   c4pl->dma_addr = dma_addr;
-   c4pl->dev = dev;
-   c4pl->pll_len = pll_len;
-
-   PDBG("%s c4pl %p pll_len %u page_list %p dma_addr %pad\n",
-__func__, c4pl, c4pl->pll_len, c4pl->ibpl.page_list,
-&c4pl->dma_addr);
-
-   return &c4pl->ibpl;
-}
-
-void c4iw_free_fastreg_pbl(struct ib_fast_reg_page_list *ibpl)
-{
-   struct c4iw_fr_page_list *c4pl = to_c4iw_fr_page_list(ibpl);
-
-   PDBG("%s c4pl %p pll_len %u page_list %p dma_addr %pad\n",
-__func__, c4pl, c4pl->pll_len, c4pl->ibpl.page_list,
-&c4pl->dma_addr);
-
-   dma_free_coherent(&c4pl->dev->rdev.lldi.pdev->dev,
- c4pl->pll_len,
- c4pl->ibpl.page_list, dma_unmap_addr(c4pl, mapping));
-   kfree(c4pl);
-}
-
 int c4iw_dereg_mr(struct ib_mr *ib_mr)
 {
struct c4iw_dev *rhp;
diff --git a/drivers/infiniband/hw/cxgb4/provider.c 
b/drivers/infiniband/hw/cxgb4/provider.c
index 55dedadcffaa..8f115b405d76 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -558,8 +558,6 @@ int c4iw_register_device(struct c4iw_dev *dev)
dev->ibdev.dealloc_mw = c4iw_dealloc_mw;
dev->ibdev.allo

[PATCH v1 02/24] IB/mlx5: Remove dead fmr code

2015-09-17 Thread Sagi Grimberg
Just function declarations - no need for those
lying around. If for some reason someone wants
FMR support in mlx5, it should be easy enough to restore
a few structs.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 25 -
 1 file changed, 25 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index ef4a47658f7a..210f99877b0b 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -364,20 +364,6 @@ enum {
MLX5_FMR_BUSY,
 };
 
-struct mlx5_ib_fmr {
-   struct ib_fmr   ibfmr;
-   struct mlx5_core_mr mr;
-   int access_flags;
-   int state;
-   /* protect fmr state
-*/
-   spinlock_t  lock;
-   u64 wrid;
-   struct ib_send_wr   wr[2];
-   u8  page_shift;
-   struct ib_fast_reg_page_listpage_list;
-};
-
 struct mlx5_cache_ent {
struct list_headhead;
/* sync access to the cahce entry
@@ -462,11 +448,6 @@ static inline struct mlx5_ib_dev *to_mdev(struct ib_device 
*ibdev)
return container_of(ibdev, struct mlx5_ib_dev, ib_dev);
 }
 
-static inline struct mlx5_ib_fmr *to_mfmr(struct ib_fmr *ibfmr)
-{
-   return container_of(ibfmr, struct mlx5_ib_fmr, ibfmr);
-}
-
 static inline struct mlx5_ib_cq *to_mcq(struct ib_cq *ibcq)
 {
return container_of(ibcq, struct mlx5_ib_cq, ibcq);
@@ -582,12 +563,6 @@ struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
 struct ib_fast_reg_page_list *mlx5_ib_alloc_fast_reg_page_list(struct 
ib_device *ibdev,
   int 
page_list_len);
 void mlx5_ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list);
-struct ib_fmr *mlx5_ib_fmr_alloc(struct ib_pd *pd, int acc,
-struct ib_fmr_attr *fmr_attr);
-int mlx5_ib_map_phys_fmr(struct ib_fmr *ibfmr, u64 *page_list,
- int npages, u64 iova);
-int mlx5_ib_unmap_fmr(struct list_head *fmr_list);
-int mlx5_ib_fmr_dealloc(struct ib_fmr *ibfmr);
 int mlx5_ib_process_mad(struct ib_device *ibdev, int mad_flags, u8 port_num,
const struct ib_wc *in_wc, const struct ib_grh *in_grh,
const struct ib_mad_hdr *in, size_t in_mad_size,
-- 
1.8.4.3



[PATCH v1 10/24] IB/iser: Port to new fast registration API

2015-09-17 Thread Sagi Grimberg
Remove fastreg page list allocation as the page vector
is now private to the provider. Instead of constructing
the page list and fast_reg work request, call ib_map_mr_sg
and construct ib_reg_wr.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/ulp/iser/iscsi_iser.h  |  8 ++---
 drivers/infiniband/ulp/iser/iser_memory.c | 53 ++-
 drivers/infiniband/ulp/iser/iser_verbs.c  | 16 +-
 3 files changed, 26 insertions(+), 51 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h 
b/drivers/infiniband/ulp/iser/iscsi_iser.h
index 2484bee993ec..271aa71e827c 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -297,7 +297,7 @@ struct iser_tx_desc {
u8   wr_idx;
union iser_wr {
struct ib_send_wr   send;
-   struct ib_fast_reg_wr   fast_reg;
+   struct ib_reg_wrfast_reg;
struct ib_sig_handover_wr   sig;
} wrs[ISER_MAX_WRS];
struct iser_mem_reg  data_reg;
@@ -412,7 +412,6 @@ struct iser_device {
  *
  * @mr: memory region
  * @fmr_pool:   pool of fmrs
- * @frpl:   fast reg page list used by frwrs
  * @page_vec:   fast reg page list used by fmr pool
  * @mr_valid:   is mr valid indicator
  */
@@ -421,10 +420,7 @@ struct iser_reg_resources {
struct ib_mr *mr;
struct ib_fmr_pool   *fmr_pool;
};
-   union {
-   struct ib_fast_reg_page_list *frpl;
-   struct iser_page_vec *page_vec;
-   };
+   struct iser_page_vec *page_vec;
u8mr_valid:1;
 };
 
diff --git a/drivers/infiniband/ulp/iser/iser_memory.c 
b/drivers/infiniband/ulp/iser/iser_memory.c
index b29fda3e8e74..d78eafb159b4 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -472,7 +472,7 @@ iser_reg_sig_mr(struct iscsi_iser_task *iser_task,
sig_reg->sge.addr = 0;
sig_reg->sge.length = scsi_transfer_length(iser_task->sc);
 
-   iser_dbg("sig reg: lkey: 0x%x, rkey: 0x%x, addr: 0x%llx, length: %u\n",
+   iser_dbg("lkey=0x%x rkey=0x%x addr=0x%llx length=%u\n",
 sig_reg->sge.lkey, sig_reg->rkey, sig_reg->sge.addr,
 sig_reg->sge.length);
 err:
@@ -484,47 +484,40 @@ static int iser_fast_reg_mr(struct iscsi_iser_task 
*iser_task,
struct iser_reg_resources *rsc,
struct iser_mem_reg *reg)
 {
-   struct ib_conn *ib_conn = &iser_task->iser_conn->ib_conn;
-   struct iser_device *device = ib_conn->device;
-   struct ib_mr *mr = rsc->mr;
-   struct ib_fast_reg_page_list *frpl = rsc->frpl;
struct iser_tx_desc *tx_desc = &iser_task->desc;
-   struct ib_fast_reg_wr *wr;
-   int offset, size, plen;
-
-   plen = iser_sg_to_page_vec(mem, device->ib_device, frpl->page_list,
-  &offset, &size);
-   if (plen * SIZE_4K < size) {
-   iser_err("fast reg page_list too short to hold this SG\n");
-   return -EINVAL;
-   }
+   struct ib_mr *mr = rsc->mr;
+   struct ib_reg_wr *wr;
+   int n;
 
if (!rsc->mr_valid)
iser_inv_rkey(iser_tx_next_wr(tx_desc), mr);
 
-   wr = fast_reg_wr(iser_tx_next_wr(tx_desc));
-   wr->wr.opcode = IB_WR_FAST_REG_MR;
+   n = ib_map_mr_sg(mr, mem->sg, mem->size, SIZE_4K);
+   if (unlikely(n != mem->size)) {
+   iser_err("failed to map sg (%d/%d)\n",
+n, mem->size);
+   return n < 0 ? n : -EINVAL;
+   }
+
+   wr = reg_wr(iser_tx_next_wr(tx_desc));
+   wr->wr.opcode = IB_WR_REG_MR;
wr->wr.wr_id = ISER_FASTREG_LI_WRID;
wr->wr.send_flags = 0;
-   wr->iova_start = frpl->page_list[0] + offset;
-   wr->page_list = frpl;
-   wr->page_list_len = plen;
-   wr->page_shift = SHIFT_4K;
-   wr->length = size;
-   wr->rkey = mr->rkey;
-   wr->access_flags = (IB_ACCESS_LOCAL_WRITE  |
-   IB_ACCESS_REMOTE_WRITE |
-   IB_ACCESS_REMOTE_READ);
+   wr->mr = mr;
+   wr->key = mr->rkey;
+   wr->access = IB_ACCESS_LOCAL_WRITE  |
+IB_ACCESS_REMOTE_WRITE |
+IB_ACCESS_REMOTE_READ;
+
rsc->mr_valid = 0;
 
reg->sge.lkey = mr->lkey;
reg->rkey = mr->rkey;
-   reg->sge.addr = frpl->page_list[0] + offset;
-   reg->sge.length = size;
+   reg->sge.addr = mr->iova;
+   reg->sge.length = mr->length;
 
-   iser_dbg("fast reg: lkey=0x%x, rkey=0x%x, addr=0x%llx,"
-" length=0x%x\n", reg->sge.lkey, reg->rkey,
-reg->sge.addr, reg->sge.length);
+   iser_dbg("lkey=0x%x rkey=0x%x addr=0x%llx length=0x%x\n"

[PATCH v1 22/24] RDMA/nes: Remove old FRWR API

2015-09-17 Thread Sagi Grimberg
No ULP uses it anymore, go ahead and remove it.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/nes/nes_hw.h|   6 --
 drivers/infiniband/hw/nes/nes_verbs.c | 162 +-
 2 files changed, 1 insertion(+), 167 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_hw.h 
b/drivers/infiniband/hw/nes/nes_hw.h
index d748e4b31b8d..c9080208aad2 100644
--- a/drivers/infiniband/hw/nes/nes_hw.h
+++ b/drivers/infiniband/hw/nes/nes_hw.h
@@ -1200,12 +1200,6 @@ struct nes_fast_mr_wqe_pbl {
dma_addr_t  paddr;
 };
 
-struct nes_ib_fast_reg_page_list {
-   struct ib_fast_reg_page_listibfrpl;
-   struct nes_fast_mr_wqe_pbl  nes_wqe_pbl;
-   u64 pbl;
-};
-
 struct nes_listener {
struct work_struct  work;
struct workqueue_struct *wq;
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c 
b/drivers/infiniband/hw/nes/nes_verbs.c
index ba069ec2ebf9..51a0a9cedcf4 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -486,76 +486,6 @@ static int nes_map_mr_sg(struct ib_mr *ibmr,
return ib_sg_to_pages(ibmr, sg, sg_nents, nes_set_page);
 }
 
-/*
- * nes_alloc_fast_reg_page_list
- */
-static struct ib_fast_reg_page_list *nes_alloc_fast_reg_page_list(
-   struct ib_device *ibdev,
-   int page_list_len)
-{
-   struct nes_vnic *nesvnic = to_nesvnic(ibdev);
-   struct nes_device *nesdev = nesvnic->nesdev;
-   struct ib_fast_reg_page_list *pifrpl;
-   struct nes_ib_fast_reg_page_list *pnesfrpl;
-
-   if (page_list_len > (NES_4K_PBL_CHUNK_SIZE / sizeof(u64)))
-   return ERR_PTR(-E2BIG);
-   /*
-* Allocate the ib_fast_reg_page_list structure, the
-* nes_fast_bpl structure, and the PLB table.
-*/
-   pnesfrpl = kmalloc(sizeof(struct nes_ib_fast_reg_page_list) +
-  page_list_len * sizeof(u64), GFP_KERNEL);
-
-   if (!pnesfrpl)
-   return ERR_PTR(-ENOMEM);
-
-   pifrpl = &pnesfrpl->ibfrpl;
-   pifrpl->page_list = &pnesfrpl->pbl;
-   pifrpl->max_page_list_len = page_list_len;
-   /*
-* Allocate the WQE PBL
-*/
-   pnesfrpl->nes_wqe_pbl.kva = pci_alloc_consistent(nesdev->pcidev,
-page_list_len * sizeof(u64),
-&pnesfrpl->nes_wqe_pbl.paddr);
-
-   if (!pnesfrpl->nes_wqe_pbl.kva) {
-   kfree(pnesfrpl);
-   return ERR_PTR(-ENOMEM);
-   }
-   nes_debug(NES_DBG_MR, "nes_alloc_fast_reg_pbl: nes_frpl = %p, "
- "ibfrpl = %p, ibfrpl.page_list = %p, pbl.kva = %p, "
- "pbl.paddr = %llx\n", pnesfrpl, &pnesfrpl->ibfrpl,
- pnesfrpl->ibfrpl.page_list, pnesfrpl->nes_wqe_pbl.kva,
- (unsigned long long) pnesfrpl->nes_wqe_pbl.paddr);
-
-   return pifrpl;
-}
-
-/*
- * nes_free_fast_reg_page_list
- */
-static void nes_free_fast_reg_page_list(struct ib_fast_reg_page_list *pifrpl)
-{
-   struct nes_vnic *nesvnic = to_nesvnic(pifrpl->device);
-   struct nes_device *nesdev = nesvnic->nesdev;
-   struct nes_ib_fast_reg_page_list *pnesfrpl;
-
-   pnesfrpl = container_of(pifrpl, struct nes_ib_fast_reg_page_list, ibfrpl);
-   /*
-* Free the WQE PBL.
-*/
-   pci_free_consistent(nesdev->pcidev,
-   pifrpl->max_page_list_len * sizeof(u64),
-   pnesfrpl->nes_wqe_pbl.kva,
-   pnesfrpl->nes_wqe_pbl.paddr);
-   /*
-* Free the PBL structure
-*/
-   kfree(pnesfrpl);
-}
-
 /**
  * nes_query_device
  */
@@ -3470,94 +3400,6 @@ static int nes_post_send(struct ib_qp *ibqp, struct 
ib_send_wr *ib_wr,

NES_IWARP_SQ_LOCINV_WQE_INV_STAG_IDX,
ib_wr->ex.invalidate_rkey);
break;
-   case IB_WR_FAST_REG_MR:
-   {
-   int i;
-   struct ib_fast_reg_wr *fwr = fast_reg_wr(ib_wr);
-   int flags = fwr->access_flags;
-   struct nes_ib_fast_reg_page_list *pnesfrpl =
-   container_of(fwr->page_list,
-struct nes_ib_fast_reg_page_list,
-ibfrpl);
-   u64 *src_page_list = pnesfrpl->ibfrpl.page_list;
-   u64 *dst_page_list = pnesfrpl->nes_wqe_pbl.kva;
-
-   if (fwr->page_list_len >
-   (NES_4K_PBL_CHUNK_SIZE / sizeof(u64))) {
-   nes_debug(NES_DBG_IW_TX, "SQ_FMR: bad page_list_len\n");
-   e

[PATCH v1 07/24] iw_cxgb4: Support the new memory registration API

2015-09-17 Thread Sagi Grimberg
Support the new memory registration API by allocating a
private page list array in c4iw_mr and populating it when
c4iw_map_mr_sg is invoked. Also, support IB_WR_REG_MR
by duplicating build_fastreg, just taking the needed information
from different places:
- page_size, iova, length (ib_mr)
- page array (c4iw_mr)
- key, access flags (ib_reg_wr)

The IB_WR_FAST_REG_MR handlers will be removed later, once
all the ULPs have been converted.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |  7 
 drivers/infiniband/hw/cxgb4/mem.c  | 38 +
 drivers/infiniband/hw/cxgb4/provider.c |  1 +
 drivers/infiniband/hw/cxgb4/qp.c   | 75 ++
 4 files changed, 121 insertions(+)

diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h 
b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index c7bb38c931a5..032f90aa8ac9 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -386,6 +386,10 @@ struct c4iw_mr {
struct c4iw_dev *rhp;
u64 kva;
struct tpt_attributes attr;
+   u64 *mpl;
+   dma_addr_t mpl_addr;
+   u32 max_mpl_len;
+   u32 mpl_len;
 };
 
 static inline struct c4iw_mr *to_c4iw_mr(struct ib_mr *ibmr)
@@ -973,6 +977,9 @@ struct ib_fast_reg_page_list *c4iw_alloc_fastreg_pbl(
 struct ib_mr *c4iw_alloc_mr(struct ib_pd *pd,
enum ib_mr_type mr_type,
u32 max_num_sg);
+int c4iw_map_mr_sg(struct ib_mr *ibmr,
+  struct scatterlist *sg,
+  unsigned int sg_nents);
 int c4iw_dealloc_mw(struct ib_mw *mw);
 struct ib_mw *c4iw_alloc_mw(struct ib_pd *pd, enum ib_mw_type type);
 struct ib_mr *c4iw_reg_user_mr(struct ib_pd *pd, u64 start,
diff --git a/drivers/infiniband/hw/cxgb4/mem.c 
b/drivers/infiniband/hw/cxgb4/mem.c
index 026b91ebd5e2..86ec65721797 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -863,6 +863,7 @@ struct ib_mr *c4iw_alloc_mr(struct ib_pd *pd,
u32 mmid;
u32 stag = 0;
int ret = 0;
+   int length = roundup(max_num_sg * sizeof(u64), 32);
 
if (mr_type != IB_MR_TYPE_MEM_REG ||
max_num_sg > t4_max_fr_depth(use_dsgl))
@@ -876,6 +877,14 @@ struct ib_mr *c4iw_alloc_mr(struct ib_pd *pd,
goto err;
}
 
+   mhp->mpl = dma_alloc_coherent(&rhp->rdev.lldi.pdev->dev,
+ length, &mhp->mpl_addr, GFP_KERNEL);
+   if (!mhp->mpl) {
+   ret = -ENOMEM;
+   goto err_mpl;
+   }
+   mhp->max_mpl_len = length;
+
mhp->rhp = rhp;
ret = alloc_pbl(mhp, max_num_sg);
if (ret)
@@ -905,11 +914,37 @@ err2:
c4iw_pblpool_free(&mhp->rhp->rdev, mhp->attr.pbl_addr,
  mhp->attr.pbl_size << 3);
 err1:
+   dma_free_coherent(&mhp->rhp->rdev.lldi.pdev->dev,
+ mhp->max_mpl_len, mhp->mpl, mhp->mpl_addr);
+err_mpl:
kfree(mhp);
 err:
return ERR_PTR(ret);
 }
 
+static int c4iw_set_page(struct ib_mr *ibmr, u64 addr)
+{
+   struct c4iw_mr *mhp = to_c4iw_mr(ibmr);
+
+   if (unlikely(mhp->mpl_len == mhp->max_mpl_len))
+   return -ENOMEM;
+
+   mhp->mpl[mhp->mpl_len++] = addr;
+
+   return 0;
+}
+
+int c4iw_map_mr_sg(struct ib_mr *ibmr,
+  struct scatterlist *sg,
+  unsigned int sg_nents)
+{
+   struct c4iw_mr *mhp = to_c4iw_mr(ibmr);
+
+   mhp->mpl_len = 0;
+
+   return ib_sg_to_pages(ibmr, sg, sg_nents, c4iw_set_page);
+}
+
 struct ib_fast_reg_page_list *c4iw_alloc_fastreg_pbl(struct ib_device *device,
 int page_list_len)
 {
@@ -970,6 +1005,9 @@ int c4iw_dereg_mr(struct ib_mr *ib_mr)
rhp = mhp->rhp;
mmid = mhp->attr.stag >> 8;
remove_handle(rhp, &rhp->mmidr, mmid);
+   if (mhp->mpl)
+   dma_free_coherent(&mhp->rhp->rdev.lldi.pdev->dev,
+ mhp->max_mpl_len, mhp->mpl, mhp->mpl_addr);
dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size,
   mhp->attr.pbl_addr);
if (mhp->attr.pbl_size)
diff --git a/drivers/infiniband/hw/cxgb4/provider.c 
b/drivers/infiniband/hw/cxgb4/provider.c
index 7746113552e7..55dedadcffaa 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -557,6 +557,7 @@ int c4iw_register_device(struct c4iw_dev *dev)
dev->ibdev.bind_mw = c4iw_bind_mw;
dev->ibdev.dealloc_mw = c4iw_dealloc_mw;
dev->ibdev.alloc_mr = c4iw_alloc_mr;
+   dev->ibdev.map_mr_sg = c4iw_map_mr_sg;
dev->ibdev.alloc_fast_reg_page_list = c4iw_alloc_fastreg_pbl;
dev->ibdev.free_fast_reg_page_list = c4iw_free_fastreg_pbl;
dev->ibdev.attach_mcast = c4iw_multicast_attach;
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index b6

[PATCH v1 19/24] RDMA/cxgb3: Remove old FRWR API

2015-09-17 Thread Sagi Grimberg
No ULP uses it anymore, go ahead and remove it.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/cxgb3/iwch_cq.c   |  2 +-
 drivers/infiniband/hw/cxgb3/iwch_provider.c | 24 ---
 drivers/infiniband/hw/cxgb3/iwch_qp.c   | 47 -
 3 files changed, 1 insertion(+), 72 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_cq.c 
b/drivers/infiniband/hw/cxgb3/iwch_cq.c
index cf5474ae68ff..cfe404925a39 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cq.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cq.c
@@ -123,7 +123,7 @@ static int iwch_poll_cq_one(struct iwch_dev *rhp, struct 
iwch_cq *chp,
wc->opcode = IB_WC_LOCAL_INV;
break;
case T3_FAST_REGISTER:
-   wc->opcode = IB_WC_FAST_REG_MR;
+   wc->opcode = IB_WC_REG_MR;
break;
default:
printk(KERN_ERR MOD "Unexpected opcode %d "
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c 
b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index ee3d5ca7de6c..99ae2ab14b9e 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -884,28 +884,6 @@ static int iwch_map_mr_sg(struct ib_mr *ibmr,
return ib_sg_to_pages(ibmr, sg, sg_nents, iwch_set_page);
 }
 
-static struct ib_fast_reg_page_list *iwch_alloc_fastreg_pbl(
-   struct ib_device *device,
-   int page_list_len)
-{
-   struct ib_fast_reg_page_list *page_list;
-
-   page_list = kmalloc(sizeof *page_list + page_list_len * sizeof(u64),
-   GFP_KERNEL);
-   if (!page_list)
-   return ERR_PTR(-ENOMEM);
-
-   page_list->page_list = (u64 *)(page_list + 1);
-   page_list->max_page_list_len = page_list_len;
-
-   return page_list;
-}
-
-static void iwch_free_fastreg_pbl(struct ib_fast_reg_page_list *page_list)
-{
-   kfree(page_list);
-}
-
 static int iwch_destroy_qp(struct ib_qp *ib_qp)
 {
struct iwch_dev *rhp;
@@ -1483,8 +1461,6 @@ int iwch_register_device(struct iwch_dev *dev)
dev->ibdev.dealloc_mw = iwch_dealloc_mw;
dev->ibdev.alloc_mr = iwch_alloc_mr;
dev->ibdev.map_mr_sg = iwch_map_mr_sg;
-   dev->ibdev.alloc_fast_reg_page_list = iwch_alloc_fastreg_pbl;
-   dev->ibdev.free_fast_reg_page_list = iwch_free_fastreg_pbl;
dev->ibdev.attach_mcast = iwch_multicast_attach;
dev->ibdev.detach_mcast = iwch_multicast_detach;
dev->ibdev.process_mad = iwch_process_mad;
diff --git a/drivers/infiniband/hw/cxgb3/iwch_qp.c 
b/drivers/infiniband/hw/cxgb3/iwch_qp.c
index a09ea538e990..d0548fc6395e 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_qp.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_qp.c
@@ -189,48 +189,6 @@ static int build_memreg(union t3_wr *wqe, struct ib_reg_wr 
*wr,
return 0;
 }
 
-static int build_fastreg(union t3_wr *wqe, struct ib_send_wr *send_wr,
-   u8 *flit_cnt, int *wr_cnt, struct t3_wq *wq)
-{
-   struct ib_fast_reg_wr *wr = fast_reg_wr(send_wr);
-   int i;
-   __be64 *p;
-
-   if (wr->page_list_len > T3_MAX_FASTREG_DEPTH)
-   return -EINVAL;
-   *wr_cnt = 1;
-   wqe->fastreg.stag = cpu_to_be32(wr->rkey);
-   wqe->fastreg.len = cpu_to_be32(wr->length);
-   wqe->fastreg.va_base_hi = cpu_to_be32(wr->iova_start >> 32);
-   wqe->fastreg.va_base_lo_fbo = cpu_to_be32(wr->iova_start & 0x);
-   wqe->fastreg.page_type_perms = cpu_to_be32(
-   V_FR_PAGE_COUNT(wr->page_list_len) |
-   V_FR_PAGE_SIZE(wr->page_shift-12) |
-   V_FR_TYPE(TPT_VATO) |
-   V_FR_PERMS(iwch_ib_to_tpt_access(wr->access_flags)));
-   p = &wqe->fastreg.pbl_addrs[0];
-   for (i = 0; i < wr->page_list_len; i++, p++) {
-
-   /* If we need a 2nd WR, then set it up */
-   if (i == T3_MAX_FASTREG_FRAG) {
-   *wr_cnt = 2;
-   wqe = (union t3_wr *)(wq->queue +
-   Q_PTR2IDX((wq->wptr+1), wq->size_log2));
-   build_fw_riwrh((void *)wqe, T3_WR_FASTREG, 0,
-  Q_GENBIT(wq->wptr + 1, wq->size_log2),
-  0, 1 + wr->page_list_len - T3_MAX_FASTREG_FRAG,
-  T3_EOP);
-
-   p = &wqe->pbl_frag.pbl_addrs[0];
-   }
-   *p = cpu_to_be64((u64)wr->page_list->page_list[i]);
-   }
-   *flit_cnt = 5 + wr->page_list_len;
-   if (*flit_cnt > 15)
-   *flit_cnt = 15;
-   return 0;
-}
-
 static int build_inv_stag(union t3_wr *wqe, struct ib_send_wr *wr,
u8 *flit_cnt)
 {
@@ -457,11 +415,6 @@ int iwch_post_send(struct ib_qp *ibqp, struct ib_send_wr 
*wr,
if (!qhp-

[PATCH v1 08/24] IB/qib: Support the new memory registration API

2015-09-17 Thread Sagi Grimberg
Support the new memory registration API by allocating a
private page list array in qib_mr and populating it when
qib_map_mr_sg is invoked. Also, support IB_WR_REG_MR
by duplicating qib_fastreg_mr, just taking the needed information
from different places:
- page_size, iova, length (ib_mr)
- page array (qib_mr)
- key, access flags (ib_reg_wr)

The IB_WR_FAST_REG_MR handlers will be removed later, once
all the ULPs have been converted.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/qib/qib_keys.c  | 56 +++
 drivers/infiniband/hw/qib/qib_mr.c| 32 
 drivers/infiniband/hw/qib/qib_verbs.c |  9 +-
 drivers/infiniband/hw/qib/qib_verbs.h |  8 +
 4 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/qib/qib_keys.c 
b/drivers/infiniband/hw/qib/qib_keys.c
index eaf139a33b2e..a5057efc7faf 100644
--- a/drivers/infiniband/hw/qib/qib_keys.c
+++ b/drivers/infiniband/hw/qib/qib_keys.c
@@ -390,3 +390,59 @@ bail:
spin_unlock_irqrestore(&rkt->lock, flags);
return ret;
 }
+
+/*
+ * Initialize the memory region specified by the work request.
+ */
+int qib_reg_mr(struct qib_qp *qp, struct ib_reg_wr *wr)
+{
+   struct qib_lkey_table *rkt = &to_idev(qp->ibqp.device)->lk_table;
+   struct qib_pd *pd = to_ipd(qp->ibqp.pd);
+   struct qib_mr *mr = to_imr(wr->mr);
+   struct qib_mregion *mrg;
+   u32 key = wr->key;
+   unsigned i, n, m;
+   int ret = -EINVAL;
+   unsigned long flags;
+   u64 *page_list;
+   size_t ps;
+
+   spin_lock_irqsave(&rkt->lock, flags);
+   if (pd->user || key == 0)
+   goto bail;
+
+   mrg = rcu_dereference_protected(
+   rkt->table[(key >> (32 - ib_qib_lkey_table_size))],
+   lockdep_is_held(&rkt->lock));
+   if (unlikely(mrg == NULL || qp->ibqp.pd != mrg->pd))
+   goto bail;
+
+   if (mr->npages > mrg->max_segs)
+   goto bail;
+
+   ps = mr->ibmr.page_size;
+   if (mr->ibmr.length > ps * mr->npages)
+   goto bail;
+
+   mrg->user_base = mr->ibmr.iova;
+   mrg->iova = mr->ibmr.iova;
+   mrg->lkey = key;
+   mrg->length = mr->ibmr.length;
+   mrg->access_flags = wr->access;
+   page_list = mr->pages;
+   m = 0;
+   n = 0;
+   for (i = 0; i < mr->npages; i++) {
+   mrg->map[m]->segs[n].vaddr = (void *) page_list[i];
+   mrg->map[m]->segs[n].length = ps;
+   if (++n == QIB_SEGSZ) {
+   m++;
+   n = 0;
+   }
+   }
+
+   ret = 0;
+bail:
+   spin_unlock_irqrestore(&rkt->lock, flags);
+   return ret;
+}
diff --git a/drivers/infiniband/hw/qib/qib_mr.c 
b/drivers/infiniband/hw/qib/qib_mr.c
index 19220dcb9a3b..0fa4b0de8074 100644
--- a/drivers/infiniband/hw/qib/qib_mr.c
+++ b/drivers/infiniband/hw/qib/qib_mr.c
@@ -303,6 +303,7 @@ int qib_dereg_mr(struct ib_mr *ibmr)
int ret = 0;
unsigned long timeout;
 
+   kfree(mr->pages);
qib_free_lkey(&mr->mr);
 
qib_put_mr(&mr->mr); /* will set completion if last */
@@ -340,7 +341,38 @@ struct ib_mr *qib_alloc_mr(struct ib_pd *pd,
if (IS_ERR(mr))
return (struct ib_mr *)mr;
 
+   mr->pages = kcalloc(max_num_sg, sizeof(u64), GFP_KERNEL);
+   if (!mr->pages)
+   goto err;
+
return &mr->ibmr;
+
+err:
+   qib_dereg_mr(&mr->ibmr);
+   return ERR_PTR(-ENOMEM);
+}
+
+static int qib_set_page(struct ib_mr *ibmr, u64 addr)
+{
+   struct qib_mr *mr = to_imr(ibmr);
+
+   if (unlikely(mr->npages == mr->mr.max_segs))
+   return -ENOMEM;
+
+   mr->pages[mr->npages++] = addr;
+
+   return 0;
+}
+
+int qib_map_mr_sg(struct ib_mr *ibmr,
+ struct scatterlist *sg,
+ unsigned int sg_nents)
+{
+   struct qib_mr *mr = to_imr(ibmr);
+
+   mr->npages = 0;
+
+   return ib_sg_to_pages(ibmr, sg, sg_nents, qib_set_page);
 }
 
 struct ib_fast_reg_page_list *
diff --git a/drivers/infiniband/hw/qib/qib_verbs.c 
b/drivers/infiniband/hw/qib/qib_verbs.c
index a6b0b098ff30..a1e53d7b662b 100644
--- a/drivers/infiniband/hw/qib/qib_verbs.c
+++ b/drivers/infiniband/hw/qib/qib_verbs.c
@@ -362,7 +362,10 @@ static int qib_post_one_send(struct qib_qp *qp, struct 
ib_send_wr *wr,
 * undefined operations.
 * Make sure buffer is large enough to hold the result for atomics.
 */
-   if (wr->opcode == IB_WR_FAST_REG_MR) {
+   if (wr->opcode == IB_WR_REG_MR) {
+   if (qib_reg_mr(qp, reg_wr(wr)))
+   goto bail_inval;
+   } else if (wr->opcode == IB_WR_FAST_REG_MR) {
if (qib_fast_reg_mr(qp, wr))
goto bail_inval;
} else if (qp->ibqp.qp_type == IB_QPT_UC) {
@@ -401,6 +404,9 @@ static int qib_post_one_send(struct qib_qp *qp, struct 
ib_send_wr *wr,
if (qp->ibqp.qp

[PATCH v1 03/24] IB/mlx5: Support the new memory registration API

2015-09-17 Thread Sagi Grimberg
Support the new memory registration API by allocating a
private page list array in mlx5_ib_mr and populating it when
mlx5_ib_map_mr_sg is invoked. Also, support IB_WR_REG_MR
by building the exact same WQE as IB_WR_FAST_REG_MR, just taking the
needed information from different places:
- page_size, iova, length, access flags (ib_mr)
- page array (mlx5_ib_mr)
- key (ib_reg_wr)

The IB_WR_FAST_REG_MR handlers will be removed later, once
all the ULPs have been converted.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/mlx5/cq.c  |  3 ++
 drivers/infiniband/hw/mlx5/main.c|  1 +
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  8 
 drivers/infiniband/hw/mlx5/mr.c  | 65 
 drivers/infiniband/hw/mlx5/qp.c  | 83 
 5 files changed, 160 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 5c9eeea62805..90daf791d51d 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -108,6 +108,9 @@ static enum ib_wc_opcode get_umr_comp(struct mlx5_ib_wq 
*wq, int idx)
case IB_WR_LOCAL_INV:
return IB_WC_LOCAL_INV;
 
+   case IB_WR_REG_MR:
+   return IB_WC_REG_MR;
+
case IB_WR_FAST_REG_MR:
return IB_WC_FAST_REG_MR;
 
diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 276d7824be8a..7ebce545daf1 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1432,6 +1432,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
dev->ib_dev.detach_mcast= mlx5_ib_mcg_detach;
dev->ib_dev.process_mad = mlx5_ib_process_mad;
dev->ib_dev.alloc_mr= mlx5_ib_alloc_mr;
+   dev->ib_dev.map_mr_sg   = mlx5_ib_map_mr_sg;
dev->ib_dev.alloc_fast_reg_page_list = mlx5_ib_alloc_fast_reg_page_list;
dev->ib_dev.free_fast_reg_page_list  = mlx5_ib_free_fast_reg_page_list;
dev->ib_dev.check_mr_status = mlx5_ib_check_mr_status;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 210f99877b0b..bc1853f8e67d 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -319,6 +319,11 @@ enum mlx5_ib_mtt_access_flags {
 
 struct mlx5_ib_mr {
struct ib_mribmr;
+   void*descs;
+   dma_addr_t  desc_map;
+   int ndescs;
+   int max_descs;
+   int desc_size;
struct mlx5_core_mr mmr;
struct ib_umem *umem;
struct mlx5_shared_mr_info  *smr_info;
@@ -560,6 +565,9 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr);
 struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
   enum ib_mr_type mr_type,
   u32 max_num_sg);
+int mlx5_ib_map_mr_sg(struct ib_mr *ibmr,
+ struct scatterlist *sg,
+ unsigned int sg_nents);
 struct ib_fast_reg_page_list *mlx5_ib_alloc_fast_reg_page_list(struct 
ib_device *ibdev,
   int 
page_list_len);
 void mlx5_ib_free_fast_reg_page_list(struct ib_fast_reg_page_list *page_list);
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 6d8aac0c1748..2f3b648719da 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1165,6 +1165,35 @@ error:
return err;
 }
 
+static int
+mlx5_alloc_priv_descs(struct ib_device *device,
+ struct mlx5_ib_mr *mr,
+ int ndescs,
+ int desc_size)
+{
+   int size = ndescs * desc_size;
+
+   mr->descs = dma_alloc_coherent(device->dma_device, size,
+  &mr->desc_map, GFP_KERNEL);
+   if (!mr->descs)
+   return -ENOMEM;
+
+   return 0;
+}
+
+static void
+mlx5_free_priv_descs(struct mlx5_ib_mr *mr)
+{
+   struct ib_device *device = mr->ibmr.device;
+   int size = mr->max_descs * mr->desc_size;
+
+   if (mr->descs) {
+   dma_free_coherent(device->dma_device, size,
+ mr->descs, mr->desc_map);
+   mr->descs = NULL;
+   }
+}
+
 static int clean_mr(struct mlx5_ib_mr *mr)
 {
struct mlx5_ib_dev *dev = to_mdev(mr->ibmr.device);
@@ -1184,6 +1213,8 @@ static int clean_mr(struct mlx5_ib_mr *mr)
mr->sig = NULL;
}
 
+   mlx5_free_priv_descs(mr);
+
if (!umred) {
err = destroy_mkey(dev, mr);
if (err) {
@@ -1273,6 +1304,14 @@ struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
if (mr_type == IB_MR_TYPE_MEM_REG) {
access_mode = MLX5_ACCESS_MODE_MTT;
in->seg.log2_page_size = PAGE_SHIFT;
+
+   err = mlx5_alloc_priv_descs

[PATCH v1 14/24] RDS/IW: Convert to new memory registration API

2015-09-17 Thread Sagi Grimberg
Get rid of fast_reg page list and its construction.
Instead, just pass the RDS sg list to ib_map_mr_sg
and post the new ib_reg_wr.

This is done both for server IW RDMA_READ registration
and the client remote key registration.

Signed-off-by: Sagi Grimberg 
---
 net/rds/iw.h  |   5 +--
 net/rds/iw_rdma.c | 118 ++
 net/rds/iw_send.c |  57 +-
 3 files changed, 70 insertions(+), 110 deletions(-)

diff --git a/net/rds/iw.h b/net/rds/iw.h
index fe858e5dd8d1..5af01d1758b3 100644
--- a/net/rds/iw.h
+++ b/net/rds/iw.h
@@ -74,13 +74,12 @@ struct rds_iw_send_work {
struct rm_rdma_op   *s_op;
struct rds_iw_mapping   *s_mapping;
struct ib_mr*s_mr;
-   struct ib_fast_reg_page_list *s_page_list;
unsigned char   s_remap_count;
 
union {
struct ib_send_wr   s_send_wr;
struct ib_rdma_wr   s_rdma_wr;
-   struct ib_fast_reg_wr   s_fast_reg_wr;
+   struct ib_reg_wrs_reg_wr;
};
struct ib_sge   s_sge[RDS_IW_MAX_SGE];
unsigned long   s_queued;
@@ -199,7 +198,7 @@ struct rds_iw_device {
 
 /* Magic WR_ID for ACKs */
 #define RDS_IW_ACK_WR_ID   ((u64)0xULL)
-#define RDS_IW_FAST_REG_WR_ID  ((u64)0xefefefefefefefefULL)
+#define RDS_IW_REG_WR_ID   ((u64)0xefefefefefefefefULL)
 #define RDS_IW_LOCAL_INV_WR_ID ((u64)0xdfdfdfdfdfdfdfdfULL)
 
 struct rds_iw_statistics {
diff --git a/net/rds/iw_rdma.c b/net/rds/iw_rdma.c
index f8a612cc69e6..8c4e42759514 100644
--- a/net/rds/iw_rdma.c
+++ b/net/rds/iw_rdma.c
@@ -47,7 +47,6 @@ struct rds_iw_mr {
struct rdma_cm_id   *cm_id;
 
struct ib_mr*mr;
-   struct ib_fast_reg_page_list *page_list;
 
struct rds_iw_mapping   mapping;
unsigned char   remap_count;
@@ -77,8 +76,8 @@ struct rds_iw_mr_pool {
 
 static int rds_iw_flush_mr_pool(struct rds_iw_mr_pool *pool, int free_all);
 static void rds_iw_mr_pool_flush_worker(struct work_struct *work);
-static int rds_iw_init_fastreg(struct rds_iw_mr_pool *pool, struct rds_iw_mr 
*ibmr);
-static int rds_iw_map_fastreg(struct rds_iw_mr_pool *pool,
+static int rds_iw_init_reg(struct rds_iw_mr_pool *pool, struct rds_iw_mr 
*ibmr);
+static int rds_iw_map_reg(struct rds_iw_mr_pool *pool,
  struct rds_iw_mr *ibmr,
  struct scatterlist *sg, unsigned int nents);
 static void rds_iw_free_fastreg(struct rds_iw_mr_pool *pool, struct rds_iw_mr 
*ibmr);
@@ -258,19 +257,18 @@ static void rds_iw_set_scatterlist(struct 
rds_iw_scatterlist *sg,
sg->bytes = 0;
 }
 
-static u64 *rds_iw_map_scatterlist(struct rds_iw_device *rds_iwdev,
+static int rds_iw_map_scatterlist(struct rds_iw_device *rds_iwdev,
struct rds_iw_scatterlist *sg)
 {
struct ib_device *dev = rds_iwdev->dev;
-   u64 *dma_pages = NULL;
-   int i, j, ret;
+   int i, ret;
 
WARN_ON(sg->dma_len);
 
sg->dma_len = ib_dma_map_sg(dev, sg->list, sg->len, DMA_BIDIRECTIONAL);
if (unlikely(!sg->dma_len)) {
printk(KERN_WARNING "RDS/IW: dma_map_sg failed!\n");
-   return ERR_PTR(-EBUSY);
+   return -EBUSY;
}
 
sg->bytes = 0;
@@ -303,31 +301,14 @@ static u64 *rds_iw_map_scatterlist(struct rds_iw_device 
*rds_iwdev,
if (sg->dma_npages > fastreg_message_size)
goto out_unmap;
 
-   dma_pages = kmalloc(sizeof(u64) * sg->dma_npages, GFP_ATOMIC);
-   if (!dma_pages) {
-   ret = -ENOMEM;
-   goto out_unmap;
-   }
-
-   for (i = j = 0; i < sg->dma_len; ++i) {
-   unsigned int dma_len = ib_sg_dma_len(dev, &sg->list[i]);
-   u64 dma_addr = ib_sg_dma_address(dev, &sg->list[i]);
-   u64 end_addr;
 
-   end_addr = dma_addr + dma_len;
-   dma_addr &= ~PAGE_MASK;
-   for (; dma_addr < end_addr; dma_addr += PAGE_SIZE)
-   dma_pages[j++] = dma_addr;
-   BUG_ON(j > sg->dma_npages);
-   }
 
-   return dma_pages;
+   return 0;
 
 out_unmap:
ib_dma_unmap_sg(rds_iwdev->dev, sg->list, sg->len, DMA_BIDIRECTIONAL);
sg->dma_len = 0;
-   kfree(dma_pages);
-   return ERR_PTR(ret);
+   return ret;
 }
 
 
@@ -440,7 +421,7 @@ static struct rds_iw_mr *rds_iw_alloc_mr(struct 
rds_iw_device *rds_iwdev)
INIT_LIST_HEAD(&ibmr->mapping.m_list);
ibmr->mapping.m_mr = ibmr;
 
-   err = rds_iw_init_fastreg(pool, ibmr);
+   err = rds_iw_init_reg(pool, ibmr);
if (err)
goto out_no_cigar;
 
@@ -622,7 +603,7 @@ void *rds_iw_get_mr(struct scatterlist *sg, unsigned long 
nents,
ibmr->cm_id = cm_id;
ibmr->device = rds_iwdev;
 
-   ret = rds_iw_map_fastreg(rds_iwdev->mr_pool, ibmr, sg, nents);
+   

[PATCH v1 00/24] New fast registration API

2015-09-17 Thread Sagi Grimberg
Hi all,

As discussed on the linux-rdma list, there is plenty of room for
improvement in our memory registration APIs. We keep finding
ULPs that duplicate code, sometimes use the wrong strategies,
and misuse our current API.

As a first step, this patch set replaces the fast registration API
with one that accepts a common kernel struct scatterlist and takes
care of page vector construction in the core layer, with hooks for
the drivers' HW-specific assignments. This removes code that was
duplicated in each and every ULP driver.

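To make the ULP side of the change concrete, the registration flow ends up
looking roughly like the sketch below. This is only an illustration modeled
on the iser conversion in this set; ulp_fast_reg and the qp/mr/sg/sg_nents
parameters, the signaling choice and the access flags are placeholders for
whatever the ULP already has, and completion handling is omitted:

static int ulp_fast_reg(struct ib_qp *qp, struct ib_mr *mr,
                        struct scatterlist *sg, unsigned int sg_nents)
{
        struct ib_reg_wr reg_wr;
        struct ib_send_wr *bad_wr;
        int n;

        /* The core walks the SG list and calls the driver's set_page
         * hook; no ULP-built page vector is needed. The last argument
         * is the desired HW page size.
         */
        n = ib_map_mr_sg(mr, sg, sg_nents, PAGE_SIZE);
        if (unlikely(n != sg_nents))
                return n < 0 ? n : -EINVAL;

        /* mr->iova and mr->length were filled in by ib_map_mr_sg */
        memset(&reg_wr, 0, sizeof(reg_wr));
        reg_wr.wr.opcode = IB_WR_REG_MR;
        reg_wr.wr.send_flags = IB_SEND_SIGNALED;
        reg_wr.mr = mr;
        reg_wr.key = mr->rkey;
        reg_wr.access = IB_ACCESS_LOCAL_WRITE |
                        IB_ACCESS_REMOTE_READ |
                        IB_ACCESS_REMOTE_WRITE;

        return ib_post_send(qp, &reg_wr.wr, &bad_wr);
}
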
The changes from v0 (WIP) are:
- Rebased on top of 4.3-rc1 + Christoph's ib_send_wr conversion patches

- Allow the ULP to pass a page_size argument to ib_map_mr_sg in order
  to have it work better in some specific workloads. This suggestion
  came from Bart Van Assche, who pointed out that some applications
  might use page sizes significantly smaller than the system PAGE_SIZE
  of specific architectures

- Fixed some logical bugs in ib_sg_to_pages

- Added a set_page function pointer for drivers to pass to ib_sg_to_pages
  so some drivers (e.g. mlx4, mlx5, nes) can avoid keeping a second page
  vector and/or re-iterating over the page vector in order to perform
  HW-specific assignments (big/little endian conversion, extra flags);
  a minimal sketch of these driver hooks appears after this list

- Converted SRP initiator and RDS iwarp ULPs to the new API

- Removed fast registration code from hfi1 driver (as it isn't supported
  anyway). I assume that the correct place to get the support back would
  be in a shared SW library (hfi1, qib, rxe).

- Updated the change logs

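For reference, the driver-side hooks that the core calls are small as well.
Below is a minimal sketch modeled on the nes/cxgb4/qib conversions in this
set; my_mr, to_my_mr, my_set_page and my_map_mr_sg are hypothetical names
standing in for each driver's own MR container and callbacks:

static int my_set_page(struct ib_mr *ibmr, u64 addr)
{
        struct my_mr *mr = to_my_mr(ibmr);

        if (unlikely(mr->npages == mr->max_pages))
                return -ENOMEM;

        /* Any HW-specific assignment (endianness, extra flags) goes here */
        mr->pages[mr->npages++] = addr;

        return 0;
}

static int my_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
                        unsigned int sg_nents)
{
        struct my_mr *mr = to_my_mr(ibmr);

        mr->npages = 0;

        /* ib_sg_to_pages() splits the SG list into HW pages and calls
         * my_set_page() once per page.
         */
        return ib_sg_to_pages(ibmr, sg, sg_nents, my_set_page);
}
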
So far my tests covered:
- ULPs:
* iser initiator
* iser target
* xprtrdma
* svcrdma
- Drivers:
* mlx4
* mlx5
* Steve Wise was kind enough to run NFS client/server over cxgb4 and I
  have yet to receive any negative feedback from him.

I don't have access to other HW devices (qib, nes) or iwarp devices, so RDS is
compile-tested only.

I'm targeting this for 4.4, so I'd appreciate more feedback and broader testing
coverage.

The code is available at: https://github.com/sagigrimberg/linux/tree/reg_api.3

Sagi Grimberg (24):
  IB/core: Introduce new fast registration API
  IB/mlx5: Remove dead fmr code
  IB/mlx5: Support the new memory registration API
  IB/mlx4: Support the new memory registration API
  RDMA/ocrdma: Support the new memory registration API
  RDMA/cxgb3: Support the new memory registration API
  iw_cxgb4: Support the new memory registration API
  IB/qib: Support the new memory registration API
  RDMA/nes: Support the new memory registration API
  IB/iser: Port to new fast registration API
  iser-target: Port to new memory registration API
  xprtrdma: Port to new memory registration API
  svcrdma: Port to new memory registration API
  RDS/IW: Convert to new memory registration API
  IB/srp: Convert to new memory registration API
  IB/mlx5: Remove old FRWR API support
  IB/mlx4: Remove old FRWR API support
  RDMA/ocrdma: Remove old FRWR API
  RDMA/cxgb3: Remove old FRWR API
  iw_cxgb4: Remove old FRWR API
  IB/qib: Remove old FRWR API
  RDMA/nes: Remove old FRWR API
  IB/hfi1: Remove old fast registration API support
  IB/core: Remove old fast registration API

 drivers/infiniband/core/verbs.c | 132 ---
 drivers/infiniband/hw/cxgb3/iwch_cq.c   |   2 +-
 drivers/infiniband/hw/cxgb3/iwch_provider.c |  39 +++--
 drivers/infiniband/hw/cxgb3/iwch_provider.h |   2 +
 drivers/infiniband/hw/cxgb3/iwch_qp.c   |  37 +++--
 drivers/infiniband/hw/cxgb4/cq.c|   2 +-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h  |  25 +--
 drivers/infiniband/hw/cxgb4/mem.c   |  61 +++
 drivers/infiniband/hw/cxgb4/provider.c  |   3 +-
 drivers/infiniband/hw/cxgb4/qp.c|  46 +++---
 drivers/infiniband/hw/mlx4/cq.c |   2 +-
 drivers/infiniband/hw/mlx4/main.c   |   3 +-
 drivers/infiniband/hw/mlx4/mlx4_ib.h|  22 +--
 drivers/infiniband/hw/mlx4/mr.c | 120 --
 drivers/infiniband/hw/mlx4/qp.c |  30 ++--
 drivers/infiniband/hw/mlx5/cq.c |   4 +-
 drivers/infiniband/hw/mlx5/main.c   |   3 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h|  47 +-
 drivers/infiniband/hw/mlx5/mr.c | 107 +++-
 drivers/infiniband/hw/mlx5/qp.c | 140 
 drivers/infiniband/hw/nes/nes_hw.h  |   6 -
 drivers/infiniband/hw/nes/nes_verbs.c   | 161 +++---
 drivers/infiniband/hw/nes/nes_verbs.h   |   4 +
 drivers/infiniband/hw/ocrdma/ocrdma.h   |   2 +
 drivers/infiniband/hw/ocrdma/ocrdma_main.c  |   3 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 151 -
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.h |   7 +-
 drivers/infiniband/hw/qib/qib_keys.c|  40 ++---
 drivers/infiniband/hw/qib/qib_mr.c  |  46 +++---
 drivers/infiniband/hw/qib/qib_verbs.c   |  13 +-
 drivers/infiniband/hw/qib/qib_verbs.h