Re: [PATCH v3 5/6] svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies

2015-12-12 Thread Tom Talpey

Three comments.

On 12/7/2015 12:43 PM, Chuck Lever wrote:

To support the NFSv4.1 backchannel on RDMA connections, add a
capability for receiving an RPC/RDMA reply on a connection
established by a client.
(snip)



+/* By convention, backchannel calls arrive via rdma_msg type


"By convention" is ok, but it's important to note that this is
actually not "by protocol". Therefore, the following check may
reject valid messages. Even though it is unlikely an implementation
will insert chunks, it's not illegal, and ignoring them will
be less harmful. So I'm going to remake my earlier observation
that three checks below should be removed:


+ * messages, and never populate the chunk lists. This makes
+ * the RPC/RDMA header small and fixed in size, so it is
+ * straightforward to check the RPC header's direction field.
+ */
+static bool
+svc_rdma_is_backchannel_reply(struct svc_xprt *xprt, struct rpcrdma_msg *rmsgp)
+{
+   __be32 *p = (__be32 *)rmsgp;
+
+   if (!xprt->xpt_bc_xprt)
+   return false;
+
+   if (rmsgp->rm_type != rdma_msg)
+   return false;


These three:


+   if (rmsgp->rm_body.rm_chunks[0] != xdr_zero)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[1] != xdr_zero)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[2] != xdr_zero)
+   return false;
+
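For reference, a sketch of the relaxed predicate (mine, untested;
note that if chunks are allowed through, the header is no longer
fixed-size, so the caller would have to parse past any chunk lists
to find the RPC direction word):

	static bool
	svc_rdma_is_backchannel_reply(struct svc_xprt *xprt,
				      struct rpcrdma_msg *rmsgp)
	{
		if (!xprt->xpt_bc_xprt)
			return false;
		if (rmsgp->rm_type != rdma_msg)
			return false;
		/* Chunk lists, if present, are the sender's concern. */
		return true;
	}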


(snip)

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index a1fd74a..3895574 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -309,6 +309,8 @@ struct rpcrdma_buffer {
u32 rb_bc_srv_max_requests;
spinlock_t  rb_reqslock;/* protect rb_allreqs */
struct list_headrb_allreqs;
+
+   u32 rb_bc_max_requests;


Why does this need to be u32? Shouldn't it be an int, and also the
rb_bc_srv_max_requests just above? The forward channel max_requests
are int, btw.

Tom.


Re: [PATCH v3 1/6] svcrdma: Do not send XDR roundup bytes for a write chunk

2015-12-12 Thread Tom Talpey

Two small comments.

On 12/7/2015 12:42 PM, Chuck Lever wrote:

Minor optimization: when dealing with write chunk XDR roundup, do
not post a Write WR for the zero bytes in the pad. Simply update
the write segment in the RPC-over-RDMA header to reflect the extra
pad bytes.
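
For context, XDR encodes in 4-byte units; a sketch of the pad
computation (illustrative, not part of the patch):

	static inline unsigned int xdr_pad_bytes(unsigned int len)
	{
		/* e.g. a 4098-byte payload carries 2 pad bytes */
		return (4 - (len & 3)) & 3;
	}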

The Reply chunk is also a write chunk, but the server does not use
send_write_chunks() to send the Reply chunk. That's OK in this case:
the server Upper Layer typically marshals the Reply chunk contents
in a single contiguous buffer, without a separate tail for the XDR
pad.

The comments and the variable naming refer to "chunks" but what is
really meant is "segments." The existing code sends only one
xdr_write_chunk per RPC reply.

The fix assumes this as well. When the XDR pad in the first write
chunk is reached, the assumption is the Write list is complete and
send_write_chunks() returns.

That will remain a valid assumption until the server Upper Layer can
support multiple bulk payload results per RPC.

Signed-off-by: Chuck Lever 
---
  net/sunrpc/xprtrdma/svc_rdma_sendto.c |7 +++
  1 file changed, 7 insertions(+)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 969a1ab..bad5eaa 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
arg_ch->rs_handle,
arg_ch->rs_offset,
write_len);
+
+   /* Do not send XDR pad bytes */


It might be clearer to say "marshal" instead of "send".


+   if (chunk_no && write_len < 4) {


Why is it necessary to test chunk_no here at all? It is not
possible for leading data to ever be padding, nor is a leading
data element ever less than 4 bytes long. Right?

Tom.


+   chunk_no++;
+   break;
+   }
+
chunk_off = 0;
while (write_len) {
ret = send_write(xprt, rqstp,



Re: [PATCH v2 6/7] xprtrdma: Add class for RDMA backwards direction transport

2015-12-01 Thread Tom Talpey

On 11/30/2015 5:25 PM, Chuck Lever wrote:

To support the server-side of an NFSv4.1 backchannel on RDMA
connections, add a transport class that enables backward
direction messages on an existing forward channel connection.




+static void *
+xprt_rdma_bc_allocate(struct rpc_task *task, size_t size)
+{
+   struct rpc_rqst *rqst = task->tk_rqstp;
+   struct svc_rdma_op_ctxt *ctxt;
+   struct svcxprt_rdma *rdma;
+   struct svc_xprt *sxprt;
+   struct page *page;
+
+   if (size > PAGE_SIZE) {
+   WARN_ONCE(1, "failed to handle buffer allocation (size %zu)\n",
+ size);


You may want to add more context to that rather cryptic string, at
least the function name.

Also, it's not exactly "failed to handle", it's an invalid size. Why
would this ever happen? Why even log it?
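
Something along these lines would be clearer (a sketch, not the
patch's text):

	if (size > PAGE_SIZE) {
		WARN_ONCE(1, "svcrdma: %s: invalid buffer size %zu\n",
			  __func__, size);
		return NULL;
	}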




+static int
+rpcrdma_bc_send_request(struct svcxprt_rdma *rdma, struct rpc_rqst *rqst)
+{

...

+
+drop_connection:
+   dprintk("Failed to send backchannel call\n");


Ditto on the prefix / function context.

And also...


+   dprintk("%s: sending request with xid: %08x\n",
+   __func__, be32_to_cpu(rqst->rq_xid));

...

+   dprintk("RPC:   %s: xprt %p\n", __func__, xprt);


The format strings for many of the dprintk's are somewhat inconsistent.
Some start with "RPC", some with the function name, and some (in other
patches of this series) with "svcrdma". Confusing, and perhaps hard to
pick out of the log.
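
For example, one greppable prefix plus __func__ throughout (sketch
only):

	dprintk("svcrdma: %s: sending request with xid: %08x\n",
		__func__, be32_to_cpu(rqst->rq_xid));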




Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR

2015-11-24 Thread Tom Talpey

On 11/24/2015 1:52 AM, Christoph Hellwig wrote:

On Mon, Nov 23, 2015 at 07:57:42PM -0500, Tom Talpey wrote:

On 11/23/2015 5:14 PM, Chuck Lever wrote:

FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
is a synchronous verb. However, some improvements can be made here.


I thought FMR support was about to be removed in the core.


Seems like everyone would love to kill it, but no one dares to do
it just yet.  Reasons to keep FMRs:

  - mthca doesn't support FRs but hasn't been staged out
  - RDS only supports FMRs for the IB transport (it does support FRs for
an entirely separate iWarp transport.  I'd love to know why we can't
just use that iWarp transport for IB as well).
  - mlx4 hardware might be slower with FRs than FMRs (Or Gerlitz
mentioned this in reply to the iSER remote invalidation series).


Ok, let me add then.

Reasons to kill FMRs:

  - FMRs are UNSAFE. They protect on page boundaries, not bytes,
therefore a length error on an operation can corrupt data
outside the i/o buffer remotely. To say nothing of maliciousness.
(See the illustration after this list.)

  - FMRs are SLOWER. The supposed performance improvement of FMR
comes from the "mempool" that is often placed in front of them.
These pools cache remotely registered regions, in the hope of
reusing them without having to invalidate in between. See the
first bullet for the ramifications of that.

  - FMRs are SYNCHRONOUS. The FMR verbs make direct, blocking calls
into the provider, which do not return control to the caller as
promptly as work queue operations do. The processor spends
more time servicing the device, typically at raised irq.

  - FMRs are PROPRIETARY. Only one vendor supports them, only one
of their NICs has no decent alternative, and that NIC is
functionally obsolete.

If storage stacks continue to support them, I personally feel it
is critical to carefully document the risks to customer data as
part of shipping the code.
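
To make the first point concrete, a sketch (mine, assuming 4 KiB
pages and a buffer that straddles a page boundary):

	/* FMR registration is page-granular: a 100-byte buffer at
	 * offset 4000 of a 4 KiB page exposes bytes [0, 8192) to the
	 * remote peer, not just the 100 bytes of the i/o. */
	u64 exposed_start = buf_addr & ~((u64)PAGE_SIZE - 1);
	u64 exposed_end   = (buf_addr + buf_len + PAGE_SIZE - 1) &
			    ~((u64)PAGE_SIZE - 1);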

Tom.


Re: [PATCH 2/9] IB: add a proper completion queue abstraction

2015-11-24 Thread Tom Talpey

On 11/24/2015 2:03 AM, Jason Gunthorpe wrote:

On Mon, Nov 23, 2015 at 06:35:28PM -0800, Caitlin Bestler wrote:

Are there actual HCAs that make this mistake?


All IB HCAs have this behavior and require apps to see a send CQ
completion before making any statements about the state of the send Q
or buffers handed over to the HCA. Tom and I have seen this in real
systems under proper stress conditions. [Which is why I am so certain
about this, because when I first hit it years ago I dug into the spec
and figured out it was not a HW bug I was looking at]


To be clear, I saw the reply-completion-before-request-completion on
Windows, not Linux, but the principle is identical. It's simply a
fact of life on a multiprocessor, unless you want to throw in locks
and synchronization rules that drivers have to follow to enforce
ordered completions across queues. Which trust me, you don't.

In Windows SMB Direct, we added reference counts around pretty much
every verb interaction associated with each upper layer operation,
and did not retire them until all refcounts went to zero. It is
excruciatingly correct yet performs incredibly well.
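
Roughly, the pattern looks like this (a Linux-flavored sketch, not
the Windows code):

	/* One reference per posted WR; the operation retires only
	 * when every completion, including the send CQE, is seen. */
	struct io_ctx {
		atomic_t refs;
		void (*done)(struct io_ctx *io);
	};

	static void io_put(struct io_ctx *io)
	{
		if (atomic_dec_and_test(&io->refs))
			io->done(io);	/* all WRs accounted for */
	}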

Tom.


Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method

2015-11-24 Thread Tom Talpey

On 11/24/2015 5:59 AM, Sagi Grimberg wrote:

As I see it, if we don't wait for local-invalidate to complete before
unmap and IO completion (and no one does)


For the record, that is false. Windows quite strictly performs these
steps, and delivers millions of real 4K IOPS over SMB Direct.


Waiting for local invalidate to complete would be a really big
sacrifice for our storage ULPs.


Not waiting would also be a sacrifice, by compromising data integrity.
It's a tough tradeoff, but if you choose to go that way it will be
critical to be honest about the consequences to the data.

Tom.


Re: [PATCH v1 2/8] svcrdma: Define maximum number of backchannel requests

2015-11-23 Thread Tom Talpey

On 11/23/2015 5:20 PM, Chuck Lever wrote:

Extra resources for handling backchannel requests have to be
pre-allocated when a transport instance is created. Set a limit.

Signed-off-by: Chuck Lever 
---
  include/linux/sunrpc/svc_rdma.h  |5 +
  net/sunrpc/xprtrdma/svc_rdma_transport.c |6 +-
  2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index f869807..478aa30 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -178,6 +178,11 @@ struct svcxprt_rdma {
  #define RPCRDMA_SQ_DEPTH_MULT   8
  #define RPCRDMA_MAX_REQUESTS    32
  #define RPCRDMA_MAX_REQ_SIZE    4096
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)


Why is this a config option? Why wouldn't you always want
this? It's needed for any post-1990 NFS dialect.


+#define RPCRDMA_MAX_BC_REQUESTS 8


Why a constant 8? The forward channel value is apparently
configurable, just a few lines down.


+#else
+#define RPCRDMA_MAX_BC_REQUESTS 0
+#endif

  #define RPCSVC_MAXPAYLOAD_RDMA  RPCSVC_MAXPAYLOAD

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index b348b4a..01c7b36 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -923,8 +923,10 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
  (size_t)RPCSVC_MAXPAGES);
newxprt->sc_max_sge_rd = min_t(size_t, devattr.max_sge_rd,
   RPCSVC_MAXPAGES);
+   /* XXX: what if HCA can't support enough WRs for bc operation? */
newxprt->sc_max_requests = min((size_t)devattr.max_qp_wr,
-  (size_t)svcrdma_max_requests);
+  (size_t)(svcrdma_max_requests +
+  RPCRDMA_MAX_BC_REQUESTS));
newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_max_requests;

/*
@@ -964,7 +966,9 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
qp_attr.event_handler = qp_event_handler;
qp_attr.qp_context = &newxprt->sc_xprt;
qp_attr.cap.max_send_wr = newxprt->sc_sq_depth;
+   qp_attr.cap.max_send_wr += RPCRDMA_MAX_BC_REQUESTS;
qp_attr.cap.max_recv_wr = newxprt->sc_max_requests;
+   qp_attr.cap.max_recv_wr += RPCRDMA_MAX_BC_REQUESTS;
qp_attr.cap.max_send_sge = newxprt->sc_max_sge;
qp_attr.cap.max_recv_sge = newxprt->sc_max_sge;
qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;



Re: [PATCH v1 2/8] svcrdma: Define maximum number of backchannel requests

2015-11-23 Thread Tom Talpey

On 11/23/2015 8:09 PM, Chuck Lever wrote:



On Nov 23, 2015, at 7:39 PM, Tom Talpey <t...@talpey.com> wrote:

On 11/23/2015 5:20 PM, Chuck Lever wrote:

Extra resources for handling backchannel requests have to be
pre-allocated when a transport instance is created. Set a limit.

Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
---
  include/linux/sunrpc/svc_rdma.h  |5 +
  net/sunrpc/xprtrdma/svc_rdma_transport.c |6 +-
  2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index f869807..478aa30 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -178,6 +178,11 @@ struct svcxprt_rdma {
  #define RPCRDMA_SQ_DEPTH_MULT   8
  #define RPCRDMA_MAX_REQUESTS    32
  #define RPCRDMA_MAX_REQ_SIZE    4096
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)


Why is this a config option? Why wouldn't you always want
this? It's needed for any post-1990 NFS dialect.


I think some distros want to be able to compile out NFSv4.x
on small systems, and take all the backchannel cruft with it.


So shouldn't it follow the NFSv4.x config options then?





+#define RPCRDMA_MAX_BC_REQUESTS 8


Why a constant 8? The forward channel value is apparently
configurable, just a few lines down.


The client side backward direction credit limit, now
in 4.4, is already a constant.

The client side ULP uses a constant for the slot table
size: NFS4_MAX_BACK_CHANNEL_OPS. I’m not 100% sure but
the server seems to just echo that number back to the
client.

I’d rather not add an admin knob for this. Why would it
be necessary?


Because no constant is ever correct. Why isn't it "1"? Do
you allow multiple credits? Why not that value?

For instance.





+#else
+#define RPCRDMA_MAX_BC_REQUESTS 0
+#endif

  #define RPCSVC_MAXPAYLOAD_RDMA  RPCSVC_MAXPAYLOAD

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index b348b4a..01c7b36 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -923,8 +923,10 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
  (size_t)RPCSVC_MAXPAGES);
newxprt->sc_max_sge_rd = min_t(size_t, devattr.max_sge_rd,
   RPCSVC_MAXPAGES);
+   /* XXX: what if HCA can't support enough WRs for bc operation? */
newxprt->sc_max_requests = min((size_t)devattr.max_qp_wr,
-  (size_t)svcrdma_max_requests);
+  (size_t)(svcrdma_max_requests +
+  RPCRDMA_MAX_BC_REQUESTS));
newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_max_requests;

/*
@@ -964,7 +966,9 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
qp_attr.event_handler = qp_event_handler;
qp_attr.qp_context = &newxprt->sc_xprt;
qp_attr.cap.max_send_wr = newxprt->sc_sq_depth;
+   qp_attr.cap.max_send_wr += RPCRDMA_MAX_BC_REQUESTS;
qp_attr.cap.max_recv_wr = newxprt->sc_max_requests;
+   qp_attr.cap.max_recv_wr += RPCRDMA_MAX_BC_REQUESTS;
qp_attr.cap.max_send_sge = newxprt->sc_max_sge;
qp_attr.cap.max_recv_sge = newxprt->sc_max_sge;
qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;



--
Chuck Lever






Re: [PATCH v1 6/8] xprtrdma: Add class for RDMA backwards direction transport

2015-11-23 Thread Tom Talpey

On 11/23/2015 5:21 PM, Chuck Lever wrote:

To support the server-side of an NFSv4.1 backchannel on RDMA
connections, add a transport class for backwards direction
operation.


So, what's special here is that it re-uses an existing forward
channel's connection? If not, it would seem unnecessary to
define a new type/transport semantic. Say this?



Signed-off-by: Chuck Lever 
---
  include/linux/sunrpc/xprt.h  |1
  net/sunrpc/xprt.c|1
  net/sunrpc/xprtrdma/svc_rdma_transport.c |   14 +-
  net/sunrpc/xprtrdma/transport.c  |  243 ++
  net/sunrpc/xprtrdma/xprt_rdma.h  |2
  5 files changed, 256 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index 69ef5b3..7637ccd 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -85,6 +85,7 @@ struct rpc_rqst {
__u32 * rq_buffer;  /* XDR encode buffer */
size_t  rq_callsize,
rq_rcvsize;
+   void *  rq_privdata; /* xprt-specific per-rqst data */
size_t  rq_xmit_bytes_sent; /* total bytes sent */
size_t  rq_reply_bytes_recvd;   /* total reply bytes */
/* received */
diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
index 2e98f4a..37edea6 100644
--- a/net/sunrpc/xprt.c
+++ b/net/sunrpc/xprt.c
@@ -1425,3 +1425,4 @@ void xprt_put(struct rpc_xprt *xprt)
if (atomic_dec_and_test(&xprt->count))
xprt_destroy(xprt);
  }
+EXPORT_SYMBOL_GPL(xprt_put);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 58ec362..3768a7f 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1182,12 +1182,14 @@ static void __svc_rdma_free(struct work_struct *work)
  {
struct svcxprt_rdma *rdma =
container_of(work, struct svcxprt_rdma, sc_work);
-   dprintk("svcrdma: svc_rdma_free(%p)\n", rdma);
+   struct svc_xprt *xprt = &rdma->sc_xprt;
+
+   dprintk("svcrdma: %s(%p)\n", __func__, rdma);

/* We should only be called from kref_put */
-   if (atomic_read(&rdma->sc_xprt.xpt_ref.refcount) != 0)
+   if (atomic_read(&xprt->xpt_ref.refcount) != 0)
pr_err("svcrdma: sc_xprt still in use? (%d)\n",
-  atomic_read(&rdma->sc_xprt.xpt_ref.refcount));
+  atomic_read(&xprt->xpt_ref.refcount));

/*
 * Destroy queued, but not processed read completions. Note
@@ -1222,6 +1224,12 @@ static void __svc_rdma_free(struct work_struct *work)
pr_err("svcrdma: dma still in use? (%d)\n",
   atomic_read(&rdma->sc_dma_used));

+   /* Final put of backchannel client transport */
+   if (xprt->xpt_bc_xprt) {
+   xprt_put(xprt->xpt_bc_xprt);
+   xprt->xpt_bc_xprt = NULL;
+   }
+
/* De-allocate fastreg mr */
rdma_dealloc_frmr_q(rdma);

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 8c545f7..fda7488 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -51,6 +51,7 @@
  #include 
  #include 
  #include 
+#include 

  #include "xprt_rdma.h"

@@ -148,7 +149,10 @@ static struct ctl_table sunrpc_table[] = {
  #define RPCRDMA_MAX_REEST_TO  (30U * HZ)
  #define RPCRDMA_IDLE_DISC_TO  (5U * 60 * HZ)

-static struct rpc_xprt_ops xprt_rdma_procs;/* forward reference */
+static struct rpc_xprt_ops xprt_rdma_procs;
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+static struct rpc_xprt_ops xprt_rdma_bc_procs;
+#endif

  static void
  xprt_rdma_format_addresses4(struct rpc_xprt *xprt, struct sockaddr *sap)
@@ -499,7 +503,7 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
if (req == NULL)
return NULL;

-   flags = GFP_NOIO | __GFP_NOWARN;
+   flags = RPCRDMA_DEF_GFP;
if (RPC_IS_SWAPPER(task))
flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;

@@ -684,6 +688,199 @@ xprt_rdma_disable_swap(struct rpc_xprt *xprt)
  {
  }

+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+
+/* Server-side transport endpoint wants a whole page for its send
+ * buffer. The client RPC code constructs the RPC header in this
+ * buffer before it invokes ->send_request.
+ */
+static void *
+xprt_rdma_bc_allocate(struct rpc_task *task, size_t size)
+{
+   struct rpc_rqst *rqst = task->tk_rqstp;
+   struct svc_rdma_op_ctxt *ctxt;
+   struct svcxprt_rdma *rdma;
+   struct svc_xprt *sxprt;
+   struct page *page;
+
+   if (size > PAGE_SIZE) {
+   WARN_ONCE(1, "failed to handle buffer allocation (size %zu)\n",
+ size);
+   return NULL;
+   }
+
+   page = alloc_page(RPCRDMA_DEF_GFP);
+   if 

Re: [PATCH v1 1/9] xprtrdma: Add a safety margin for receive buffers

2015-11-23 Thread Tom Talpey

On 11/23/2015 5:13 PM, Chuck Lever wrote:

Rarely, senders post a Send that is larger than the client's inline
threshold. That can be due to a bug, or the client and server may
not have communicated about their inline limits. RPC-over-RDMA
currently doesn't specify any particular limit on inline size, so
peers have to guess what it is.

It is fatal to the connection if the size of a Send is larger than
the client's receive buffer. The sender is likely to retry with the
same message size, so the workload is stuck at that point.

Follow Postel's robustness principle: Be conservative in what you
do, be liberal in what you accept from others. Increase the size of
client receive buffers by a safety margin, and add a warning when
the inline threshold is exceeded during receive.


Safety is good, but how do we know the chosen value is enough?
Isn't it better to fail the badly-composed request and be done
with it? Even if the stupid sender loops, which it will do
anyway.



Note the Linux server's receive buffers are already page-sized.

Signed-off-by: Chuck Lever 
---
  net/sunrpc/xprtrdma/rpc_rdma.c  |7 +++
  net/sunrpc/xprtrdma/verbs.c |6 +-
  net/sunrpc/xprtrdma/xprt_rdma.h |5 +
  3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c10d969..a169252 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -776,6 +776,7 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
int rdmalen, status;
unsigned long cwnd;
u32 credits;
+   RPC_IFDEBUG(struct rpcrdma_create_data_internal *cdata);

dprintk("RPC:   %s: incoming rep %p\n", __func__, rep);

@@ -783,6 +784,12 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
goto out_badstatus;
if (rep->rr_len < RPCRDMA_HDRLEN_MIN)
goto out_shortreply;
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+   cdata = &r_xprt->rx_data;
+   if (rep->rr_len > cdata->inline_rsize)
+   pr_warn("RPC: %u byte reply exceeds inline threshold\n",
+   rep->rr_len);
+#endif

headerp = rdmab_to_msg(rep->rr_rdmabuf);
if (headerp->rm_vers != rpcrdma_version)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index eadd1655..e3f12e2 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -924,7 +924,11 @@ rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
if (rep == NULL)
goto out;

-   rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize,
+   /* The actual size of our receive buffers is increased slightly
+* to prevent small receive overruns from killing our connection.
+*/
+   rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize +
+  RPCRDMA_RECV_MARGIN,
   GFP_KERNEL);
if (IS_ERR(rep->rr_rdmabuf)) {
rc = PTR_ERR(rep->rr_rdmabuf);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index ac7f8d4..1b72ab1 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -337,6 +337,11 @@ struct rpcrdma_create_data_internal {
  #define RPCRDMA_INLINE_PAD_VALUE(rq)\
rpcx_to_rdmad(rq->rq_xprt).padding

+/* To help prevent spurious connection shutdown, allow senders to
+ * overrun our receive inline threshold by a small bit.
+ */
+#define RPCRDMA_RECV_MARGIN (128)
+
  /*
   * Statistics for RPCRDMA
   */



Re: [PATCH v1 1/9] xprtrdma: Add a safety margin for receive buffers

2015-11-23 Thread Tom Talpey

On 11/23/2015 8:16 PM, Chuck Lever wrote:



On Nov 23, 2015, at 7:55 PM, Tom Talpey <t...@talpey.com> wrote:

On 11/23/2015 5:13 PM, Chuck Lever wrote:

Rarely, senders post a Send that is larger than the client's inline
threshold. That can be due to a bug, or the client and server may
not have communicated about their inline limits. RPC-over-RDMA
currently doesn't specify any particular limit on inline size, so
peers have to guess what it is.

It is fatal to the connection if the size of a Send is larger than
the client's receive buffer. The sender is likely to retry with the
same message size, so the workload is stuck at that point.

Follow Postel's robustness principle: Be conservative in what you
do, be liberal in what you accept from others. Increase the size of
client receive buffers by a safety margin, and add a warning when
the inline threshold is exceeded during receive.


Safety is good, but how do we know the chosen value is enough?
Isn't it better to fail the badly-composed request and be done
with it? Even if the stupid sender loops, which it will do
anyway.


It’s good enough to compensate for the most common sender bug,
which is that the sender did not account for the 28 bytes of
the RPC-over-RDMA header when it built the send buffer. The
additional 100 byte margin is gravy.
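
For reference, the 28 is the minimal RPC-over-RDMA header, seven
XDR words (a sketch of the arithmetic):

	/* rm_xid + rm_vers + rm_credit + rm_type + three empty
	 * (xdr_zero) chunk lists = 7 * 4 = 28 bytes, i.e.
	 * RPCRDMA_HDRLEN_MIN */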


I think it's good to have sympathy and resilience to differing
designs on the other end of the wire, but I fail to have it for
stupid bugs. Unless this can take down the receiver, fail it fast.

MHO.



The loop occurs because the server gets a completion error.
The client just sees a connection loss. There’s no way for it
to know it should fail the RPC, so it keeps trying.

Perhaps the server could remember that the reply failed, and
when the client retransmits, it can simply return that XID
with an RDMA_ERROR.
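
A sketch of that idea (field names per struct rpcrdma_msg; the
error code choice and the XID cache are illustrative assumptions):

	/* On retransmit of an XID whose reply was too large to send,
	 * answer with a terse RDMA_ERROR instead of retrying the
	 * oversized Send. */
	rmsgp->rm_xid = xid;
	rmsgp->rm_vers = rpcrdma_version;
	rmsgp->rm_credit = cpu_to_be32(credits);
	rmsgp->rm_type = cpu_to_be32(RDMA_ERROR);
	rmsgp->rm_body.rm_error.rm_type = cpu_to_be32(ERR_CHUNK);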



Note the Linux server's receive buffers are already page-sized.

Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
---
  net/sunrpc/xprtrdma/rpc_rdma.c  |7 +++
  net/sunrpc/xprtrdma/verbs.c |6 +-
  net/sunrpc/xprtrdma/xprt_rdma.h |5 +
  3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c10d969..a169252 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -776,6 +776,7 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
int rdmalen, status;
unsigned long cwnd;
u32 credits;
+   RPC_IFDEBUG(struct rpcrdma_create_data_internal *cdata);

dprintk("RPC:   %s: incoming rep %p\n", __func__, rep);

@@ -783,6 +784,12 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
goto out_badstatus;
if (rep->rr_len < RPCRDMA_HDRLEN_MIN)
goto out_shortreply;
+#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
+   cdata = &r_xprt->rx_data;
+   if (rep->rr_len > cdata->inline_rsize)
+   pr_warn("RPC: %u byte reply exceeds inline threshold\n",
+   rep->rr_len);
+#endif

headerp = rdmab_to_msg(rep->rr_rdmabuf);
if (headerp->rm_vers != rpcrdma_version)
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index eadd1655..e3f12e2 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -924,7 +924,11 @@ rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
if (rep == NULL)
goto out;

-   rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize,
+   /* The actual size of our receive buffers is increased slightly
+* to prevent small receive overruns from killing our connection.
+*/
+   rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize +
+  RPCRDMA_RECV_MARGIN,
   GFP_KERNEL);
if (IS_ERR(rep->rr_rdmabuf)) {
rc = PTR_ERR(rep->rr_rdmabuf);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index ac7f8d4..1b72ab1 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -337,6 +337,11 @@ struct rpcrdma_create_data_internal {
  #define RPCRDMA_INLINE_PAD_VALUE(rq)\
rpcx_to_rdmad(rq->rq_xprt).padding

+/* To help prevent spurious connection shutdown, allow senders to
+ * overrun our receive inline threshold by a small bit.
+ */
+#define RPCRDMA_RECV_MARGIN (128)
+
  /*
   * Statistics for RPCRDMA
   */





--
Chuck Lever





Re: [PATCH 2/9] IB: add a proper completion queue abstraction

2015-11-23 Thread Tom Talpey

On 11/23/2015 7:00 PM, Jason Gunthorpe wrote:

On Mon, Nov 23, 2015 at 03:30:42PM -0800, Caitlin Bestler wrote:

The receive completion can be safely assumed to indicate transmit
completion over a reliable connection unless your peer has gone
completely bonkers and is replying to a command that it did not
receive.


Perhaps iWarp is different and does specify this ordering but IB does
not.


iWARP is not different. The situation you (Jason) describe has
nothing to do with the transport. It has everything to do with
as you point out the lack of causality between send and receive
completions.

It is entirely possible for the reply to be received before the
send is fully processed. For example, the send might be issued
on one core, and that core scheduled away before the completion
for the send is ready. In the meantime, the request goes on
the wire, the target processes it and replies, and the reply
is processed. Boom, the send queue completion is still pending.

Been there, seen that. Bluescreened on it, mysteriously.

A really good way to see this is with software providers, btw.
Try it with soft{roce,iwarp}, under heavy load.

Tom.



The issue with IB is how the ACK protocol is designed. There is no
strong ordering between ACKs and data transfers. A HCA can send
ACK,DATA and the network could drop the ACK. The receiver side does
not know the ACK was lost and goes ahead to process DATA.

Since only ACK advances the sendq and DATA advances the recvq it is
trivial to get a case where the recvq is advanced with a reply while
the sendq continues to wait for the ACK to be resent.

Further IB allows ACK coalescing and has no rules for how an ACK is
placed. It is entirely valid for a HCA to RECV,REPLY,ACK - for
instance.


I actually had a bug in an early iWARP emulation where the simulated
peer, because it was simulated, responded
instantly. The result was a TCP segment that both acked the
transmission *and* contained the reply. The bug was
that the code processed the reception before the transmission ack,
causing the receive completion to be placed
on the completion queue before transmit completion.


I don't know if iWARP has the same lax ordering as IB, but certainly,
what you describe is legal for IB verbs to do, and our kernel ULPs
have to cope with it.
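
A timeline sketch of the race (illustrative):

	Requester                        Responder
	  post Send(request)  -------->  request received
	                      <--------  DATA (reply)
	  recv CQE pops here
	                      <---X----  ACK (dropped, resent later)
	  send CQE arrives only after the ACK retry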

Jason


Re: [PATCH v1 8/9] xprtrdma: Invalidate in the RPC reply handler

2015-11-23 Thread Tom Talpey

On 11/23/2015 5:14 PM, Chuck Lever wrote:

There is a window between the time the RPC reply handler wakes the
waiting RPC task and when xprt_release() invokes ops->buf_free.
During this time, memory regions containing the data payload may
still be accessed by a broken or malicious server, but the RPC
application has already been allowed access to the memory containing
the RPC request's data payloads.

The server should be fenced from client memory containing RPC data
payloads _before_ the RPC application is allowed to continue.


No concern, just Hurray for implementing this fundamental
integrity requirement. It's even more important when krb5i
is in play, to avoid truly malicious injection-after-integrity.



This change also more strongly enforces send queue accounting. There
is a maximum number of RPC calls allowed to be outstanding. When an
RPC/RDMA transport is set up, just enough send queue resources are
allocated to handle registration, Send, and invalidation WRs for
each those RPCs at the same time.

Before, additional RPC calls could be dispatched while invalidation
WRs were still consuming send WQEs. When invalidation WRs backed
up, dispatching additional RPCs resulted in a send queue overrun.

Now, the reply handler prevents RPC dispatch until invalidation is
complete. This prevents RPC call dispatch until there are enough
send queue resources to proceed.

Still to do: If an RPC exits early (say, ^C), the reply handler has
no opportunity to perform invalidation. Currently, xprt_rdma_free()
still frees remaining RDMA resources, which could deadlock.
Additional changes are needed to handle invalidation properly in this
case.

Reported-by: Jason Gunthorpe 
Signed-off-by: Chuck Lever 
---
  net/sunrpc/xprtrdma/rpc_rdma.c |   10 ++
  1 file changed, 10 insertions(+)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index d7b9156..4ad72ca 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -898,6 +898,16 @@ badheader:
break;
}

+   /* Invalidate and flush the data payloads before waking the
+* waiting application. This guarantees the memory region is
+* properly fenced from the server before the application
+* accesses the data. It also ensures proper send flow
+* control: waking the next RPC waits until this RPC has
+* relinquished all its Send Queue entries.
+*/
+   if (req->rl_nchunks)
+   r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);
+
credits = be32_to_cpu(headerp->rm_credit);
if (credits == 0)
credits = 1;/* don't deadlock */



Re: [PATCH v1 5/8] svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies

2015-11-23 Thread Tom Talpey

On 11/23/2015 5:20 PM, Chuck Lever wrote:

To support the NFSv4.1 backchannel on RDMA connections, add a
capability for receiving an RPC/RDMA reply on a connection
established by a client.

Signed-off-by: Chuck Lever 
---
  net/sunrpc/xprtrdma/rpc_rdma.c  |   76 +++
  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   60 
  net/sunrpc/xprtrdma/xprt_rdma.h |4 ++
  3 files changed, 140 insertions(+)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c10d969..fef0623 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -946,3 +946,79 @@ repost:
if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, &r_xprt->rx_ep, rep))
rpcrdma_recv_buffer_put(rep);
  }
+
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+
+int
+rpcrdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
+   struct xdr_buf *rcvbuf)
+{
+   struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+   struct kvec *dst, *src = &rcvbuf->head[0];
+   struct rpc_rqst *req;
+   unsigned long cwnd;
+   u32 credits;
+   size_t len;
+   __be32 xid;
+   __be32 *p;
+   int ret;
+
+   p = (__be32 *)src->iov_base;
+   len = src->iov_len;
+   xid = rmsgp->rm_xid;
+
+   pr_info("%s: xid=%08x, length=%zu\n",
+   __func__, be32_to_cpu(xid), len);
+   pr_info("%s: RPC/RDMA: %*ph\n",
+   __func__, (int)RPCRDMA_HDRLEN_MIN, rmsgp);
+   pr_info("%s:  RPC: %*ph\n",
+   __func__, (int)len, p);
+
+   ret = -EAGAIN;
+   if (src->iov_len < 24)
+   goto out_shortreply;
+
+   spin_lock_bh(&xprt->transport_lock);
+   req = xprt_lookup_rqst(xprt, xid);
+   if (!req)
+   goto out_notfound;
+
+   dst = &req->rq_private_buf.head[0];
+   memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf));
+   if (dst->iov_len < len)
+   goto out_unlock;
+   memcpy(dst->iov_base, p, len);
+
+   credits = be32_to_cpu(rmsgp->rm_credit);
+   if (credits == 0)
+   credits = 1;/* don't deadlock */
+   else if (credits > r_xprt->rx_buf.rb_bc_max_requests)
+   credits = r_xprt->rx_buf.rb_bc_max_requests;
+
+   cwnd = xprt->cwnd;
+   xprt->cwnd = credits << RPC_CWNDSHIFT;
+   if (xprt->cwnd > cwnd)
+   xprt_release_rqst_cong(req->rq_task);
+
+   ret = 0;
+   xprt_complete_rqst(req->rq_task, rcvbuf->len);
+   rcvbuf->len = 0;
+
+out_unlock:
+   spin_unlock_bh(&xprt->transport_lock);
+out:
+   return ret;
+
+out_shortreply:
+   pr_info("svcrdma: short bc reply: xprt=%p, len=%zu\n",
+   xprt, src->iov_len);
+   goto out;
+
+out_notfound:
+   pr_info("svcrdma: unrecognized bc reply: xprt=%p, xid=%08x\n",
+   xprt, be32_to_cpu(xid));
+
+   goto out_unlock;
+}
+
+#endif /* CONFIG_SUNRPC_BACKCHANNEL */
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index ff4f01e..2b762b5 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -47,6 +47,7 @@
  #include 
  #include 
  #include 
+#include "xprt_rdma.h"

  #define RPCDBG_FACILITY   RPCDBG_SVCXPRT

@@ -567,6 +568,42 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
return ret;
  }

+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+
+/* By convention, backchannel calls arrive via rdma_msg type
+ * messages, and never populate the chunk lists. This makes
+ * the RPC/RDMA header small and fixed in size, so it is
+ * straightforward to check the RPC header's direction field.
+ */
+static bool
+svc_rdma_is_backchannel_reply(struct svc_xprt *xprt, struct rpcrdma_msg *rmsgp)
+{
+   __be32 *p = (__be32 *)rmsgp;
+
+   if (!xprt->xpt_bc_xprt)
+   return false;
+
+   if (rmsgp->rm_type != rdma_msg)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[0] != xdr_zero)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[1] != xdr_zero)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[2] != xdr_zero)
+   return false;


The above assertion is only true for the NFS behavior as spec'd
today (no chunk-bearing bulk data on existing backchannel NFS
protocol messages). That at least deserves a comment. Or, why
not simply ignore the chunks? They're not the receiver's problem.


+
+   /* sanity */
+   if (p[7] != rmsgp->rm_xid)
+   return false;
+   /* call direction */
+   if (p[8] == cpu_to_be32(RPC_CALL))
+   return false;
+
+   return true;
+}
+
+#endif /* CONFIG_SUNRPC_BACKCHANNEL */
+
  /*
   * Set up the rqstp thread context to point to the RQ buffer. If
   * necessary, pull additional data from the client with an RDMA_READ
@@ -632,6 +669,17 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
  

Re: [PATCH v1 8/8] svcrdma: Remove svc_rdma_fastreg_mr::access_flags field

2015-11-23 Thread Tom Talpey

On 11/23/2015 5:21 PM, Chuck Lever wrote:

Clean up: The access_flags field is not used outside of
rdma_read_chunk_frmr() and is always set to the same value.

Signed-off-by: Chuck Lever 
---
  include/linux/sunrpc/svc_rdma.h |1 -
  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |3 +--
  2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 243edf4..eee2a0d 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -107,7 +107,6 @@ struct svc_rdma_fastreg_mr {
struct ib_mr *mr;
struct scatterlist *sg;
int sg_nents;
-   unsigned long access_flags;
enum dma_data_direction direction;
struct list_head frmr_list;
  };
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 9480043..8ab1ab5 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -240,7 +240,6 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
read = min_t(int, (nents << PAGE_SHIFT) - *page_offset, rs_length);

frmr->direction = DMA_FROM_DEVICE;
-   frmr->access_flags = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE);
frmr->sg_nents = nents;

for (pno = 0; pno < nents; pno++) {
@@ -308,7 +307,7 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
reg_wr.wr.num_sge = 0;
reg_wr.mr = frmr->mr;
reg_wr.key = frmr->mr->lkey;
-   reg_wr.access = frmr->access_flags;
+   reg_wr.access = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE);


Wait, the REMOTE_WRITE is there to support iWARP, but it isn't
needed for IB or RoCE. Shouldn't this be updated to peek at those
new attributes to decide, instead of remaining unconditional?
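
For instance (untested sketch; whether xprt->sc_cm_id is the right
handle to test here is an assumption):

	reg_wr.access = IB_ACCESS_LOCAL_WRITE;
	if (rdma_protocol_iwarp(xprt->sc_cm_id->device,
				xprt->sc_cm_id->port_num))
		reg_wr.access |= IB_ACCESS_REMOTE_WRITE;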



reg_wr.wr.next = &read_wr.wr;

/* Prepare RDMA_READ */



Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR

2015-11-23 Thread Tom Talpey

On 11/23/2015 5:14 PM, Chuck Lever wrote:

FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
is a synchronous verb. However, some improvements can be made here.


I thought FMR support was about to be removed in the core.




1. Gather all the MRs for the RPC request onto a list, and invoke
ib_unmap_fmr() once with that list. This reduces the number of
doorbells when there is more than one MR to invalidate.

2. Perform the DMA unmap _after_ the MRs are unmapped, not before.
This is critical after invalidating a Write chunk.

Signed-off-by: Chuck Lever 
---
  net/sunrpc/xprtrdma/fmr_ops.c |   64 +
  1 file changed, 64 insertions(+)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index f1e8daf..c14f3a4 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -179,6 +179,69 @@ out_maperr:
return rc;
  }

+static void
+__fmr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+   struct ib_device *device = r_xprt->rx_ia.ri_device;
+   struct rpcrdma_mw *mw = seg->rl_mw;
+   int nsegs = seg->mr_nsegs;
+
+   seg->rl_mw = NULL;
+
+   while (nsegs--)
+   rpcrdma_unmap_one(device, seg++);
+
+   rpcrdma_put_mw(r_xprt, mw);
+}
+
+/* Invalidate all memory regions that were registered for "req".
+ *
+ * Sleeps until it is safe for the host CPU to access the
+ * previously mapped memory regions.
+ */
+static void
+fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+   struct rpcrdma_mr_seg *seg;
+   unsigned int i, nchunks;
+   struct rpcrdma_mw *mw;
+   LIST_HEAD(unmap_list);
+   int rc;
+
+   dprintk("RPC:   %s: req %p\n", __func__, req);
+
+   /* ORDER: Invalidate all of the req's MRs first
+*
+* ib_unmap_fmr() is slow, so use a single call instead
+* of one call per mapped MR.
+*/
+   for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+   seg = &req->rl_segments[i];
+   mw = seg->rl_mw;
+
+   list_add(&mw->r.fmr.fmr->list, &unmap_list);
+
+   i += seg->mr_nsegs;
+   }
+   rc = ib_unmap_fmr(&unmap_list);
+   if (rc)
+   pr_warn("%s: ib_unmap_fmr failed (%i)\n", __func__, rc);
+
+   /* ORDER: Now DMA unmap all of the req's MRs, and return
+* them to the free MW list.
+*/
+   for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+   seg = &req->rl_segments[i];
+
+   __fmr_dma_unmap(r_xprt, seg);
+
+   i += seg->mr_nsegs;
+   seg->mr_nsegs = 0;
+   }
+
+   req->rl_nchunks = 0;
+}
+
  /* Use the ib_unmap_fmr() verb to prevent further remote
   * access via RDMA READ or RDMA WRITE.
   */
@@ -231,6 +294,7 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)

  const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
+   .ro_unmap_sync  = fmr_op_unmap_sync,
.ro_unmap   = fmr_op_unmap,
.ro_open= fmr_op_open,
.ro_maxpages= fmr_op_maxpages,



Re: [PATCH v1 5/8] svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies

2015-11-23 Thread Tom Talpey

I'll have to think about whether I agree with that as a
protocol statement.

Chunks in a reply are there to account for the data that is
handled in the chunk of a request. So it kind of comes down
to whether RDMA is allowed (or used) on the backchannel. I
still think that is fundamentally an upper layer question,
not an RPC one.


On 11/23/2015 8:47 PM, Chuck Lever wrote:



On Nov 23, 2015, at 7:44 PM, Tom Talpey <t...@talpey.com> wrote:

On 11/23/2015 5:20 PM, Chuck Lever wrote:

To support the NFSv4.1 backchannel on RDMA connections, add a
capability for receiving an RPC/RDMA reply on a connection
established by a client.

Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
---
  net/sunrpc/xprtrdma/rpc_rdma.c  |   76 +++
  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   60 
  net/sunrpc/xprtrdma/xprt_rdma.h |4 ++
  3 files changed, 140 insertions(+)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c10d969..fef0623 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -946,3 +946,79 @@ repost:
if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, &r_xprt->rx_ep, rep))
rpcrdma_recv_buffer_put(rep);
  }
+
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+
+int
+rpcrdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp,
+   struct xdr_buf *rcvbuf)
+{
+   struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
+   struct kvec *dst, *src = &rcvbuf->head[0];
+   struct rpc_rqst *req;
+   unsigned long cwnd;
+   u32 credits;
+   size_t len;
+   __be32 xid;
+   __be32 *p;
+   int ret;
+
+   p = (__be32 *)src->iov_base;
+   len = src->iov_len;
+   xid = rmsgp->rm_xid;
+
+   pr_info("%s: xid=%08x, length=%zu\n",
+   __func__, be32_to_cpu(xid), len);
+   pr_info("%s: RPC/RDMA: %*ph\n",
+   __func__, (int)RPCRDMA_HDRLEN_MIN, rmsgp);
+   pr_info("%s:  RPC: %*ph\n",
+   __func__, (int)len, p);
+
+   ret = -EAGAIN;
+   if (src->iov_len < 24)
+   goto out_shortreply;
+
+   spin_lock_bh(&xprt->transport_lock);
+   req = xprt_lookup_rqst(xprt, xid);
+   if (!req)
+   goto out_notfound;
+
+   dst = &req->rq_private_buf.head[0];
+   memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf));
+   if (dst->iov_len < len)
+   goto out_unlock;
+   memcpy(dst->iov_base, p, len);
+
+   credits = be32_to_cpu(rmsgp->rm_credit);
+   if (credits == 0)
+   credits = 1;/* don't deadlock */
+   else if (credits > r_xprt->rx_buf.rb_bc_max_requests)
+   credits = r_xprt->rx_buf.rb_bc_max_requests;
+
+   cwnd = xprt->cwnd;
+   xprt->cwnd = credits << RPC_CWNDSHIFT;
+   if (xprt->cwnd > cwnd)
+   xprt_release_rqst_cong(req->rq_task);
+
+   ret = 0;
+   xprt_complete_rqst(req->rq_task, rcvbuf->len);
+   rcvbuf->len = 0;
+
+out_unlock:
+   spin_unlock_bh(&xprt->transport_lock);
+out:
+   return ret;
+
+out_shortreply:
+   pr_info("svcrdma: short bc reply: xprt=%p, len=%zu\n",
+   xprt, src->iov_len);
+   goto out;
+
+out_notfound:
+   pr_info("svcrdma: unrecognized bc reply: xprt=%p, xid=%08x\n",
+   xprt, be32_to_cpu(xid));
+
+   goto out_unlock;
+}
+
+#endif /* CONFIG_SUNRPC_BACKCHANNEL */
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index ff4f01e..2b762b5 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -47,6 +47,7 @@
  #include 
  #include 
  #include 
+#include "xprt_rdma.h"

  #define RPCDBG_FACILITY   RPCDBG_SVCXPRT

@@ -567,6 +568,42 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
return ret;
  }

+#if defined(CONFIG_SUNRPC_BACKCHANNEL)
+
+/* By convention, backchannel calls arrive via rdma_msg type
+ * messages, and never populate the chunk lists. This makes
+ * the RPC/RDMA header small and fixed in size, so it is
+ * straightforward to check the RPC header's direction field.
+ */
+static bool
+svc_rdma_is_backchannel_reply(struct svc_xprt *xprt, struct rpcrdma_msg *rmsgp)
+{
+   __be32 *p = (__be32 *)rmsgp;
+
+   if (!xprt->xpt_bc_xprt)
+   return false;
+
+   if (rmsgp->rm_type != rdma_msg)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[0] != xdr_zero)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[1] != xdr_zero)
+   return false;
+   if (rmsgp->rm_body.rm_chunks[2] != xdr_zero)
+   return false;


The above assertion is only true for the NFS behavior as spec'd
today (no chunk-bearing bulk data on existing backchannel NFS
protocol messages). That at least deserves a comment. Or, why not
simply ignore the chunks? They're not the receiver's problem.

Re: [PATCH v1 2/8] svcrdma: Define maximum number of backchannel requests

2015-11-23 Thread Tom Talpey

On 11/23/2015 8:36 PM, Chuck Lever wrote:



On Nov 23, 2015, at 8:19 PM, Tom Talpey <t...@talpey.com> wrote:

On 11/23/2015 8:09 PM, Chuck Lever wrote:



On Nov 23, 2015, at 7:39 PM, Tom Talpey <t...@talpey.com> wrote:

On 11/23/2015 5:20 PM, Chuck Lever wrote:

Extra resources for handling backchannel requests have to be
pre-allocated when a transport instance is created. Set a limit.

Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
---
  include/linux/sunrpc/svc_rdma.h  |5 +
  net/sunrpc/xprtrdma/svc_rdma_transport.c |6 +-
  2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index f869807..478aa30 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -178,6 +178,11 @@ struct svcxprt_rdma {
  #define RPCRDMA_SQ_DEPTH_MULT   8
  #define RPCRDMA_MAX_REQUESTS    32
  #define RPCRDMA_MAX_REQ_SIZE    4096
+#if defined(CONFIG_SUNRPC_BACKCHANNEL)


Why is this a config option? Why wouldn't you always want
this? It's needed for any post-1990 NFS dialect.


I think some distros want to be able to compile out NFSv4.x
on small systems, and take all the backchannel cruft with it.


So shouldn't it follow the NFSv4.x config options then?


Setting CONFIG_NFS_V4_1 sets CONFIG_SUNRPC_BACKCHANNEL.
Adding #ifdef CONFIG_NFS_V4_1 in net/sunrpc would be
a layering violation.


Ok, I guess. It seems a fairly small and abstruse detail to
surface as a config option. But if it's already there, sure,
use it.



I see however that CONFIG_SUNRPC_BACKCHANNEL controls
only the client backchannel capability. Perhaps it is
out of place to use it to enable the server’s backchannel
capability.


Ok, now it's even smaller and more abstruse. :-) To say
nothing of multiplying. ;-)






+#define RPCRDMA_MAX_BC_REQUESTS 8


Why a constant 8? The forward channel value is apparently
configurable, just a few lines down.


The client side backward direction credit limit, now
in 4.4, is already a constant.

The client side ULP uses a constant for the slot table
size: NFS4_MAX_BACK_CHANNEL_OPS. I’m not 100% sure but
the server seems to just echo that number back to the
client.

I’d rather not add an admin knob for this. Why would it
be necessary?


Because no constant is ever correct. Why isn't it "1"? Do
you allow multiple credits? Why not that value?

For instance.


There’s no justification for the forward channel credit
limit either.

The code in Linux assumes one session slot in the NFSv4.1
backchannel. When we get around to it, this can be made
more flexible.


Ok so again, why choose "8" here?



It’s much easier to add flexibility and admin control later
than it is to take it away when the knob becomes useless or
badly designed. For now, 8 works, and it doesn’t have to be
permanent.

I could add a comment that says

/* Arbitrary: support up to eight backward credits. */



+#else
+#define RPCRDMA_MAX_BC_REQUESTS 0
+#endif

  #define RPCSVC_MAXPAYLOAD_RDMA  RPCSVC_MAXPAYLOAD

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index b348b4a..01c7b36 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -923,8 +923,10 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
  (size_t)RPCSVC_MAXPAGES);
newxprt->sc_max_sge_rd = min_t(size_t, devattr.max_sge_rd,
   RPCSVC_MAXPAGES);
+   /* XXX: what if HCA can't support enough WRs for bc operation? */
newxprt->sc_max_requests = min((size_t)devattr.max_qp_wr,
-  (size_t)svcrdma_max_requests);
+  (size_t)(svcrdma_max_requests +
+  RPCRDMA_MAX_BC_REQUESTS));
newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_max_requests;

/*
@@ -964,7 +966,9 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
qp_attr.event_handler = qp_event_handler;
qp_attr.qp_context = &newxprt->sc_xprt;
qp_attr.cap.max_send_wr = newxprt->sc_sq_depth;
+   qp_attr.cap.max_send_wr += RPCRDMA_MAX_BC_REQUESTS;
qp_attr.cap.max_recv_wr = newxprt->sc_max_requests;
+   qp_attr.cap.max_recv_wr += RPCRDMA_MAX_BC_REQUESTS;
qp_attr.cap.max_send_sge = newxprt->sc_max_sge;
qp_attr.cap.max_recv_sge = newxprt->sc_max_sge;
qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;


Re: [PATCH libmlx5 0/7] Completion timestamping

2015-11-16 Thread Tom Talpey

On 11/15/2015 7:30 AM, Matan Barak wrote:

This series adds support for completion timestamp. In order to
support this feature, several extended verbs were implemented
(as instructed in libibverbs).


Can you describe what these timestamps are actually for? It's not
clear at all from the comments. I'm assuming they are for some sort
of fine-grained statistics? Are they purely for userspace consumers?



Re: [PATCH RFC 2/3] svcrdma: Use device rdma_read_access_flags

2015-11-10 Thread Tom Talpey

On 11/10/2015 4:17 PM, Jason Gunthorpe wrote:

On Tue, Nov 10, 2015 at 04:00:43PM -0500, Tom Talpey wrote:


Hmm, agreed, but it must still be acceptable to perform a registration
instead of using the local_dma_lkey. As mentioned earlier, there are
scatter limits and other implications for all-physical addressing that
an upper layer may choose to avoid.


It is always acceptable to use a lkey MR instead of the local dma
lkey, but ULPs should prefer to use the local dma lkey if possible,
for performance reasons.


Sure, the key words are "if possible". For example, a single 1MB
RDMA Read is unlikely to be possible with the dma lkey, assuming
worst-case physical page scatter it would need 256 SGEs. But 1MB
can be registered easily.
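
The worst-case arithmetic, assuming 4 KiB pages:

	(1 << 20) / (1 << 12) = 256 SGEs for a single 1MB buffer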

In any case, my point was that the ULP gets to choose, so, it looks
like we agree.

Tom.




Re: [PATCH RFC 2/3] svcrdma: Use device rdma_read_access_flags

2015-11-10 Thread Tom Talpey

On 11/10/2015 1:25 PM, Jason Gunthorpe wrote:

On Tue, Nov 10, 2015 at 04:04:32AM -0800, Christoph Hellwig wrote:

On Tue, Nov 10, 2015 at 01:46:40PM +0200, Sagi Grimberg wrote:



On 10/11/2015 13:41, Christoph Hellwig wrote:

Oh, and while we're at it.  Can someone explain why we're even
using rdma_read_chunk_frmr for IB?  It seems to work around the
fact tat iWarp only allow a single RDMA READ SGE, but it's used
whenever the device has IB_DEVICE_MEM_MGT_EXTENSIONS, which seems
wrong.


I think Steve can answer it better than I can. I think that it is
just to have a single code path for both IB and iWARP. I agree that
the condition seems wrong and for small transfers rdma_read_chunk_frmr
is really a performance loss.


Well, the code path already exists, but is only used if
IB_DEVICE_MEM_MGT_EXTENSIONS isn't set.  Below is an untested patch
that demonstrates how I think svcrdma should set up the reads.  Note
that this also allows us to entirely remove its allphys MR.

Note that as a follow-on this would also allow removing the
non-READ_W_INV code path from rdma_read_chunk_frmr as a future
step.


I like this, my only comment is we should have a rdma_cap for this
behavior, rdma_cap_needs_rdma_read_mr(pd) or something?


Windows NDKPI has this; it's the oh-so-succinctly-named flag
NDK_ADAPTER_FLAG_RDMA_READ_SINK_NOT_REQUIRED. The ULP is free
to ignore it and pass the NDK_MR_FLAG_RDMA_READ_SINK flag anyway;
the provider is expected to ignore it if not needed.




+   if (rdma_protocol_iwarp(dev, newxprt->sc_cm_id->port_num)) {


Use here


+   /*
+* iWARP requires remote write access for the data sink, and
+* only supports a single SGE for RDMA_READ requests, so we'll
+* have to use a memory registration for each RDMA_READ.
+*/
+	if (!(dev->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS)) {


Lets enforce this in the core, if rdma_cap_needs_rdma_read_mr is set
the the device must also set IB_DEVICE_MEM_MGT_EXTENSIONS, check at
device creation time.


+   } else if (rdma_ib_or_roce(dev, newxprt->sc_cm_id->port_num)) {
+   /*
+* For IB or RoCE life is easy, no unsafe write access is
+* required and multiple SGEs are supported, so we don't need
+* to use MRs.
+*/
+   newxprt->sc_reader = rdma_read_chunk_lcl;
+   } else {
+   /*
+* Neither iWarp nor IB-ish, we're out of luck.
+*/
goto errout;


No need for the else, !rdma_cap_needs_rdma_read_mr means pd->local_dma_lkey is
okay to use.


Hmm, agreed, but it must still be acceptable to perform a registration
instead of using the local_dma_lkey. As mentioned earlier, there are
scatter limits and other implications for all-physical addressing that
an upper layer may choose to avoid.

Tom.
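
A minimal sketch of the helper Jason proposes above (the name is his
suggestion; implementing it as an iWARP test is an assumption about its
natural form):

/* True when the device needs a registered MR, rather than the local
 * DMA lkey, for the data sink of an RDMA Read -- i.e. iWARP, which
 * requires remote write access on the sink buffer. */
static inline bool rdma_cap_needs_rdma_read_mr(const struct ib_device *dev,
					       u8 port_num)
{
	return rdma_protocol_iwarp(dev, port_num);
}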


Re: [PATCH RFC 1/3] IB/core: Expose a device attribute for rdma_read access flags

2015-11-10 Thread Tom Talpey

On 11/10/2015 6:51 AM, Yann Droneaud wrote:

Le mardi 10 novembre 2015 à 12:44 +0200, Sagi Grimberg a écrit :

Also, set rdma_read_access_flags in the relevant device drivers:
mlx4, mlx5, qib, ocrdma, nes: IB_ACCESS_LOCAL_WRITE
cxgb3, cxgb4, hfi: IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE



Why were those hw providers not modified to
enforce IB_ACCESS_REMOTE_WRITE when needed, instead of asking users to
set it for them ?


Because the Linux verbs don't tell the driver whether the registration
is for an RDMA Read sink buffer.

Sagi, the Windows NDKPI has an NDK_MR_FLAG_RDMA_READ_SINK attribute
which the upper layer can use to convey this information, I've mentioned
it here before.
https://msdn.microsoft.com/en-us/library/windows/hardware/hh439908(v=vs.85).aspx

When this approach is used, the upper layer doesn't have to be aware
at all of the lower layer's details.

Tom.


Re: RDMA/CM and multiple QPs

2015-09-08 Thread Tom Talpey

On 9/6/2015 11:15 AM, Bart Van Assche wrote:

On 09/06/15 00:50, Christoph Hellwig wrote:

Note that the SRP driver already in tree is a good example for this,
although it doesn't use RDMA/CM and thus already operates on a
per-ib_device level.


The challenges with regard to adding RDMA/CM support to the SRP
initiator and target drivers are:
- IANA has not yet assigned a port number to the SRP protocol (see e.g.

http://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml).



IANA doesn't do this automatically. Has anyone made the request?

You might want to think through why it needs a dedicated port number
though. iSER reuses the iSCSI port, by negotiating RDMA during login.


- The login request (struct srp_login_req) is too large for the RDMA/CM.
   A format for the login parameters for the RDMA/CM has not yet been
   standardized.


Are you suggesting that RDMA/CM perform the login? That seems
like a layering issue.

Tom.



Bart.


Re: [PATCH for-next 00/10] Add RoCE support to the mlx5 driver

2015-08-25 Thread Tom Talpey

On 8/25/2015 4:35 AM, Achiad Shochat wrote:

On 8/24/2015 11:40 PM, Tom Talpey wrote:

On 8/20/2015 12:46 PM, Achiad Shochat wrote:

Hi Doug,

This patchset adds RoCE V1 and RoCE V2 support to the mlx5 device
driver.


Question - assuming I read them correctly, these patches add the
RoCE v1 and RoCE v2 support on a per-port basis. That is, a port
can be either IB, RoCE v1 or RoCE v2, but not a combination.

Has any thought been put toward supporting these protocols on a
per-QP basis, i.e. the caller of rdma_connect() can specify the
desired protocol? Or to have some sort of discovery?

I know that there may be implementation restrictions on today's
devices, but it's my personal belief that future devices will
support multiple protocols (perhaps beyond the three above), and
laying the groundwork for this today will be important.

Tom.





RoCE v1 and RoCE v2 are supported on a per-GID basis, not sure what got
you to understand it is per-port.


Because the protocol capabilities were being marked at the ib_device
level. Ok, it's good that the protocol is per-endpoint.

But I don't understand how it will work per-GID. What if the target
node is RoCEv2 and on another subnet? How will it discover the remote's
capability and establish the right protocol? How does the initiator
select the protocol, if there is a choice?

Another way of asking this question is, why is all this stuff in the
driver, at the bottom of the stack? I think it should be in the
rdmacm layer.

Tom.



Re: [PATCH for-next 00/10] Add RoCE support to the mlx5 driver

2015-08-24 Thread Tom Talpey

On 8/20/2015 12:46 PM, Achiad Shochat wrote:

Hi Doug,

This patchset adds RoCE V1 and RoCE V2 support to the mlx5 device
driver.


Question - assuming I read them correctly, these patches add the
RoCE v1 and RoCE v2 support on a per-port basis. That is, a port
can be either IB, RoCE v1 or RoCE v2, but not a combination.

Has any thought been put toward supporting these protocols on a
per-QP basis, i.e. the caller of rdma_connect() can specify the
desired protocol? Or to have some sort of discovery?

I know that there may be implementation restrictions on today's
devices, but it's my personal belief that future devices will
support multiple protocols (perhaps beyond the three above), and
laying the groundwork for this today will be important.

Tom.



This patchset was applied and tested over the patchset "Add RoCE v2
support" which was sent to the mailing list by Matan Barak.

Achiad Shochat (10):
   IB/mlx5: Support IB device's callback for getting the link layer
   IB/mlx5: Support IB device's callback for getting its netdev
   net/mlx5_core: Break down the vport mac address query function
   net/mlx5_core: Introduce access functions to enable/disable RoCE
   net/mlx5_core: Introduce access functions to query vport RoCE fields
   IB/mlx5: Extend query_device/port to support RoCE
   IB/mlx5: Set network_hdr_type upon RoCE responder completion
   IB/mlx5: Support IB device's callbacks for adding/deleting GIDs
   IB/mlx5: Add RoCE fields to Address Vector
   IB/mlx5: Support RoCE

  drivers/infiniband/hw/mlx5/ah.c |  32 ++-
  drivers/infiniband/hw/mlx5/cq.c |  17 ++
  drivers/infiniband/hw/mlx5/main.c   | 318 ++--
  drivers/infiniband/hw/mlx5/mlx5_ib.h|  15 +-
  drivers/infiniband/hw/mlx5/qp.c |  42 +++-
  drivers/net/ethernet/mellanox/mlx5/core/vport.c | 139 ++-
  include/linux/mlx5/device.h |  26 ++
  include/linux/mlx5/driver.h |   7 -
  include/linux/mlx5/mlx5_ifc.h   |  10 +-
  include/linux/mlx5/qp.h |  21 +-
  include/linux/mlx5/vport.h  |   8 +
  11 files changed, 578 insertions(+), 57 deletions(-)





Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site

2015-07-21 Thread Tom Talpey

On 7/21/2015 7:33 AM, Steve Wise wrote:




-Original Message-
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Tom Talpey
Sent: Monday, July 20, 2015 7:15 PM
To: Steve Wise; 'Jason Gunthorpe'
Cc: 'Chuck Lever'; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
Subject: Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site

On 7/20/2015 3:41 PM, Steve Wise wrote:




-Original Message-
From: Tom Talpey [mailto:t...@talpey.com]
Sent: Monday, July 20, 2015 5:04 PM
To: Steve Wise; 'Jason Gunthorpe'
Cc: 'Chuck Lever'; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
Subject: Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site

On 7/20/2015 2:16 PM, Steve Wise wrote:




-Original Message-
From: linux-nfs-ow...@vger.kernel.org [mailto:linux-nfs-ow...@vger.kernel.org] 
On Behalf Of Jason Gunthorpe
Sent: Monday, July 20, 2015 4:06 PM
To: Tom Talpey; Steve Wise
Cc: Chuck Lever; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
Subject: Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site

On Mon, Jul 20, 2015 at 01:34:16PM -0700, Tom Talpey wrote:

On 7/20/2015 12:03 PM, Chuck Lever wrote:

All HCA providers have an ib_get_dma_mr() verb. Thus
rpcrdma_ia_open() will either grab the device's local_dma_key if one
is available, or it will call ib_get_dma_mr() which is a 100%
guaranteed fallback.


I recall that in the past, some providers did not support mapping
all of the machine's potential physical memory with a single dma_mr.
If an rnic did/does not support 44-ish bits of length per region,
for example.


Looks like you are right, but the standard in kernel is to require
ib_get_dma_mr, if the HCA can't do that, then it cannot be used on a
big memory machine with kernel ULPs.

Looking deeper, both amso1100 and cxgb3 seem limited to 32 bits of
physical memory, and silently break all kernel ULPs if they are used
on a modern machine with > 4G.

Is that right Steve?



Yes.


Based on that, should we remove the cxgb3 driver as well? Or at least
can you fix it up to at least fail get_dma_mr if there is too much
ram?



I would like to keep cxgb3 around.  I can add code to fail if the memory is >
32b.  Do you know how I get the amount of available ram?


A) are you sure it's an unsigned length, i.e. is it really 31 bits?



yes.


B) why bother to check? Are machines with 4GB interesting, and worth
supporting a special optimization?


No, but cxgb3 is still interesting to user applications, and perhaps NFSRDMA 
using FRMRs.


I'm obviously not making myself clear. I am suggesting that cxgb3 fail
the ib_get_dma_mr() verb, regardless of installed memory.

I am not suggesting it fail to load, or fail other memreg requests. It
should work normally in all other respects.


Even with its limitation, doesn't it have utility for someone using cxgb3 in an 
embedded 32b environment?


Sure, do you mean making it conditional on #if sizeof(physaddr) <= 32?
That would make sense, I guess.
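
A sketch of the runtime check Steve asks about, using max_pfn (a
standard kernel symbol for the highest page frame) to find the top of
physical memory; the function name is illustrative:

static struct ib_mr *example_get_dma_mr(struct ib_pd *pd, int acc)
{
	u64 ram_top = (u64)max_pfn << PAGE_SHIFT;

	/* A 32-bit region length cannot cover this machine's RAM. */
	if (ram_top > (1ULL << 32))
		return ERR_PTR(-EINVAL);

	/* ... otherwise proceed with the driver's normal setup ... */
	return ERR_PTR(-ENOSYS);	/* placeholder for the real path */
}

Note Tom's objection elsewhere in the thread: memory hotplug can
invalidate a boot-time check like this, which argues for failing
unconditionally instead.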


Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site

2015-07-20 Thread Tom Talpey

On 7/20/2015 1:55 PM, Chuck Lever wrote:

On Jul 20, 2015, at 4:34 PM, Tom Talpey t...@talpey.com wrote:


On 7/20/2015 12:03 PM, Chuck Lever wrote:

All HCA providers have an ib_get_dma_mr() verb. Thus
rpcrdma_ia_open() will either grab the device's local_dma_key if one
is available, or it will call ib_get_dma_mr() which is a 100%
guaranteed fallback.


I recall that in the past, some providers did not support mapping
all of the machine's potential physical memory with a single dma_mr.
If an rnic did/does not support 44-ish bits of length per region,
for example.


The buffers affected by this change are small, so I’m confident that
restriction would not be a problem here.


It's not about the buffer size, it's about the region. Because the
get_dma_mr does not specify a base address and length, the rnic must
basically attempt to map a base of zero and a length of the largest
physical offset.

This is not the case with the previous phys_reg_mr, which specified
the exact phys page range.
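
To illustrate the contrast (both verbs existed in kernels of this era;
the variables are illustrative):

	struct ib_phys_buf pb = { .addr = phys_addr, .size = buf_len };
	u64 iova = phys_addr;
	struct ib_mr *mr;

	/* ib_get_dma_mr() implies one region covering all of physical
	 * memory; the device must express [0, top of RAM) as one MR: */
	mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE);

	/* ib_reg_phys_mr() instead names an exact physical range: */
	mr = ib_reg_phys_mr(pd, &pb, 1, IB_ACCESS_LOCAL_WRITE, &iova);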


What might break with such restricted hardware is ALLPHYSICAL on
large-memory systems. ALLPHYSICAL does rely on a single DMA MR that
covers all of the NFS client’s memory.


Yes, indeed it always did, but that mode was not intended for general
use.



Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site

2015-07-20 Thread Tom Talpey

On 7/20/2015 2:16 PM, Steve Wise wrote:




-Original Message-
From: linux-nfs-ow...@vger.kernel.org [mailto:linux-nfs-ow...@vger.kernel.org] 
On Behalf Of Jason Gunthorpe
Sent: Monday, July 20, 2015 4:06 PM
To: Tom Talpey; Steve Wise
Cc: Chuck Lever; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
Subject: Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site

On Mon, Jul 20, 2015 at 01:34:16PM -0700, Tom Talpey wrote:

On 7/20/2015 12:03 PM, Chuck Lever wrote:

All HCA providers have an ib_get_dma_mr() verb. Thus
rpcrdma_ia_open() will either grab the device's local_dma_key if one
is available, or it will call ib_get_dma_mr() which is a 100%
guaranteed fallback.


I recall that in the past, some providers did not support mapping
all of the machine's potential physical memory with a single dma_mr.
If an rnic did/does not support 44-ish bits of length per region,
for example.


Looks like you are right, but the standard in kernel is to require
ib_get_dma_mr, if the HCA can't do that, then it cannot be used on a
big memory machine with kernel ULPs.

Looking deeper, both amso1100 and cxgb3 seem limited to 32 bits of
physical memory, and silently break all kernel ULPs if they are used
on a modern machine with > 4G.

Is that right Steve?



Yes.


Based on that, should we remove the cxgb3 driver as well? Or at least
can you fix it up to at least fail get_dma_mr if there is too much
ram?



I would like to keep cxgb3 around.  I can add code to fail if the memory is >
32b.  Do you know how I get the amount of available ram?


A) are you sure it's an unsigned length, i.e. is it really 31 bits?

B) why bother to check? Are machines with 4GB interesting, and worth
supporting a special optimization?



Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site

2015-07-20 Thread Tom Talpey

On 7/20/2015 12:03 PM, Chuck Lever wrote:

All HCA providers have an ib_get_dma_mr() verb. Thus
rpcrdma_ia_open() will either grab the device's local_dma_key if one
is available, or it will call ib_get_dma_mr() which is a 100%
guaranteed fallback.


I recall that in the past, some providers did not support mapping
all of the machine's potential physical memory with a single dma_mr.
If an rnic did/does not support 44-ish bits of length per region,
for example.

Have you verified that all current providers do in fact support the
necessary physical address range for this, and that the requirement
is stated in the verbs going forward?




There is never any need to use the ib_reg_phys_mr() code path in
rpcrdma_register_internal(), so it can be removed. The remaining
logic in rpcrdma_{de}register_internal() is folded into
rpcrdma_{alloc,free}_regbuf().

This is clean up only. No behavior change is expected.

Signed-off-by: Chuck Lever chuck.le...@oracle.com
Reviewed-by: Devesh Sharma devesh.sha...@avagotech.com
Reviewed-By: Sagi Grimberg sa...@mellanox.com
Tested-by: Devesh Sharma devesh.sha...@avagotech.com
---
  net/sunrpc/xprtrdma/verbs.c |  102 ---
  net/sunrpc/xprtrdma/xprt_rdma.h |1
  2 files changed, 21 insertions(+), 82 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 1065808..da184f9 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1229,75 +1229,6 @@ rpcrdma_mapping_error(struct rpcrdma_mr_seg *seg)
 		(unsigned long long)seg->mr_dma, seg->mr_dmalen);
 }
 
-static int
-rpcrdma_register_internal(struct rpcrdma_ia *ia, void *va, int len,
-			  struct ib_mr **mrp, struct ib_sge *iov)
-{
-	struct ib_phys_buf ipb;
-	struct ib_mr *mr;
-	int rc;
-
-	/*
-	 * All memory passed here was kmalloc'ed, therefore phys-contiguous.
-	 */
-	iov->addr = ib_dma_map_single(ia->ri_device,
-			va, len, DMA_BIDIRECTIONAL);
-	if (ib_dma_mapping_error(ia->ri_device, iov->addr))
-		return -ENOMEM;
-
-	iov->length = len;
-
-	if (ia->ri_have_dma_lkey) {
-		*mrp = NULL;
-		iov->lkey = ia->ri_dma_lkey;
-		return 0;
-	} else if (ia->ri_bind_mem != NULL) {
-		*mrp = NULL;
-		iov->lkey = ia->ri_bind_mem->lkey;
-		return 0;
-	}
-
-	ipb.addr = iov->addr;
-	ipb.size = iov->length;
-	mr = ib_reg_phys_mr(ia->ri_pd, &ipb, 1,
-			    IB_ACCESS_LOCAL_WRITE, &iov->addr);
-
-	dprintk("RPC:       %s: phys convert: 0x%llx "
-		"registered 0x%llx length %d\n",
-		__func__, (unsigned long long)ipb.addr,
-		(unsigned long long)iov->addr, len);
-
-	if (IS_ERR(mr)) {
-		*mrp = NULL;
-		rc = PTR_ERR(mr);
-		dprintk("RPC:       %s: failed with %i\n", __func__, rc);
-	} else {
-		*mrp = mr;
-		iov->lkey = mr->lkey;
-		rc = 0;
-	}
-
-	return rc;
-}
-
-static int
-rpcrdma_deregister_internal(struct rpcrdma_ia *ia,
-			    struct ib_mr *mr, struct ib_sge *iov)
-{
-	int rc;
-
-	ib_dma_unmap_single(ia->ri_device,
-			    iov->addr, iov->length, DMA_BIDIRECTIONAL);
-
-	if (NULL == mr)
-		return 0;
-
-	rc = ib_dereg_mr(mr);
-	if (rc)
-		dprintk("RPC:       %s: ib_dereg_mr failed %i\n", __func__, rc);
-	return rc;
-}
-
 /**
  * rpcrdma_alloc_regbuf - kmalloc and register memory for SEND/RECV buffers
  * @ia: controlling rpcrdma_ia
@@ -1317,26 +1248,30 @@ struct rpcrdma_regbuf *
 rpcrdma_alloc_regbuf(struct rpcrdma_ia *ia, size_t size, gfp_t flags)
 {
 	struct rpcrdma_regbuf *rb;
-	int rc;
+	struct ib_sge *iov;
 
-	rc = -ENOMEM;
 	rb = kmalloc(sizeof(*rb) + size, flags);
 	if (rb == NULL)
 		goto out;
 
-	rb->rg_size = size;
-	rb->rg_owner = NULL;
-	rc = rpcrdma_register_internal(ia, rb->rg_base, size,
-				       &rb->rg_mr, &rb->rg_iov);
-	if (rc)
+	iov = &rb->rg_iov;
+	iov->addr = ib_dma_map_single(ia->ri_device,
+				      (void *)rb->rg_base, size,
+				      DMA_BIDIRECTIONAL);
+	if (ib_dma_mapping_error(ia->ri_device, iov->addr))
 		goto out_free;
 
+	iov->length = size;
+	iov->lkey = ia->ri_have_dma_lkey ?
+		ia->ri_dma_lkey : ia->ri_bind_mem->lkey;
+	rb->rg_size = size;
+	rb->rg_owner = NULL;
 	return rb;
 
 out_free:
 	kfree(rb);
 out:
-	return ERR_PTR(rc);
+	return ERR_PTR(-ENOMEM);
 }
 
 /**
@@ -1347,10 +1282,15 @@ out:
 void
 rpcrdma_free_regbuf(struct rpcrdma_ia *ia, struct rpcrdma_regbuf *rb)
 {
-

Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site

2015-07-20 Thread Tom Talpey

On 7/20/2015 3:17 PM, Jason Gunthorpe wrote:

On Mon, Jul 20, 2015 at 03:04:21PM -0700, Tom Talpey wrote:

B) why bother to check? Are machines with 4GB interesting, and worth
supporting a special optimization?


mainline drivers shouldn't silently malfunction.


I meant: why bother to check? Just always return failure. It's more
honest, and robust. Someone might test with 2GB, conclude it worked,
then deploy with 8GB and boom.

Also, what about hotplug RAM? Same situation.


Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site

2015-07-20 Thread Tom Talpey

On 7/20/2015 5:34 PM, Chuck Lever wrote:


On Jul 20, 2015, at 8:11 PM, Tom Talpey t...@talpey.com wrote:


On 7/20/2015 4:36 PM, Chuck Lever wrote:


On Jul 20, 2015, at 6:41 PM, Jason Gunthorpe jguntho...@obsidianresearch.com 
wrote:


On Mon, Jul 20, 2015 at 06:31:11PM -0400, Chuck Lever wrote:


On Jul 20, 2015, at 6:26 PM, Jason Gunthorpe jguntho...@obsidianresearch.com 
wrote:


On Mon, Jul 20, 2015 at 03:03:11PM -0400, Chuck Lever wrote:

+	iov->length = size;
+	iov->lkey = ia->ri_have_dma_lkey ?
+		ia->ri_dma_lkey : ia->ri_bind_mem->lkey;
+	rb->rg_size = size;
+	rb->rg_owner = NULL;
 	return rb;


There is something odd looking about this..

ri_bind_mem is only setup in the RPCRDMA_ALLPHYSICAL and
RPCRDMA_MTHCAFMR cases.

RPCRDMA_FRMR doesn't set it up.

So this code in rpcrdma_alloc_regbuf is never called for the FRMR
case?

If yes, then, how is FRMR working? There is absolutely no reason to
use FRMR to register local send buffers, just use the global all
memory lkey...

If no, then that is an oops?


I’ve tested this code, no oops.

FRWR always uses the DMA lkey. xprtrdma does not use FRWR if
IB_DEVICE_LOCAL_DMA_LKEY is not asserted.


Ah, I see. Ok.

Is there a reason to link FRWR and LOCAL_DMA_LKEY together?


Tom would know.


Tom Talpey or Tom Tucker? I don't think I used a dma_lkey at all
in the original work, but there were some changes later.


The commit which adds FRWR support to xprtrdma introduced a check in
rpcrdma_ia_open:

+	case RPCRDMA_FRMR:
+		/* Requires both frmr reg and local dma lkey */
+		if ((devattr.device_cap_flags &
+		     (IB_DEVICE_MEM_MGT_EXTENSIONS|IB_DEVICE_LOCAL_DMA_LKEY)) !=
+		    (IB_DEVICE_MEM_MGT_EXTENSIONS|IB_DEVICE_LOCAL_DMA_LKEY)) {

This seems to be the only place the local DMA lkey requirement is
expressed or enforced. But yeah, the local DMA lkey doesn’t seem
to be used anywhere in the FRWR-specific parts of the code.


I now vaguely remember someone telling me that both attributes were
required, but a) I may have misunderstood and b) I'm sure something
has changed. FRMR was brand new at the time.


Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site

2015-07-20 Thread Tom Talpey

On 7/20/2015 3:21 PM, Chuck Lever wrote:


On Jul 20, 2015, at 5:55 PM, Tom Talpey t...@talpey.com wrote:


On 7/20/2015 1:55 PM, Chuck Lever wrote:

On Jul 20, 2015, at 4:34 PM, Tom Talpey t...@talpey.com wrote:


On 7/20/2015 12:03 PM, Chuck Lever wrote:

All HCA providers have an ib_get_dma_mr() verb. Thus
rpcrdma_ia_open() will either grab the device's local_dma_key if one
is available, or it will call ib_get_dma_mr() which is a 100%
guaranteed fallback.


I recall that in the past, some providers did not support mapping
all of the machine's potential physical memory with a single dma_mr.
If an rnic did/does not support 44-ish bits of length per region,
for example.


The buffers affected by this change are small, so I’m confident that
restriction would not be a problem here.


It's not about the buffer size, it's about the region. Because the
get_dma_mr does not specify a base address and length, the rnic must
basically attempt to map a base of zero and a length of the largest
physical offset.


Understood now, but:



This is not the case with the previous phys_reg_mr, which specified
the exact phys page range.


rpcrdma_ia_open() fails immediately if IB_DEVICE_LOCAL_DMA_LKEY
is not asserted _and_ ib_get_dma_mr() fails. We never get to the
logic in rpcrdma_regbuf_alloc() in that case, so ib_reg_phys_mr()
would still never be invoked. That really is dead code.

If you prefer, I can adjust the patch description to remove the
“100% guaranteed fallback” line.



Ok, good and yes 100% sounds like a risky statement.


Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site

2015-07-20 Thread Tom Talpey

On 7/20/2015 4:36 PM, Chuck Lever wrote:


On Jul 20, 2015, at 6:41 PM, Jason Gunthorpe jguntho...@obsidianresearch.com 
wrote:


On Mon, Jul 20, 2015 at 06:31:11PM -0400, Chuck Lever wrote:


On Jul 20, 2015, at 6:26 PM, Jason Gunthorpe jguntho...@obsidianresearch.com 
wrote:


On Mon, Jul 20, 2015 at 03:03:11PM -0400, Chuck Lever wrote:

+	iov->length = size;
+	iov->lkey = ia->ri_have_dma_lkey ?
+		ia->ri_dma_lkey : ia->ri_bind_mem->lkey;
+	rb->rg_size = size;
+	rb->rg_owner = NULL;
 	return rb;


There is something odd looking about this..

ri_bind_mem is only setup in the RPCRDMA_ALLPHYSICAL and
RPCRDMA_MTHCAFMR cases.

RPCRDMA_FRMR doesn't set it up.

So this code in rpcrdma_alloc_regbuf is never called for the FRMR
case?

If yes, then, how is FRMR working? There is absolutely no reason to
use FRMR to register local send buffers, just use the global all
memory lkey...

If no, then that is an oops?


I’ve tested this code, no oops.

FRWR always uses the DMA lkey. xprtrdma does not use FRWR if
IB_DEVICE_LOCAL_DMA_LKEY is not asserted.


Ah, I see. Ok.

Is there a reason to link FRWR and LOCAL_DMA_LKEY together?


Tom would know.


Tom Talpey or Tom Tucker? I don't think I used a dma_lkey at all
in the original work, but there were some changes later.

I thought not all rnic's supported a dma_lkey. So requiring it
seems like a bad idea, to me.


Just use
the code you have for fmr with frwr to get the lkey, probably move it
to rpcrdma_ia_open .. Physical MR should create a 2nd MR dedicated for
rkey use.


But FMR is not supported by modern mlx5 and cxgb4?


Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site

2015-07-20 Thread Tom Talpey

On 7/20/2015 3:41 PM, Steve Wise wrote:




-Original Message-
From: Tom Talpey [mailto:t...@talpey.com]
Sent: Monday, July 20, 2015 5:04 PM
To: Steve Wise; 'Jason Gunthorpe'
Cc: 'Chuck Lever'; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
Subject: Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site

On 7/20/2015 2:16 PM, Steve Wise wrote:




-Original Message-
From: linux-nfs-ow...@vger.kernel.org [mailto:linux-nfs-ow...@vger.kernel.org] 
On Behalf Of Jason Gunthorpe
Sent: Monday, July 20, 2015 4:06 PM
To: Tom Talpey; Steve Wise
Cc: Chuck Lever; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
Subject: Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site

On Mon, Jul 20, 2015 at 01:34:16PM -0700, Tom Talpey wrote:

On 7/20/2015 12:03 PM, Chuck Lever wrote:

All HCA providers have an ib_get_dma_mr() verb. Thus
rpcrdma_ia_open() will either grab the device's local_dma_key if one
is available, or it will call ib_get_dma_mr() which is a 100%
guaranteed fallback.


I recall that in the past, some providers did not support mapping
all of the machine's potential physical memory with a single dma_mr.
If an rnic did/does not support 44-ish bits of length per region,
for example.


Looks like you are right, but the standard in kernel is to require
ib_get_dma_mr, if the HCA can't do that, then it cannot be used on a
big memory machine with kernel ULPs.

Looking deeper, both amso1100 and cxgb3 seem limited to 32 bits of
physical memory, and silently break all kernel ULPs if they are used
on a modern machine with > 4G.

Is that right Steve?



Yes.


Based on that, should we remove the cxgb3 driver as well? Or at least
can you fix it up to at least fail get_dma_mr if there is too much
ram?



I would like to keep cxgb3 around.  I can add code to fail if the memory is >
32b.  Do you know how I get the amount of available ram?


A) are you sure it's an unsigned length, i.e. is it really 31 bits?



yes.


B) why bother to check? Are machines with 4GB interesting, and worth
supporting a special optimization?


No, but cxgb3 is still interesting to user applications, and perhaps NFSRDMA 
using FRMRs.


I'm obviously not making myself clear. I am suggesting that cxgb3 fail
the ib_get_dma_mr() verb, regardless of installed memory.

I am not suggesting it fail to load, or fail other memreg requests. It
should work normally in all other respects.


Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-14 Thread Tom Talpey

On 7/14/2015 5:22 AM, Sagi Grimberg wrote:

On 7/14/2015 10:37 AM, 'Christoph Hellwig' wrote:

On Mon, Jul 13, 2015 at 03:36:44PM -0400, Tom Talpey wrote:

On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote:

I think what we need to support for now is FRMR as the primary target,
and FMR as a secondar[y].


FMR is a *very* bad choice, for several reasons.

1) It's not supported by very many devices, in fact it might even
be thought to be obsolete.


It's support by the Mellanox adapters,


*Older* Mellanox adapters. mlx5 (and future drivers I assume) will
no longer support FMRs.


Right, but drop the word "will". Mlx5 (the current ConnectX-4) doesn't
support FMR. It's only supported by mlx4 and mthca drivers.


Based on looking at the consumers and the above table, I think FMRs
are still the better fallback for devices that don't support FR,


It's better if you want it fast.


Do you guys think FMR is actually fast? Because it's not. Measure it.
It might have been marginally faster than other schemes in like, 2007,
and only when the mempool was there to leave the registrations open.
Don't go back there.

Tom.


Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-14 Thread Tom Talpey

On 7/14/2015 11:36 AM, 'Christoph Hellwig' wrote:

On Tue, Jul 14, 2015 at 12:10:36PM +0300, Sagi Grimberg wrote:

Having an API that does FRMR/FMR/PHYS_MR is even worse from the ULP
PoV. If you expose an API that might schedule (PHYS_MR) it limits the
context that the caller is allowed to call in.

I'm 100% against an registration API that *might* schedule.


Oh, I had missed that PHYS_MR might sleep.  That might be the reason
why everyone is avoiding them despite Tom preferring them over FMR.


Any synchronous verb might sleep. They issue commands to the adapter
across the PCIe bus, and need to wait for the result. FMR does, too.
At least the work-request-based ones are explicit about waiting for
the completion.

Are you saying you want an API that does memory registration without
blocking AND without sending a work request to hardware, ever?





Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-14 Thread Tom Talpey

On 7/14/2015 3:32 PM, Steve Wise wrote:


On Tue, Jul 14, 2015 at 02:25:50PM -0500, Steve Wise wrote:


The benefit is that we don't have to check for iWARP protocol in the
ULP.


IB should have to pay the cost of FRMR for lkey on RDMA READ, I'm
pretty sure you have to check for iWarp at some point..



You mean should not, yea?

Ok.  I'll check for iWARP.  But don't tell me to remove the transport-specific 
hacks in this series when I post it! ;)



FYI, in the Windows NDKPI (verbs-like kernel provider interface), there
is a device attribute defined as follows:

https://msdn.microsoft.com/en-us/library/windows/hardware/hh439851(v=vs.85).aspx

NDK_ADAPTER_FLAG_RDMA_READ_SINK_NOT_REQUIRED 0x0002

Set if the provider does not require special access rights on the sink
buffer for an RDMA read request. When this flag is set, the consumer is
not required to use the NDK_MR_FLAG_RDMA_READ_SINK or
NDK_OP_FLAG_RDMA_READ_SINK flags when it registers sink buffers for
RDMA read requests. The consumer can also use logical address mappings
directly (with a token obtained with the
NDK_FN_GET_PRIVILEGED_MEMORY_REGION_TOKEN function) as RDMA read sink
buffers. This is similar to access to local buffers for RDMA write,
send, and receive operations.





Re: [PATCH v2 07/14] xprtrdma: Remove logic that constructs RDMA_MSGP type calls

2015-07-14 Thread Tom Talpey

On 7/13/2015 12:30 PM, Chuck Lever wrote:

RDMA_MSGP type calls insert a zero pad in the middle of the RPC
message to align the RPC request's data payload to the server's
alignment preferences. A server can then page flip the payload
into place to avoid a data copy in certain circumstances. However:
...

Clean up the marshaling code by removing the logic that constructs
RDMA_MSGP type calls. This also reduces the maximum send iovec size
from four to just two elements.




diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 8219011..0b50103 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h

...

+#define RPCRDMA_MAX_IOVS   (4)
+


So, shouldn't this constant be 2? The extra 2 iov's were used
only for constructing the pad.





Re: Kernel fast memory registration API proposal [RFC]

2015-07-14 Thread Tom Talpey

On 7/14/2015 12:16 PM, Steve Wise wrote:




-Original Message-
From: Sagi Grimberg [mailto:sa...@dev.mellanox.co.il]
Sent: Tuesday, July 14, 2015 11:12 AM
To: Christoph Hellwig
Cc: Jason Gunthorpe; linux-rdma@vger.kernel.org; Steve Wise; Or Gerlitz; Oren 
Duer; Chuck Lever; Bart Van Assche; Liran Liss;

Hefty,

Sean; Doug Ledford; Tom Talpey
Subject: Re: Kernel fast memory registration API proposal [RFC]

On 7/14/2015 6:33 PM, Christoph Hellwig wrote:

On Tue, Jul 14, 2015 at 11:39:24AM +0300, Sagi Grimberg wrote:

This is exactly what I don't want to do. I don't think that implicit
posting is a good idea for reasons that I mentioned earlier:

This is where I have a problem. Providing an API that may or may not
post a work request on my QP is confusing, and I don't understand its
semantics at all. Do I need to reserve slots on my QP? Should I ask for
a completion? If we suppress the completion will I see an error
completion? What should I expect to find in the wr_id?

We're much better off with keeping the post interface in place but
have it much simpler.


The ULP doesn't care if it needs to reserve the slot, and it generally
doesn't care about the notification either unless it needs to handle an
error.


That's generally correct. But the ULP may want to suppress notifications
of other posts as well (for example a storage initiator may want to
suppress send completions since it needs to reserve the request context
until it receives the status anyway - which is a receive completion). It
needs to take what was posted and what not into account to avoid
wrapping.

If I'm not mistaken, iWARP requires asking for a completion once every
send-queue length, and empirically IB drivers require it too. So again,


Correct.


Indeed, correct. But this has nothing to do with the protocol. It's
necessary because without a completion, the driver can't keep the
work queue and completion queue pointers properly in sync with the
hardware. If the tail pointer catches up and crosses the head pointer,
boom.

This is a pure verbs issue, completely local behavior, and applies
to all providers.


I don't think implicit posting is a very good idea.


Specifically, it will work only if completions are never suppressed,
if the upper layer is aware of the additional work requests and their
ordering implications, and the overhead of the interrupts is acceptable.
None of these preconditions are desirable.




The completions are another mad nightmare in the RDMA stack API.  Every
other subsystem would just allow the submitter to attach a function pointer
that handles the completion to the posted item.


This is actually a good idea. And it can be easily added, I think (at
least to the mlx drivers, which I know better). It can help remove a switch
statement for the ULPs when handling the completions.

But no, the RDMA stack

instead require ID allocators and gigantic boilerplate code in the
consumer to untangle that gigantic mess again.


I don't think the wr_id was ever intended to be an allocated tag. It's
the ULP context, usually a pointer to a transaction structure. Even
with passing a function pointer you need to get back your original
context, don't you?
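
An illustrative sketch of that usage; struct ulp_request and the
function names are hypothetical:

struct ulp_request {
	/* per-I/O state ... */
};

static int ulp_post_one(struct ib_qp *qp, struct ulp_request *req,
			struct ib_sge *sge)
{
	struct ib_send_wr wr, *bad_wr;

	memset(&wr, 0, sizeof(wr));
	wr.wr_id      = (uintptr_t)req;	/* context pointer, not a tag */
	wr.sg_list    = sge;
	wr.num_sge    = 1;
	wr.opcode     = IB_WR_SEND;
	wr.send_flags = IB_SEND_SIGNALED;

	return ib_post_send(qp, &wr, &bad_wr);
}

static void ulp_handle_wc(struct ib_wc *wc)
{
	struct ulp_request *req = (struct ulp_request *)(uintptr_t)wc->wr_id;

	/* complete req directly -- no tag allocator or lookup needed */
}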








Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-13 Thread Tom Talpey

On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote:

I think what we need to support for now is FRMR as the primary target,
and FMR as a secondar[y].


FMR is a *very* bad choice, for several reasons.

1) It's not supported by very many devices, in fact it might even
be thought to be obsolete.

2) It does not protect on byte boundaries, therefore it is unsafe
to use for anything but page-sized, page-aligned RDMA operations.

3) It is a synchronous API, i.e. it is not work-request based and
therefore is not very high performance.

4) It sometimes is used with a pool, which defers deregistration
in the hopes of amortizing overhead. However, this deferral further
increases the risk of remote access, including altering of memory
contents after the fact.

Personally, I'd recommend ib_get_phys_mr() over FMR. It at least
doesn't suffer from issues 1, 2 and 4.


Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-13 Thread Tom Talpey

On 7/13/2015 1:18 PM, Jason Gunthorpe wrote:

On Fri, Jul 10, 2015 at 11:10:23PM -0400, Doug Ledford wrote:

Black hat server is beyond the scope of this discussion.


We cannot assume an all-trusted model here, there are many
configurations to deploy NFS/iSCSI that don't assume that. Even if you
assume it for the RDMA cases (which I strongly disagree with), it
still must be proven to not weaken the existing TCP/IP case.

So, a black hat server is on the table, attacking a client that the
admin is not intending to use with RDMA, by forcing it to switch to
RDMA before auth and exploiting the RDMA side.

This is where the iwarp guys have to analyze and come back to say it
is OK. Maybe iwarp can't get to rdma without auth or something...


Two observations.

One, auth is an Upper Layer matter. It's not the job of the transport
to authenticate the peer, the user, etc. Upper layers do this, and
iSCSI performs a login, NFSv4.1+ creates a session, SMB3 creates
sessions on multiple connections, etc. All this happens after the
connection is established.

Two, the iSCSI and NFSv4.1 protocols have explicit support for iWARP
step-up mode, which supports an initial connection in TCP (i.e.
with RDMA disabled), and switching to RDMA mode dynamically.

The IB and RoCE protocols do not support step-up mode, so in fact
one could argue that iWARP is *better* positioned to support this.

Tom.



Re: [PATCH v1 09/12] xprtrdma: Prepare rpcrdma_ep_post() for RDMA_NOMSG calls

2015-07-10 Thread Tom Talpey

On 7/10/2015 7:29 AM, Devesh Sharma wrote:

we need to honor the max limits of the device by checking
dev_attr.max_sge? A vendor may not support 4 SGEs.


iWARP requires a minimum of 4 send SGEs (draft-hilland-verbs 8.1.3.2)

   An RI MUST support at least four Scatter/Gather Elements per
   Scatter/Gather List when the Scatter/Gather List refers to the Data
   Source of a Send Operation Type or the Data Sink of a Receive
   Operation. An RI is NOT REQUIRED to support more than one
   Scatter/Gather Element per Scatter/Gather List when the
   Scatter/Gather List refers to the Data Source of an RDMA Write.

I'm not certain if IB and RoCE state a similar minimum requirement,
but it seems a very bad idea to have fewer.
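
A minimal sketch of the setup-time check being discussed; the devattr
naming follows the surrounding patches:

	/* Refuse to bring up the transport if the device cannot supply
	 * the four send SGEs that iWARP mandates and this code assumes. */
	if (devattr.max_sge < 4)
		return -ENODEV;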



Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-10 Thread Tom Talpey

On 7/9/2015 6:53 PM, Jason Gunthorpe wrote:

On Thu, Jul 09, 2015 at 06:18:26PM -0400, Doug Ledford wrote:


A lot of this discussion is interesting and has gone off in an area that
I think is worth pursuing in the long run.  However, in the short run,
this patch was a minor cleanup/simplification patch.  These discussions
are moving into complete redesigns and rearchitecting.  Steve, I'm OK
with the cleanup and would prefer that it stay separate from these
larger issues.


So, I was originally of the view the flags change was fine.

But when I realized that it basically hides an
ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) behind an opaque API:

  rdma_get_dma_mr(RDMA_MRR_READ_DEST)

I changed my mind.


Yes, I agree. Creating a writable privileged rmr for the RDMA Read sink
buffer exposes all of physical memory to a simple arithmetic mistake.
This should never be allowed in a production config.

That said, for the use by the NFS/RDMA Server, the risk is significantly
mitigated. The sink MR is never sent to the peer by the NFS upper layer,
it is protected by the connection's PD, and it is enabled only when the
RDMA Read is active. However, the very real risk remains. This should
definitely not be the standard API.


For iWarp the above creates an rkey that can RDMA WRITE to any place in
system memory. If a remote guesses that rkey your system is totally
compromised. So it is insecure, and contrary to the advice in RFC5042.

Seeing rdma_get_dma_mr(RDMA_MRR_READ_DEST) added all over the code
base terrifies me, because it is utterly wrong, and looks harmless.

The mistep, is that enabling iSER for iWarp is not just this trivial
change:

-	device->mr = ib_get_dma_mr(device->pd, IB_ACCESS_LOCAL_WRITE);
+	mr_roles = RDMA_MRR_RECV | RDMA_MRR_SEND | RDMA_MRR_WRITE_SOURCE |
+		   RDMA_MRR_READ_DEST;
+	device->mr = rdma_get_dma_mr(device->pd, mr_roles, 0);



Yep. It's the only safe, correct solution.


 But, it must also follow the path of NFS and use FRMR on iWarp for
 RDMA READ lkeys. This is the real quirk of iWarp, not that the MR
 flags are slightly different.


 From there, it is a logical wish: If Steve is going to FRMR'ize iSER
to be acceptable security-wise, I'd rather see that work done in a
general way. Hence the API discussion.


Well, it's important to realize that NFS already does the Right Thing.
So it isn't actually an urgent issue. There is time to discuss.

Tom.


Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-10 Thread Tom Talpey

On 7/10/2015 1:56 PM, Doug Ledford wrote:

On 07/10/2015 12:11 PM, Jason Gunthorpe wrote:

On Fri, Jul 10, 2015 at 09:22:24AM -0400, Tom Talpey wrote:

and it is enabled only when the RDMA Read is active.


??? How is that done? ib_get_dma_mr is defined to return a remote
usable rkey that is valid from when it it returns until the MR is
destroyed. NFS creates one mr early on, it does not seem to make one
per RDMA READ?


Actually you're right - the NFS server never tears down the MR. So, the
old code is just as vulnerable as the new code.


For the proposed iSER patch the problem is very acute: iser makes a
single PD and phys MR at boot time for each device. This means there
is a single machine wide unchanging rkey that allows remote physical
memory access. An attacker only has to repeatedly open QPs to iSER and
guess rkey values until they find it. Add in likely non-crypto
randomness for rkeys, and I bet it isn't even that hard to do.


The rkeys have a PD, which cannot be forged, so it's not a matter of
attacking, but it is most definitely a system integrity risk. As I
mentioned earlier, a simple arithmetic offset mistake can overwrite
anything.


So this seems to be a catastrophic security failing.


s/security/integrity/ and I agree.


As I recall, didn't the NFSoRDMA stack do even worse in the beginning
(meaning it previously had a memory model that was simply register all
memory on the client and pass the rkey to the server and then tell the
server when/where to read/write, which was subsequently removed in
Chuck's NFSoRDMA cleanups over the last 5 or so kernel versions)?


Yes, Doug, shamefully it did, but

a) That was the client and Jason and I are talking about the server
b) When that code was written (2007), FRMR did not exist, and other
   memory registration methods had abysmal performance
c) It was heavily deprecated with console warnings emitted
d) Chuck did away with it since then.


I'm not bringing this up to downplay the current iSER issue, but to
demonstrate that we have long had a different trust model than the one
in RFC5042.  More below on that.


I'll say something after requoting you.


 From there, it is a logical wish: If Steve is going to FRMR'ize iSER
to be acceptable security-wise, I'd rather see that work done in a
general way. Hence the API discussion.


Well, it's important to realize that NFS already does the Right Thing.
So it isn't actually an urgent issue. There is time to discuss.


It depends on what is going on with iWarp iSER. If the iWarp community
thinks it should go ahead insecurely then fine, put a giant warning in
dmesg and go ahead, otherwise iWarp needs to address this difference
with a proper generic API, which, it appears, must wrap
ib_post_send. Harder job :\

I'm sorry Steve for leading you down the wrong path with these flags,
I did not fully realize exactly what the iWarp difference was at the
start :(


Are there security issues?  Yes.  Are we going to solve them in this
patch set?  No.  Especially since those security issues extend beyond
iSER + iWARP.

There are currently 4 RDMA capable link types/protocols.  Of those 4, 3
enforce locality (IB, OPA, RoCE all require lossless fabrics and cannot
be sent out into the wild internet).  RFC5042 is written with iWARP in
mind because it *can* be sent out over the Internet and therefore it is
possible for someone to place an RNIC card directly on the Internet and
it must therefore deal with all of the attack vectors this entails.

The Linux RDMA stack, as a whole, is nowhere near RFC5042 compliant.
And it isn't just the kernel space at fault here.  Early versions of
opensm used to require that the master and slave machines must have
passwordless root ssh access to each other to copy around the guid2lid
file so when the master failed over, the slave would have an up to date
list.

Because 3 of the 4 RDMA technology types enforce localized clusters, the
majority of the stack and the implementations have utilized the cluster
as the point of trust.  In truth, I think it's simply been the easy way
to do things.  RFC5042 is very clearly a non-trust model where each
machine, and each connection, is assumed to be a threat.

So, long story short(ish):

While this discussion is taking place in the cleanup thread, part of it
involves the preceding patch set: iSER support for iWARP.  Patch 4/5 in
the iSER/iWARP series enables the remote write on device->mr
registration which constitutes the security issue we've been discussing.
  This cleanup patch set, then, is merely guilty of hiding that security
issue, not introducing it.

I think Steve and the other iSER folks should consider the wisdom of
including the iWARP support as it stands.  If they think that the common
use case is in a sequestered cluster, and that the current trust domain
is already not RFC5042 compliant, then I'm OK with enabling iSER support
for iWARP, but I would want there to be a very clear warning in the
dmesg about the trust domain issue.  I might

Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-09 Thread Tom Talpey

On 7/9/2015 1:01 PM, Jason Gunthorpe wrote:

Laid out like this, I think it even means we can nuke the IB DMA API
for these cases. rdma_post_read and rdma_post_complete_read are the
two points that need dma api calls (cache flushes), and they can just
do them internally.

This also tells me that the above call sites must already exist in
every ULP, so we, again, are not really substantially changing
core control flow for the ULP.

Are there more details that wreck the world?


Two things come to mind - PD's, and virtualization.

If there's no ib_get_dma_mr() call, what PD does the region get?
One could argue it inherits the QP's (Emulex proposed such a
PD-less MR in this year's OFS Developer's Workshop). But this
could impose new conditions on ULP's; they would have to be
aware of this affinity and it could affect their QP use.

More importantly, if a guest can post FRWR work requests with
physical addresses, what enforces their validity? The dma_mr
provides a PD but it also allows the host partition to interpose
on the call, setting up an IOMMU mapping, creating a new NIC TPT
mapping, etc. Without this, it may be possible for hostile guest
to forge FRMR's and attack the host, or other guests.

 I didn't explore how errors work, but, I think, errors are just a
 labeling exercise:
   if (wc is error && wc.wrid == read_wrid)
	rdma_error_complete_read(..., read_wrid, wc)

 Error recovery blows up the QP, so we just need to book keep and get
 the MRs accounted for. The driver could do a synchronous clean up of
 whatever mess is left during the next create_qp, or on the PD destroy.

This is a subtle area. If the driver posts silenced completions as
you describe, there may not be a wc to reap. So either the driver or
the ULP will need to post a sentinel, the completion of which indicates
any prior silenced operations have actually done so. This can be
hard to get right. And if the driver posts everything signaled, well,
performance at high IOPS will be a challenge. The ULP is much better
positioned to manage that.
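
For illustration, a common way a ULP manages this: leave most sends
unsignaled but signal every Nth as a sentinel, so one completion retires
the whole run. The batch size and structure here are illustrative:

struct ulp_queue {
	struct ib_qp *qp;
	unsigned int posted;
};

#define SIGNAL_EVERY	32	/* must stay well below the SQ depth */

static int ulp_post_send(struct ulp_queue *q, struct ib_send_wr *wr)
{
	struct ib_send_wr *bad_wr;

	wr->send_flags &= ~IB_SEND_SIGNALED;
	if (++q->posted % SIGNAL_EVERY == 0)
		wr->send_flags |= IB_SEND_SIGNALED;	/* the sentinel */

	return ib_post_send(q->qp, wr, &bad_wr);
}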

I'm with you on the flow control, btw. It's a new rule for the
ULP to obey, but probably not too onerous. Remember though, verbs
today return EAGAIN when the queue you're posting is full (a
terrible choice IMO). So upper layers don't actually need to
count WR's, unless they want to.

Tom.


Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-08 Thread Tom Talpey

On 7/8/2015 3:08 PM, Jason Gunthorpe wrote:

The MR stuff was never really designed, the HW people provided some
capability and the SW side just raw exposed it, thoughtlessly.


Jason, I don't disagree that the API can be improved. I have some
responses to your statements below though.


Why is code using iWarp calling ib_get_dma_mr with
RDMA_MRR_READ_DEST/IB_ACCESS_REMOTE_WRITE? That is insecure.


Because the iWARP protocol requires it, which was very much an
intentional decision. It actually is not insecure, as discussed
in detail in RFC5042. However, it is different from Infiniband.


Why on earth is NFS using frmr to create RDMA READ lkeys?


Because NFS desires to have a single code path that works for all
transports. In addition, using the IB dma_mr as an lkey means that
large I/O (common with NFS) would require multiple RDMA Read
operations, when the page list exceeded the local scatter/gather
limit of the device.


I think when you do that, it quickly becomes clear that iWarp's
problem is not a seemingly minor issue with different flag bits, but
that iWarp *cannot* use local_dma_lkey as a RDMA READ local buffer.
Using ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) is an insecure work
around. So iWarp (and only iWarp) should take the pain of spinning up
temporary local MRs for RDMA READ.


That is entirely the wrong layer to address this. It would prevent
reuse of MRs, and require upper layers to be aware that this was
the case - which is exactly the opposite of what you are trying
to achieve.


This should all be hidden under a common API and it shouldn't be
sprinkled all over the place in NFS, iSER, etc.


Are you arguing that all upper layer storage protocols take a single
common approach to memory registration? That is a very different
discussion.

Tom.


Re: [PATCH v6 01/26] IB/Verbs: Implement new callback query_transport()

2015-04-27 Thread Tom Talpey

On 4/27/2015 2:52 PM, ira.weiny wrote:

On Mon, Apr 27, 2015 at 09:39:05AM +0200, Michael Wang wrote:



On 04/24/2015 05:12 PM, Liran Liss wrote:

From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-


[snip]

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 65994a1..d54f91e 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -75,10 +75,13 @@ enum rdma_node_type {
 };

  enum rdma_transport_type {
+   /* legacy for users */
RDMA_TRANSPORT_IB,
RDMA_TRANSPORT_IWARP,
RDMA_TRANSPORT_USNIC,
-   RDMA_TRANSPORT_USNIC_UDP
+   RDMA_TRANSPORT_USNIC_UDP,
+   /* new transport */
+   RDMA_TRANSPORT_IBOE,


Remove RDMA_TRANSPORT_IBOE - it is not a transport.
ROCE uses IBTA transport.

If any code needs to test for ROCE, it should invoke a specific helper, e.g.,
rdma_protocol_iboe().
This is what you currently call rdma_tech_iboe in patch 02/26.

I think that pretty much everybody agrees that rdma_protocol_*() is a better 
name than rdma_tech_*(), right?
So, let's change this.


Sure, sounds reasonable. Now, about the IBOE: we still need it to
distinguish ports that support IB from those that support Ethernet
without checking the link layer. So what about a new enum for the
protocol type?

Like:

enum rdma_protocol {
RDMA_PROTOCOL_IB,
RDMA_PROTOCOL_IBOE,
RDMA_PROTOCOL_IWARP,
RDMA_PROTOCOL_USNIC_UDP
};

So we could use query_protocol() to ask device provide the protocol
type, and there will be no mixing with the legacy transport type
anymore :-)
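
As a sketch of where that leads (assuming the proposed query_protocol()
callback lands on struct ib_device; names are from this thread, not a
merged API):

	static inline bool rdma_protocol_iboe(struct ib_device *dev,
					      u8 port_num)
	{
		return dev->query_protocol(dev, port_num) ==
		       RDMA_PROTOCOL_IBOE;
	}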


I'm ok with that.  I like introducing a unique namespace which is clearly
different from the previous transport one.


I agree the word transport takes things into the weeds.

But on the topic of naming protocols, I've been wondering, is there
some reason that IBOE is being used instead of RoCE? The IBOE
protocol used to exist and is not the same as the currently
standardized RoCE, right?

Also wondering, why add UDP to USNIC, is there a different USNIC?

Naming multiple layers together seems confusing and maybe in the end
will create more code to deal with the differences. For example, what
token will RoCEv2 take? RoCE_UDP, RoCE_v2 or ... ?


Re: [PATCH v6 01/26] IB/Verbs: Implement new callback query_transport()

2015-04-27 Thread Tom Talpey

On 4/27/2015 5:36 PM, Doug Ledford wrote:

On Mon, 2015-04-27 at 17:16 -0700, Tom Talpey wrote:

On 4/27/2015 2:52 PM, ira.weiny wrote:

On Mon, Apr 27, 2015 at 09:39:05AM +0200, Michael Wang wrote:

On 04/24/2015 05:12 PM, Liran Liss wrote:

[snip]


Like:

enum rdma_protocol {
RDMA_PROTOCOL_IB,
RDMA_PROTOCOL_IBOE,
RDMA_PROTOCOL_IWARP,
RDMA_PROTOCOL_USNIC_UDP
};

So we could use query_protocol() to ask device provide the protocol
type, and there will be no mixing with the legacy transport type
anymore :-)


I'm ok with that.  I like introducing a unique namespace which is clearly
different from the previous transport one.


I agree the word transport takes things into the weeds.

But on the topic of naming protocols, I've been wondering, is there
some reason that IBOE is being used instead of RoCE?


Because back in the day, when RoCE was accepted into the kernel, I'm
pretty sure it was prior to the IBTA's final stamp of approval and
before the name was set on RoCE, so IBoE was chosen upstream as the more
correct name because it properly denoted what it was deemed to truly
be: IB Verbs over Ethernet.


Well history is all well and good, but it seems weird to not use the
current, standard name in new code. It confuses me, anyway, because
it seems like IBOE could easily mean something else.


Also wondering, why add UDP to USNIC, is there a different USNIC?


Yes, there are two transports, one a distinct ethertype and one that
encapsulates USNIC in UDP.


But this new enum isn't about transport, it's about protocol. So is
there one USNIC protocol, with a raw layering and a separate one with
UDP? Or is it one USNIC protocol with two different framings? Seems
there should be at least the USNIC protocol, without the _UDP
decoration, and I don't see it in the enum.




Naming multiple layers together seems confusing and maybe in the end
will create more code to deal with the differences. For example, what
token will RoCEv2 take? RoCE_UDP, RoCE_v2 or ... ?


Uncertain as of now.


Ok, but it's imminent, right? What's the preference/guidance?


Re: [PATCH v6 01/26] IB/Verbs: Implement new callback query_transport()

2015-04-27 Thread Tom Talpey

On 4/27/2015 6:24 PM, Doug Ledford wrote:

On Mon, 2015-04-27 at 17:53 -0700, Tom Talpey wrote:

On 4/27/2015 5:36 PM, Doug Ledford wrote:

On Mon, 2015-04-27 at 17:16 -0700, Tom Talpey wrote:

On 4/27/2015 2:52 PM, ira.weiny wrote:

On Mon, Apr 27, 2015 at 09:39:05AM +0200, Michael Wang wrote:

On 04/24/2015 05:12 PM, Liran Liss wrote:

[snip]


Like:

enum rdma_protocol {
RDMA_PROTOCOL_IB,
RDMA_PROTOCOL_IBOE,
RDMA_PROTOCOL_IWARP,
RDMA_PROTOCOL_USNIC_UDP
};

So we could use query_protocol() to ask device provide the protocol
type, and there will be no mixing with the legacy transport type
anymore :-)


I'm ok with that.  I like introducing a unique namespace which is clearly
different from the previous transport one.


I agree the word transport takes things into the weeds.

But on the topic of naming protocols, I've been wondering, is there
some reason that IBOE is being used instead of RoCE?


Because back in the day, when RoCE was accepted into the kernel, I'm
pretty sure it was prior to the IBTA's final stamp of approval and
before the name was set on RoCE, so IBoE was chosen upstream as the more
correct name because it properly denoted what it was deemed to truly
be: IB Verbs over Ethernet.


Well history is all well and good, but it seems weird to not use the
current, standard name in new code. It confuses me, anyway, because
it seems like IBOE could easily mean something else.


Having some of it refer to things as IBOE and some as ROCE would be
similarly confusing, and switching existing IBOE usage to ROCE would
cause pain to people with out of tree drivers (Lustre is the main one I
know of).  There's not a good answer here.  There's only less sucky
ones.


Hrm. Well, avoiding churn is good but legacies can wear ya down.
IMHO it is worth doing since these are new enums/new patches.





Also wondering, why add UDP to USNIC, is there a different USNIC?


Yes, there are two transports, one a distinct ethertype and one that
encapsulates USNIC in UDP.


But this new enum isn't about transport, it's about protocol. So is
there one USNIC protocol, with a raw layering and a separate one with
UDP? Or is it one USNIC protocol with two different framings? Seems
there should be at least the USNIC protocol, without the _UDP
decoration, and I don't see it in the enum.


Keep in mind that this enum was Liran's response to Michael's original
patch.  In the enum in Michael's patch, there was both USNIC and
USNIC_UDP.


Right! That's why I'm confused. Seems wrong to drop it, right?






Naming multiple layers together seems confusing and maybe in the end
will create more code to deal with the differences. For example, what
token will RoCEv2 take? RoCE_UDP, RoCE_v2 or ... ?


Uncertain as of now.


Ok, but it's imminent, right? What's the preference/guidance?


There is a patchset from Devesh Sharma at Emulex.  It added the RoCEv2
capability.  As I recall, it used a new flag added to the existing port
capabilities bitmask and notably did not modify either the node type or
link layer that are currently used to differentiate between the
different protocols.  That's from memory though, so I could be mistaken.

But that patchset was not written with this patchset in mind, and
merging the two may well change that.  In any case, there is a proposed
spec to follow, so for now that's the preference/guidance (unless this
rework means that we need to depart from the spec on internals for
implementation reasons).


Well, if RoCEv2 uses the same protocol enum, that may introduce new
confusion, for example there will be some new CM handling for UDP encap,
source port selection, and of course vlan/tag assignment, etc. But if
there is support under way, and everyone is clear, then, ok.

Thanks.


Re: [PATCH v6 01/26] IB/Verbs: Implement new callback query_transport()

2015-04-24 Thread Tom Talpey

On 4/24/2015 8:23 AM, Michael Wang wrote:

Add new callback query_transport() and implement for each HW.

Mapping List:
            node-type   link-layer  old-transport   new-transport
...
mlx4        IB_CA       IB/ETH      IB              IB/IBOE
mlx5        IB_CA       IB          IB              IB
...
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 57c9809..b6f2f58 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -262,6 +262,12 @@ out:
return err;
  }

+static enum rdma_transport_type
+mlx5_ib_query_transport(struct ib_device *device, u8 port_num)
+{
+   return RDMA_TRANSPORT_IB;
+}
+



Just noticed that mlx5 is not being coded as RoCE-capable like mlx4.
The mlx5 driver is for the new ConnectX-4, which implements all three
of IB, RoCE and RoCEv2, right? Are those last two not supported?




Re: [PATCH v6 01/26] IB/Verbs: Implement new callback query_transport()

2015-04-24 Thread Tom Talpey

On 4/24/2015 10:35 AM, Michael Wang wrote:



On 04/24/2015 04:29 PM, Tom Talpey wrote:

On 4/24/2015 8:23 AM, Michael Wang wrote:

[snip]

+static enum rdma_transport_type
+mlx5_ib_query_transport(struct ib_device *device, u8 port_num)
+{
+return RDMA_TRANSPORT_IB;
+}
+



Just noticed that mlx5 is not being coded as RoCE-capable like mlx4.
The mlx5 driver is for the new ConnectX-4, which implements all three
of IB, RoCE and RoCEv2, right? Are those last two not supported?


I'm not sure about the details of mlx5, but according to the current
implementation, its transport is IB without a link-layer callback,
which means it doesn't support IBoE...

And there is no method to change the port link-layer type as mlx4 did.


Hal, is that correct?

From the Mellanox web:

http://www.mellanox.com/related-docs/products/IB_Adapter_card_brochure_c_2_3.pdf

ConnectX-4
ConnectX-4 adapter cards with Virtual Protocol Interconnect (VPI),
supporting EDR 100Gb/s InfiniBand and 100Gb/s Ethernet connectivity,
provide the...

Virtual Protocol Interconnect
VPI® flexibility enables any standard networking, clustering, storage,
and management protocol to seamlessly operate over any converged network
leveraging a consolidated software stack. Each port can operate on
InfiniBand, Ethernet, or Data Center Bridging (DCB) fabrics, and
supports IP over InfiniBand (IPoIB), Ethernet over InfiniBand (EoIB)
and RDMA over Converged Ethernet (RoCE and RoCEv2).


Re: [PATCH v4 10/27] IB/Verbs: Reform cm related part in IB-core cma/ucm

2015-04-16 Thread Tom Talpey

On 4/16/2015 11:22 AM, Michael Wang wrote:



On 04/16/2015 04:31 PM, Hefty, Sean wrote:

This is equivalent to today where the checks are per node rather than
per port.

Should all checks here be port 1 based or only certain ones like listen
? For example, in connect/reject/disconnect, don't we already have port
? Guess this can be dealt with later as this is not a regression from
the current implementation.


Yeah, these parts of cma may need more carving in the future, like some
new callbacks for different CM types as Sean suggested.

Maybe directly using 1 could help to highlight the problem ;-)


Only a few checks need to be per device.  I think I pointed those out 
previously.  Testing should show anywhere that we miss fairly quickly, since 
port would still be 0.  For the checks that can be updated to be per port, I 
would rather go ahead and convert them.


Got it, will be changed in next version :-)

To be confirmed:
                    PORT ASSIGNED
rdma_init_qp_attr   Y
rdma_destroy_id     unknown
cma_listen_on_dev   N
cma_bind_loopback   N
rdma_listen         N


Why N? rdma_listen() can be constrained to a single port, right?
And even if wildcarded, it needs to act on multiple ports, which is
to say, it will fail only if no ports are eligible.

Tom.



rdma_connect        Y
rdma_accept         Y
rdma_reject         Y
rdma_disconnect     Y
ib_ucm_add_one      N

Is this list correct?

Regards,
Michael Wang



- Sean








Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

2015-04-10 Thread Tom Talpey

On 4/10/2015 5:06 PM, ira.weiny wrote:

On Fri, Apr 10, 2015 at 01:17:23PM -0600, Jason Gunthorpe wrote:
...
So trying to sum up.

Have we settled on the following capabilities?  Helper function names aside.

/* legacy to communicate to userspace */
RDMA_LINK_LAYER_IB   = 0x0001,
RDMA_LINK_LAYER_ETH  = 0x0002,
RDMA_LINK_LAYER_MASK = 0x000f, /* more bits? */
/* I'm hoping we don't need more bits here */


/* legacy to communicate to userspace */
RDMA_TRANSPORT_IB= 0x0010,
RDMA_TRANSPORT_IWARP = 0x0020,
RDMA_TRANSPORT_USNIC = 0x0040,
RDMA_TRANSPORT_USNIC_UDP = 0x0080,
RDMA_TRANSPORT_MASK  = 0x00f0, /* more bits? */
/* I'm hoping we don't need more bits here */


/* New flags */

RDMA_MGMT_IB_MAD = 0x0100, /* ib_mad module support */
RDMA_MGMT_QP0= 0x0200, /* ib_mad QP0 support */
RDMA_MGMT_IB_SA  = 0x0400, /* ib_sa module support */
   /* NOTE includes IB Mcast */
RDMA_MGMT_IB_CM  = 0x0800, /* ib_cm module support */
RDMA_MGMT_OPA_MAD= 0x1000, /* ib_mad OPA MAD support */
RDMA_MGMT_MASK   = 0x000fff00,


You explicitly say userspace - why would an upper layer need to
know the link, transport and management details? These seem to be
mid-layer matters.


RDMA_ADDR_IB = 0x0010, /* Port does IB AH, PR, Pkey */
RDMA_ADDR_IBoE   = 0x0020, /* Port does IBoE AH, PR, Pkey */
/* Do we need iWarp (TCP) here? */
RDMA_ADDR_IB_MASK= 0x0ff0,


I do see a ULP needing to know the address family needed to pass to
rdma_connect and rdma_listen, so I would add IP, but not iWARP.


RDMA_SEPARATE_READ_SGE   = 0x1000,
RDMA_QUIRKS_MASK = 0x00fff000


This is good, but it also needs an attribute to signal the need for a
remote-writable RDMA Read sink buffer, for today's iWARP.
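
A sketch of how a ULP might consume such a mask (flag names as proposed
above, plus a hypothetical rdma_port_caps() accessor; none of this is a
merged API):

	u32 caps = rdma_port_caps(device, port_num);

	if (caps & RDMA_SEPARATE_READ_SGE)
		max_read_sge = 1;	/* quirk: single RDMA Read SGE */
	if (caps & RDMA_MGMT_IB_CM)
		use_ib_cm = true;	/* connections go through ib_cm */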

Tom.


Re: [PATCH v2 00/17] IB/Verbs: IB Management Helpers

2015-04-08 Thread Tom Talpey

On 4/7/2015 8:25 AM, Michael Wang wrote:

Mapping List:
            node-type   link-layer  old-transport   new-transport
nes         RNIC        ETH         IWARP           IWARP
amso1100    RNIC        ETH         IWARP           IWARP
cxgb3       RNIC        ETH         IWARP           IWARP
cxgb4       RNIC        ETH         IWARP           IWARP
usnic       USNIC_UDP   ETH         USNIC_UDP       USNIC_UDP
ocrdma      IB_CA       ETH         IB              IBOE
mlx4        IB_CA       IB/ETH      IB              IB/IBOE
mlx5        IB_CA       IB          IB              IB
ehca        IB_CA       IB          IB              IB
ipath       IB_CA       IB          IB              IB
mthca       IB_CA       IB          IB              IB
qib         IB_CA       IB          IB              IB


Can I rewind to ask a high-level question - what's the testing
plan for all of this? Do you have folks lined up for verifying
each of these adapters/networks, and what tests will they run?





Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

2015-04-08 Thread Tom Talpey

On 4/8/2015 4:10 PM, Jason Gunthorpe wrote:

On Wed, Apr 08, 2015 at 02:29:46PM -0400, Doug Ledford wrote:

...


rdma_port_get_read_sge(dev, port)
{
	if (rdma_transport_is_iwarp)
		return 1;
	return dev->port[port]->max_sge;
}


Hum, that is nice, but it doesn't quite fit with how the ULP needs to
work. The max limit when creating a WR is the value passed into the
qp_cap, not the device maximum limit.


Agreed, and I will again say that not all devices necessarily support
the same max_sge for all WR types. The current one-size-fits-all API
may make the upper layer think so, but it's possibly being lied to.


To do this properly we need to extend the qp_cap, and that is just too
big a change. A one bit iWarp quirk is OK for now.


Yes, it would be a large-ish change, and I like Doug's choice of word
quirk to capture these as exceptions, until the means for addressing
them is decided.

Overall, I like Doug's proposals, especially from an upper layer
perspective. I might suggest further refining them into categories:
perhaps "management", primarily of interest to the kernel and the
plumbing of connections; and actual "RDMA semantics", of interest
to RDMA consumers.



Re: [PATCH v2 09/17] IB/Verbs: Use helper cap_read_multi_sge() and reform svc_rdma_accept()

2015-04-07 Thread Tom Talpey

On 4/7/2015 8:34 AM, Michael Wang wrote:

  /**
+ * cap_read_multi_sge - Check if the port of device has the capability
+ * RDMA Read Multiple Scatter-Gather Entries.
+ *
+ * @device: Device to be checked
+ * @port_num: Port number of the device
+ *
+ * Return 0 when the port of the device doesn't support
+ * RDMA Read Multiple Scatter-Gather Entries.
+ */
+static inline int cap_read_multi_sge(struct ib_device *device, u8 port_num)
+{
+   return !rdma_transport_iwarp(device, port_num);
+}


This just papers over the issue we discussed earlier. How *many*
entries does the device support? If a device supports one, or two,
is that enough? How does the upper layer know the limit?

This needs an explicit device attribute, to be fixed properly.
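
A sketch of the attribute-based alternative (ib_query_device() is the
era-appropriate call; a max_sge_rd field along these lines was only
added to ib_device_attr later):

	struct ib_device_attr attr;
	int max_read_sge = 1;

	/* Ask the device for its RDMA Read SGE limit instead of
	 * inferring it from the transport type. */
	if (!ib_query_device(device, &attr) && attr.max_sge_rd)
		max_read_sge = attr.max_sge_rd;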


+
+/**
   * cap_ipoib - Check if the port of device has the capability
   * IP over Infiniband.
   *
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index e011027..604d035 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -118,8 +118,8 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,

  static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count)
  {
-	if (rdma_node_get_transport(xprt->sc_cm_id->device->node_type) ==
-	    RDMA_TRANSPORT_IWARP)
+	if (!cap_read_multi_sge(xprt->sc_cm_id->device,
+				xprt->sc_cm_id->port_num))
return 1;
else
		return min_t(int, sge_count, xprt->sc_max_sge);


This is incorrect. The RDMA Read max is not at all the same as the
max_sge. It is a different operation, with a different set of work
request parameters.

In other words, the above same comment applies.



diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 4e61880..e75175d 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -979,8 +979,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
/*
 * Determine if a DMA MR is required and if so, what privs are required
 */
-	switch (rdma_node_get_transport(newxprt->sc_cm_id->device->node_type)) {
-	case RDMA_TRANSPORT_IWARP:
+	if (rdma_transport_iwarp(newxprt->sc_cm_id->device,
+				 newxprt->sc_cm_id->port_num)) {
		newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV;


Do I read this correctly that it is forcing the read with invalidate
capability to on for all iWARP devices? I don't think that is correct,
for the legacy devices you're also supporting.



@@ -992,8 +992,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
} else
need_dma_mr = 0;
-   break;
-   case RDMA_TRANSPORT_IB:
+	} else if (rdma_ib_mgmt(newxprt->sc_cm_id->device,
+				newxprt->sc_cm_id->port_num)) {
	if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
need_dma_mr = 1;
dma_mr_acc = IB_ACCESS_LOCAL_WRITE;


Now I'm even more confused. How is the presence of IB management
related to needing a privileged lmr?




Re: [RFC PATCH 08/11] IB/Verbs: Use management helper has_iwarp() for, iwarp-check

2015-03-31 Thread Tom Talpey

On 3/31/2015 3:39 AM, Michael Wang wrote:

On 03/31/2015 12:35 AM, Jason Gunthorpe wrote:

On Mon, Mar 30, 2015 at 05:10:12PM +0200, Michael Wang wrote:

I found that actually we don't have to touch this one which
only used by HW driver currently.

I'm having a hard time understanding this, the code in question was in

net/sunrpc/xprtrdma/svc_rdma_recvfrom.c

Which is the NFS ULP, not a device driver.


I'm not familiar with this part too :-P but yes, it looks like an ulp
to support NFS.


It's the NFS server itself.



Actually I'm thinking about Doug's idea to use rdma_transport_is_XX()
instead of the current basic helper; maybe using rdma_transport_is_iwarp()
here would be better, since it's actually a feature of iWARP that
RDMA Read supports only one scatter-gather entry.


No, you should expose an attribute to surface the maximum length of
the remote gather list, which varies by adapter as well as protocol.
The fact that iWARP is different from IB is not relevant, and conflates
unrelated properties.


Re: [RFC PATCH 08/11] IB/Verbs: Use management helper has_iwarp() for, iwarp-check

2015-03-31 Thread Tom Talpey

On 3/31/2015 7:41 AM, Michael Wang wrote:

Hi, Tom

Thanks for the comments :-)

On 03/31/2015 01:19 PM, Tom Talpey wrote:


[oops - repeating my reply with full cc's]


[snip]



Actually I'm thinking about Doug's idea to use rdma_transport_is_XX()
instead of the current basic helper; maybe using rdma_transport_is_iwarp()
here would be better, since it's actually a feature of iWARP that
RDMA Read supports only one scatter-gather entry.


No, you should expose an attribute to surface the maximum length of
the remote gather list, which varies by adapter as well as protocol.
The fact that iWARP is different from IB is not relevant, and conflates
unrelated properties.


To be confirmed: so your point is that the max-read-sges will be different
even when the transport is the same iWARP, and that it depends on the
capability of the adapter, correct?


Yes, in fact the iWARP protocol does not preclude multiple read SGEs,
even though most iWARP implementations have chosen to support just one.

Even for multi-SGE-capable adapters, there is a limit of SGL size, based
on the adapter's work request format and other factors. So my argument
is that upper layers can and should query that, not make a blanket
decision based on protocol type.



I currently find only this one place that infers max-read-sges from the
transport type; it looks more like a special case to me than a generic
method we could expose... and also not very related to the IB management
helper concept, IMHO.


It is most certainly not a special case, but you could decide to
introduce it in many ways. I'm not commenting on that.

My main concern is that you do not introduce a new and clumsy is iWARP
rule as an adapter-specific API requirement to expose the RDMA Read SGE
behavior. That's what your initial message seemed to imply?




Re: Proposal for simplifying NFS/RDMA client memory registration

2014-02-28 Thread Tom Talpey

On 2/26/2014 8:44 AM, Chuck Lever wrote:

Hi-

Shirley Ma and I are reviving work on the NFS/RDMA client code base in the 
Linux kernel.  So far we’ve built and run functional tests to determine what is 
working and what is broken.

One complication is the number of memory registration modes supported by the 
RPC/RDMA transport: there are seven.  These were added over the years to 
support particular HCAs or as proof-of-concept.  The transport chooses a 
registration mode at mount time based on what the link HCA supports.

Not all HCAs support all memory registration modes, so our test matrix is quite 
large.  I’d like to propose removing support for one or more of these memory 
registration modes in the name of making it easier to change this code and test 
it without breaking something that we can’t test.

BOUNCEBUFFERS - All HCAs support this mode.  Does not use RDMA READ and WRITE, 
and the client end copies data into place.  RDMA is offloaded, but the data copy is 
not.  I’m told it was never intended for production use.

REGISTER - Safe but relatively slow.  Uses reg_phys_mr verb which is not 
supported in mlx4/mlx5, but all other HCAs/providers can use this mode.

MEM_WINDOWS - Uses bind_mr verb.  Safe, but supports only a narrow range of 
HCAs.

MEM_WINDOWS_ASYNC - Not always safe, and only a narrow range of HCAs is 
supported.

MTHCA_FMR - Uses alloc_fmr verb.  Safe, reasonably fast, but only a narrow 
range of older HCAs is supported.


The MTHCA FMR is not completely safe - it protects only on page
boundaries, therefore the neighboring bytes are vulnerable to
silent corruption (writes) and exposure (reads).

It is quite correct that they are supported on only a specific
set of legacy Mellanox HCA. You should consider removing the
code that looked for this PCI ID and attempted to alter the
device's wire MTU, to overcome another of its limitations.



FRMR - Safe, generally fast.  Currently the preferred registration mode, but is 
not supported with some older HCAs/providers.


This should be, by far, the preferred mode. Also, if I recall
correctly, the server depends on this mode being available/supported.
However, it may not be supported by Soft iWARP. Physical addressing
is used.
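
For reference, the FRMR flow with the verbs API of that era looks
roughly like this (an illustrative sketch, not the rpcrdma code itself):

	/* One-time setup: allocate a fast-reg MR and its page list. */
	struct ib_mr *mr = ib_alloc_fast_reg_mr(pd, max_pages);
	struct ib_fast_reg_page_list *pl =
		ib_alloc_fast_reg_page_list(device, max_pages);

	/* Per I/O: post IB_WR_FAST_REG_MR to bind the buffer to a fresh
	 * rkey, then IB_WR_LOCAL_INV when the I/O completes, so the
	 * peer's access window is strictly bounded. */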



ALLPHYSICAL - Usually fast, but not safe as it exposes client memory.  All HCAs 
support this mode.


Not safe is an understatement. It exposes all of client physical
memory to the peer, for both read and write. A simple pointer error
on the server will silently corrupt the client. This mode was
intended only for testing, and in experimental deployments.


Tom.




I propose removing BOUNCEBUFFERS since it is not intended for production use.

I propose removing ALLPHYSICAL and MEM_WINDOWS_ASYNC as they are not generally 
safe.  RFC 5666 suggests that unsafe memory registration modes be avoided.

I propose removing MEM_WINDOWS as it adds complexity without adding a lot of 
HCA compatibility.

I propose removing MTHCA_FMR as I’m told it is hard to obtain HCAs we would 
need for testing this registration mode, and these are all old adapters anyway.

This leaves NFS/RDMA client support for REGISTER and FRMR, which should cover 
all existing HCAs, and it is easy to test both of these memory registration 
modes with just one or two well-picked HCAs.

We would contribute these changes to the client code base.  The NFS/RDMA server 
code could use similar attention, but we are not volunteering to change it at 
this time.

Thoughts/comments?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com








Re: Helps to Decode rpc_debug Output

2013-08-26 Thread Tom Talpey

On 8/21/2013 11:55 AM, Wendy Cheng wrote:

On Thu, Aug 15, 2013 at 11:08 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote:

On Thu, Aug 15, 2013 at 5:46 AM, Tom Talpey t...@talpey.com wrote:

On 8/14/2013 8:14 PM, Wendy Cheng wrote:


Longer version of the question:
I'm trying to enable NFS-RDMA on an embedded system (based on 2.6.38
kernel) as a client. The IB stacks are taken from OFED 1.5.4. NFS
server is a RHEL 6.3 Xeon box. The connection uses the mlx4 driver.
Memory registration is RPCRDMA_ALLPHYSICAL. There are many issues so
far but I do manage to get nfs mount working. Simple file operations
(such as ls, file read/write, scp, etc) seem to work as well.




Yay ... got this up .. amazingly on a uOS that does not have many of
the conventional kernel debug facilities.


Congrats!


One thing I'm still scratching my head over is that ... looking at the
raw IOPS, I don't see a dramatic difference between NFS-RDMA and NFS
over IPOIB (TCP).


Sounds like your bottleneck lies in some other component. What's the
storage, for example? RDMA won't do a thing to improve a slow disk.
Or, what kind of IOPS rate are you seeing? If these systems aren't
generating enough load to push a CPU limit, then shifting the protocol
on the same link might not yield much.


However, the total run time differs greatly. NFS
over RDMA seems to take a much longer time to finish (vs. NFS over
IPOIB). Not sure why that is ... maybe the constant
connect/disconnect triggered by reestablish_timeout? The connection
re-establish is known to be expensive on this uOS.


Um, yes, of course. Fix that before drawing any conclusions.

Tom.


Re: Helps to Decode rpc_debug Output

2013-08-15 Thread Tom Talpey

On 8/14/2013 8:14 PM, Wendy Cheng wrote:

Longer version of the question:
I'm trying to enable NFS-RDMA on an embedded system (based on 2.6.38
kernel) as a client. The IB stacks are taken from OFED 1.5.4. NFS
server is a RHEL 6.3 Xeon box. The connection uses the mlx4 driver.
Memory registration is RPCRDMA_ALLPHYSICAL. There are many issues so
far but I do manage to get nfs mount working. Simple file operations
(such as ls, file read/write, scp, etc) seem to work as well.


You're probably seeing connection loss from bad RDMA handles, which
come into play when you send large r/w traffic. The fact that small
i/o such as simple ls works, and non-NFS traffic such as scp does, means the
network itself is ok.

 While trying to run iozone to see whether the performance gain can be
 justified for the development efforts, the program runs until it
 reaches 2MB file size - at that point, RDMA CM sends out
 TIMEWAIT_EXIT event, the xprt is disconnected, and all IOs on that
 share hang. IPOIB still works though. Not sure what would be the best
 way to debug this.

I would suggest enabling RPC transport debugging, and any tracing
in the IB stack itself, looking to see if you can find any patterns.
You may want to look at the server side, too.
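
For example (assuming the standard rpcdebug tool is available on the
client):

	# enable RPC transport-level debug messages
	rpcdebug -m rpc -s xprt trans
	# ... reproduce the hang, inspect dmesg, then turn it off
	rpcdebug -m rpc -c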

Unfortunately, because you are using Infiniband, packet capture is
going to be next to impossible. You might try using an iWARP adapter,
or one which can be sniffed, if you suspect a traffic issue.

Why did you replace the Linux IB stack with OFED? Did you also take the
NFS/RDMA from that package, and if so are you sure that it all is
is working properly? Doesn't 2.6.38 already have all this?

Tom.


Re: NFS over RDMA benchmark

2013-04-30 Thread Tom Talpey

On 4/30/2013 1:09 AM, Yan Burman wrote:




-----Original Message-----
From: J. Bruce Fields [mailto:bfie...@fieldses.org]
Sent: Sunday, April 28, 2013 17:43
To: Yan Burman
Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
linux-...@vger.kernel.org; Or Gerlitz
Subject: Re: NFS over RDMA benchmark

On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote:

On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com wrote:

I've been trying to do some benchmarks for NFS over RDMA and I seem to
only get about half of the bandwidth that the HW can give me.
My setup consists of 2 servers each with 16 cores, 32Gb of memory, and
Mellanox ConnectX3 QDR card over PCI-e gen3.
These servers are connected to a QDR IB switch. The backing storage on
the server is tmpfs mounted with noatime.
I am running kernel 3.5.7.
When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
same block sizes (4-512K). Running over IPoIB-CM, I get 200-980MB/sec.
...
I am trying to get maximum performance from a single server - I used 2
processes in fio test - more than 2 did not show any performance boost.
I tried running fio from 2 different PCs on 2 different files, but the sum of
the two is more or less the same as running from single client PC.




I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is also 
way higher now).
For some reason when I had intel IOMMU enabled, the performance dropped 
significantly.
I now get up to ~95K IOPS and 4.1GB/sec bandwidth.


Excellent, but is that 95K IOPS a typo? At 4KB, that's less than 400MBps.
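
(Arithmetic: 95,000 IOPS x 4 KiB is roughly 390 MB/s, an order of
magnitude below the 4.1 GB/s streaming figure.)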

What is the client CPU percentage you see under this workload, and
how different are the NFS/RDMA and NFS/IPoIB overheads?


Now I will take care of the issue that I am running only at 40Gbit/s instead of 
56Gbit/s, but that is another unrelated problem (I suspect I have a cable 
issue).

This is still strange, since ib_send_bw with intel iommu enabled did get up to 
4.5GB/sec, so why did intel iommu affect only nfs code?


You'll need to do more profiling to track that down. I would suspect
that ib_send_bw is using some sort of direct hardware access, bypassing
the IOMMU management and possibly performing no dynamic memory registration.

The NFS/RDMA code goes via the standard kernel DMA API, and correctly
registers/deregisters memory on a per-i/o basis in order to provide
storage data integrity. Perhaps there are overheads in the IOMMU
management which can be addressed.
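
A sketch of that per-I/O path (the ib_dma_* wrappers route through the
standard DMA API, so an active IOMMU pays a map/unmap cost on every
I/O; illustrative only):

	int count;

	/* With an IOMMU enabled, this builds IOMMU page-table entries. */
	count = ib_dma_map_sg(device, sgl, nents, DMA_BIDIRECTIONAL);
	if (!count)
		return -EIO;
	/* ... post work requests referencing the mapped SGL ... */
	ib_dma_unmap_sg(device, sgl, nents, DMA_BIDIRECTIONAL);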


Re: NFS over RDMA benchmark

2013-04-30 Thread Tom Talpey

On 4/30/2013 1:09 AM, Yan Burman wrote:

I now get up to ~95K IOPS and 4.1GB/sec bandwidth.
...
 ib_send_bw with intel iommu enabled did get up to 4.5GB/sec


BTW, you may want to verify that these are the same GB. Many
benchmarks say KB/MB/GB when they really mean KiB/MiB/GiB.

At GB/GiB, the difference is about 7.5%, very close to the
difference between 4.1 and 4.5.
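
(Worked out: 4.5 x 10^9 bytes/sec is about 4.19 GiB/sec, so a 4.5
GB/sec and a 4.1 GiB/sec report could describe nearly the same rate.)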

Just a thought.


Re: NFS over RDMA benchmark

2013-04-30 Thread Tom Talpey

On 4/30/2013 10:23 AM, Yan Burman wrote:

-Original Message-
From: Tom Talpey [mailto:t...@talpey.com]

On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote:

I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is
also way higher now).
For some reason when I had intel IOMMU enabled, the performance
dropped significantly.
I now get up to ~95K IOPS and 4.1GB/sec bandwidth.


Excellent, but is that 95K IOPS a typo? At 4KB, that's less than 400MBps.



That is not a typo. I get 95K IOPS with randrw test with block size 4K.
I get 4.1GBps with block size 256K randread test.


Well, then I suggest you focus on whether you are satisfied with a
high bandwidth goal or a high IOPS goal. They are two very different
things, and clearly there are still significant issues to track down
in the server.


What is the client CPU percentage you see under this workload, and how
different are the NFS/RDMA and NFS/IPoIB overheads?


NFS/RDMA has about 20-30% more CPU usage than NFS/IPoIB, but RDMA has almost
twice the bandwidth of IPoIB.


So, for 125% of the CPU, RDMA is delivering 200% of the bandwidth.
A common reporting approach is to calculate cycles per Byte (roughly,
CPU/MB/sec), and you'll find this can be a great tool for comparison
when overhead is a consideration.
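
An illustrative calculation, using the 16 cores and ~25% utilization
reported in this thread plus an assumed 2.6 GHz clock: 16 x 2.6e9 x
0.25 is about 1.0e10 cycles/sec, and at 4.1 GB/sec that works out to
roughly 2.5 cycles per byte.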


Overall, CPU usage gets up to about 20% for randread and 50% for randwrite.


This is *client* CPU? Writes require the server to take additional
overhead to make RDMA Read requests, but the client side is doing
practically the same thing for the read vs write path. Again, you
may want to profile more deeply to track that difference down.



Re: NFS over RDMA benchmark

2013-04-25 Thread Tom Talpey

On 4/25/2013 1:18 PM, Wendy Cheng wrote:

On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey t...@talpey.com wrote:

On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng s.wendy.ch...@gmail.com
wrote:



So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1
tar ball) ... Here is a random thought (not related to the rb tree
comment).

The inflight packet count seems to be controlled by
xprt_rdma_slot_table_entries that is currently hard-coded as
RPCRDMA_DEF_SLOT_TABLE (32) (?).  I'm wondering whether it could help
with the bandwidth number if we pump it up, say 64 instead ? Not sure
whether FMR pool size needs to get adjusted accordingly though.


1)

The client slot count is not hard-coded, it can easily be changed by
writing a value to /proc and initiating a new mount. But I doubt that
increasing the slot table will improve performance much, unless this is
a small-random-read, and spindle-limited workload.


Hi Tom !

It was a shot in the dark :) .. as our test bed has not been set up
yet. However, since I'll be working on (very) slow clients, increasing
this buffer is still interesting (to me). I don't see where it is
controlled by a /proc value (?) - but that is not a concern at this


The entries show up in /proc/sys/sunrpc (IIRC). The one you're looking
for is called rdma_slot_table_entries.
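
For example (the value is read at mount time, so set it before
mounting; 20049 is the conventional NFS/RDMA port):

	echo 64 > /proc/sys/sunrpc/rdma_slot_table_entries
	mount -t nfs -o proto=rdma,port=20049 server:/export /mnt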


moment as /proc entry is easy to add. More questions on the server
though (see below) ...



2)

The observation appears to be that the bandwidth is server CPU limited.
Increasing the load offered by the client probably won't move the needle,
until that's addressed.



Could you give more hints on which part of the path is CPU limited ?


Sorry, I don't. The profile showing 25% of the 16-core, 2-socket server
spinning on locks is a smoking, flaming gun though. Maybe Tom Tucker
has some ideas on the srv rdma code, but it could also be in the sunrpc
or infiniband driver layers, can't really tell without the call stacks.


Is there a known Linux-based filesystem that is reasonably tuned for
NFS-RDMA ? Any specific filesystem features would work well with
NFS-RDMA ? I'm wondering when disk+FS are added into the
configuration, how much advantages would NFS-RDMA get when compared
with a plain TCP/IP, say IPOIB on CM , transport ?


NFS-RDMA is not really filesystem dependent, but certainly there are
considerations for filesystems to support NFS, and of course the goal in
general is performance. NFS-RDMA is a network transport, applicable to
both client and server. Filesystem choice is a server consideration.

I don't have a simple answer to your question about how much better
NFS-RDMA is over other transports. Architecturally, a lot. In practice,
there are many, many variables. Have you seen RFC5532, that I cowrote
with the late Chet Juszczak? You may find it's still quite relevant.
http://tools.ietf.org/html/rfc5532


Re: NFS over RDMA benchmark

2013-04-25 Thread Tom Talpey

On 4/25/2013 3:01 PM, Phil Pishioneri wrote:

On 4/25/13 1:18 PM, Wendy Cheng wrote:

On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey t...@talpey.com wrote:

1)

The client slot count is not hard-coded, it can easily be changed by
writing a value to /proc and initiating a new mount. But I doubt that
increasing the slot table will improve performance much, unless this is
a small-random-read, and spindle-limited workload.

It was a shot in the dark :)  .. as our test bed has not been setup
yet .However, since I'll be working on (very) slow clients, increasing
this buffer is still interesting (to me). I don't see where it is
controlled by a /proc value (?) - but that is not a concern at this
moment as /proc entry is easy to add. More questions on the server
though (see below) ...


Might there be confusion between the RDMA slot table and the TCP/UDP
ones (which have proc entries under /proc/sys/sunrpc)?



No, the xprtrdma.ko creates similar slot table controls when it loads.
See the names below, prefixed with "rdma_":


tmt@Home:~$ ls /proc/sys/sunrpc
max_resvport  nfsd_debug  nlm_debug  tcp_fin_timeout             tcp_slot_table_entries  udp_slot_table_entries
min_resvport  nfs_debug   rpc_debug  tcp_max_slot_table_entries  transports
tmt@Home:~$ sudo insmod xprtrdma
tmt@Home:~$ ls /proc/sys/sunrpc
max_resvport  nlm_debug                  rdma_memreg_strategy     tcp_fin_timeout             udp_slot_table_entries
min_resvport  rdma_inline_write_padding  rdma_pad_optimize        tcp_max_slot_table_entries
nfsd_debug    rdma_max_inline_read       rdma_slot_table_entries  tcp_slot_table_entries
nfs_debug     rdma_max_inline_write      rpc_debug                transports
tmt@Home:~$





Re: NFS over RDMA benchmark

2013-04-24 Thread Tom Talpey

On 4/24/2013 2:04 PM, Wendy Cheng wrote:

On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote:

On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields bfie...@fieldses.org wrote:

On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:

On Wed, Apr 24, 2013 at 12:35:03PM +, Yan Burman wrote:




Perf top for the CPU with high tasklet count gives:

  samples  pcnt   RIP       function             DSO
  _______  _____  ________  ___________________  _____________

  2787.00  24.1%  81062a00  mutex_spin_on_owner  /root/vmlinux


I guess that means lots of contention on some mutex?  If only we knew
which one perf should also be able to collect stack statistics, I
forget how.


Googling around ... I think we want:

 perf record -a --call-graph
 (give it a chance to collect some samples, then ^C)
 perf report --call-graph --stdio



I have not looked at NFS RDMA (and 3.x kernel) source yet. But see
that rb_prev up in the #7 spot? Do we have a red-black tree somewhere
in the paths? Trees like that require extensive locking.



So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1
tar ball) ... Here is a random thought (not related to the rb tree
comment).

The inflight packet count seems to be controlled by
xprt_rdma_slot_table_entries that is currently hard-coded as
RPCRDMA_DEF_SLOT_TABLE (32) (?).  I'm wondering whether it could help
with the bandwidth number if we pump it up, say 64 instead ? Not sure
whether FMR pool size needs to get adjusted accordingly though.


1)

The client slot count is not hard-coded, it can easily be changed by
writing a value to /proc and initiating a new mount. But I doubt that
increasing the slot table will improve performance much, unless this is
a small-random-read, and spindle-limited workload.

2)

The observation appears to be that the bandwidth is server CPU limited.
Increasing the load offered by the client probably won't move the needle,
until that's addressed.




In short, if anyone has benchmark setup handy, bumping up the slot
table size as the following might be interesting:

--- ofa_kernel-1.5.4.1.orig/include/linux/sunrpc/xprtrdma.h	2013-03-21 09:19:36.233006570 -0700
+++ ofa_kernel-1.5.4.1/include/linux/sunrpc/xprtrdma.h	2013-04-24 10:52:20.934781304 -0700
@@ -59,7 +59,7 @@
   * a single chunk type per message is supported currently.
   */
  #define RPCRDMA_MIN_SLOT_TABLE (2U)
-#define RPCRDMA_DEF_SLOT_TABLE (32U)
+#define RPCRDMA_DEF_SLOT_TABLE (64U)
  #define RPCRDMA_MAX_SLOT_TABLE (256U)

  #define RPCRDMA_DEF_INLINE  (1024) /* default inline max */

-- Wendy


