Re: [PATCH v3 5/6] svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies
Three comments.

On 12/7/2015 12:43 PM, Chuck Lever wrote:
> To support the NFSv4.1 backchannel on RDMA connections, add a
> capability for receiving an RPC/RDMA reply on a connection
> established by a client.

(snip)

> +/* By convention, backchannel calls arrive via rdma_msg type

"By convention" is ok, but it's important to note that this is actually
not "by protocol". Therefore, the following check may reject valid
messages. Even though it is unlikely an implementation will insert
chunks, it's not illegal, and ignoring them would be less harmful. So
I'm going to repeat my earlier observation that the three checks below
should be removed:

> + * messages, and never populate the chunk lists. This makes
> + * the RPC/RDMA header small and fixed in size, so it is
> + * straightforward to check the RPC header's direction field.
> + */
> +static bool
> +svc_rdma_is_backchannel_reply(struct svc_xprt *xprt,
> +			      struct rpcrdma_msg *rmsgp)
> +{
> +	__be32 *p = (__be32 *)rmsgp;
> +
> +	if (!xprt->xpt_bc_xprt)
> +		return false;
> +
> +	if (rmsgp->rm_type != rdma_msg)
> +		return false;

These three:

> +	if (rmsgp->rm_body.rm_chunks[0] != xdr_zero)
> +		return false;
> +	if (rmsgp->rm_body.rm_chunks[1] != xdr_zero)
> +		return false;
> +	if (rmsgp->rm_body.rm_chunks[2] != xdr_zero)
> +		return false;
> +

(snip)

> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index a1fd74a..3895574 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -309,6 +309,8 @@ struct rpcrdma_buffer {
>  	u32			rb_bc_srv_max_requests;
>  	spinlock_t		rb_reqslock;	/* protect rb_allreqs */
>  	struct list_head	rb_allreqs;
> +
> +	u32			rb_bc_max_requests;

Why does this need to be u32? Shouldn't it be an int, and also the
rb_bc_srv_max_requests just above? The forward channel max_requests
are int, btw.

Tom.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
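For illustration, the relaxed predicate Tom is suggesting can be sketched in userspace like this. The struct layout and names below are simplified stand-ins (not the kernel's types): the check keeps only the backchannel-transport and rdma_msg tests, and tolerates populated chunk lists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Simplified stand-in for the on-the-wire RPC/RDMA header prefix. */
struct rdma_hdr {
	uint32_t rm_xid;	/* network byte order */
	uint32_t rm_vers;
	uint32_t rm_credit;
	uint32_t rm_type;	/* rdma_msg == 0 */
	/* chunk lists follow; they may or may not be empty */
};

enum { RDMA_MSG = 0 };

/* The relaxed check: accept an rdma_msg even when chunks are present,
 * since inserting chunks is unusual but not illegal.
 */
static bool is_backchannel_candidate(bool have_bc_xprt,
				     const struct rdma_hdr *h)
{
	if (!have_bc_xprt)
		return false;
	return ntohl(h->rm_type) == RDMA_MSG;
}
```

The trade-off Tom notes still applies: without the chunk-list checks the header is no longer fixed-size, so locating the RPC direction field requires walking the chunk lists.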
Re: [PATCH v3 1/6] svcrdma: Do not send XDR roundup bytes for a write chunk
Two small comments.

On 12/7/2015 12:42 PM, Chuck Lever wrote:
> Minor optimization: when dealing with write chunk XDR roundup, do
> not post a Write WR for the zero bytes in the pad. Simply update
> the write segment in the RPC-over-RDMA header to reflect the extra
> pad bytes.
>
> The Reply chunk is also a write chunk, but the server does not use
> send_write_chunks() to send the Reply chunk. That's OK in this case:
> the server Upper Layer typically marshals the Reply chunk contents
> in a single contiguous buffer, without a separate tail for the XDR
> pad.

The comments and the variable naming refer to "chunks" but what is
really meant is "segments."

> The existing code sends only one xdr_write_chunk per RPC reply.
>
> The fix assumes this as well. When the XDR pad in the first write
> chunk is reached, the assumption is the Write list is complete and
> send_write_chunks() returns.
>
> That will remain a valid assumption until the server Upper Layer can
> support multiple bulk payload results per RPC.
>
> Signed-off-by: Chuck Lever
> ---
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c |    7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index 969a1ab..bad5eaa 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
>  					arg_ch->rs_handle,
>  					arg_ch->rs_offset,
>  					write_len);
> +
> +		/* Do not send XDR pad bytes */

It might be clearer to say "marshal" instead of "send".

> +		if (chunk_no && write_len < 4) {

Why is it necessary to check for chunk_no == 0? It is not possible
for leading data to ever be padding, nor is a leading data element
ever less than 4 bytes long. Right?

Tom.

> +			chunk_no++;
> +			break;
> +		}
> +
>  		chunk_off = 0;
>  		while (write_len) {
>  			ret = send_write(xprt, rqstp,
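As background for the pad discussion: XDR rounds opaque lengths up to a 4-byte boundary, so the pad the patch stops marshaling is at most 3 zero bytes. A tiny userspace model of that arithmetic (illustration only, not kernel code):

```c
#include <assert.h>

/* XDR pads opaque data to a 4-byte boundary; these are the zero
 * bytes the patch avoids posting a Write WR for.
 */
static unsigned int xdr_pad_len(unsigned int len)
{
	return (4 - (len & 3)) & 3;
}

/* Length of the payload as it appears on the wire, pad included. */
static unsigned int xdr_rounded_len(unsigned int len)
{
	return len + xdr_pad_len(len);
}
```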
Re: [PATCH v2 6/7] xprtrdma: Add class for RDMA backwards direction transport
On 11/30/2015 5:25 PM, Chuck Lever wrote:
> To support the server-side of an NFSv4.1 backchannel on RDMA
> connections, add a transport class that enables backward direction
> messages on an existing forward channel connection.

> +static void *
> +xprt_rdma_bc_allocate(struct rpc_task *task, size_t size)
> +{
> +	struct rpc_rqst *rqst = task->tk_rqstp;
> +	struct svc_rdma_op_ctxt *ctxt;
> +	struct svcxprt_rdma *rdma;
> +	struct svc_xprt *sxprt;
> +	struct page *page;
> +
> +	if (size > PAGE_SIZE) {
> +		WARN_ONCE(1, "failed to handle buffer allocation (size %zu)\n",
> +			  size);

You may want to add more context to that rather cryptic string, at
least the function name. Also, it's not exactly "failed to handle",
it's an invalid size. Why would this ever happen? Why even log it?

> +static int
> +rpcrdma_bc_send_request(struct svcxprt_rdma *rdma, struct rpc_rqst *rqst)
> +{
...
> +
> +drop_connection:
> +	dprintk("Failed to send backchannel call\n");

Ditto on the prefix / function context. And also...

> +	dprintk("%s: sending request with xid: %08x\n",
> +		__func__, be32_to_cpu(rqst->rq_xid));
...
> +	dprintk("RPC:       %s: xprt %p\n", __func__, xprt);

The format strings for many of the dprintk's are somewhat
inconsistent. Some start with "RPC", some with the function name, and
some (in other patches of this series) with "svcrdma". Confusing, and
perhaps hard to pick out of the log.
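One hypothetical way to address the prefix inconsistency is a single format-string macro that prepends the module name, with each call site supplying `__func__`. This is only an illustration of the convention being asked for, not something in the patch; the macro and helper names below are made up.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical convention: one macro supplies the "svcrdma: <func>: "
 * prefix so every debug message is uniform and easy to grep.
 */
#define SVCRDMA_FMT(fmt) "svcrdma: %s: " fmt

/* Format a message into a buffer so the result can be inspected. */
static int format_dbg(char *buf, size_t len, const char *func,
		      unsigned int xid)
{
	return snprintf(buf, len,
			SVCRDMA_FMT("sending request with xid: %08x"),
			func, xid);
}
```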
Re: Future of FMR support, was: Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR
On 11/24/2015 1:52 AM, Christoph Hellwig wrote:
> On Mon, Nov 23, 2015 at 07:57:42PM -0500, Tom Talpey wrote:
>> On 11/23/2015 5:14 PM, Chuck Lever wrote:
>>> FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
>>> is a synchronous verb. However, some improvements can be made here.
>>
>> I thought FMR support was about to be removed in the core.
>
> Seems like everyone would love to kill it, but no one dares to do it
> just yet. Reasons to keep FMRs:
>
>  - mthca doesn't support FRs but hasn't been staged out
>  - RDS only supports FMRs for the IB transports (it does support FRs
>    for an entirely separate iWarp transport. I'd love to know why we
>    can't just use that iWarp transport for IB as well).
>  - mlx4 hardware might be slower with FRs than FMRs (Or mentioned
>    this in reply to the iSER remote invalidation series).

Ok, let me add then. Reasons to kill FMRs:

- FMRs are UNSAFE. They protect on page boundaries, not bytes, so a
  length error on an operation can corrupt data outside the i/o
  buffer remotely. To say nothing of maliciousness.

- FMRs are SLOWER. The supposed performance improvement of FMR comes
  from the "mempool" that is often placed in front of them. These
  pools cache remotely registered regions, in the hope of reusing
  them without having to invalidate in between. See the first bullet
  for the ramifications of that.

- FMRs are SYNCHRONOUS. The FMR verbs make direct calls into the
  verbs, which block and/or do not return control to the caller
  promptly as do work queue operations. The processor spends more
  time servicing the device, typically at raised irq.

- FMRs are PROPRIETARY. Only one vendor supports them, only one of
  their NICs has no decent alternative, and that NIC is functionally
  obsolete.

If storage stacks continue to support them, I personally feel it is
critical to carefully document the risks to customer data as part of
shipping the code.

Tom.
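The page-boundary exposure in Tom's first bullet is easy to quantify. A small model (illustrative arithmetic only, not kernel code): because an FMR maps whole pages, registering `len` bytes at `offset` within a page leaves the remainder of the first and last pages remotely accessible.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes outside the i/o buffer that an FMR registration exposes:
 * the region is rounded out to page boundaries, so everything from
 * the start of the first page to the end of the last page is
 * remotely accessible. page_size must be a power of two.
 */
static size_t fmr_exposed_bytes(size_t offset, size_t len, size_t page_size)
{
	size_t start = offset & ~(page_size - 1);
	size_t end = (offset + len + page_size - 1) & ~(page_size - 1);

	return (end - start) - len;
}
```

A page-aligned, page-sized buffer exposes nothing extra; a 512-byte buffer in a 4 KiB page exposes the other 3.5 KiB of that page.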
Re: [PATCH 2/9] IB: add a proper completion queue abstraction
On 11/24/2015 2:03 AM, Jason Gunthorpe wrote:
> On Mon, Nov 23, 2015 at 06:35:28PM -0800, Caitlin Bestler wrote:
>> Are there actual HCAs that make this mistake?
>
> All IB HCAs have this behavior and require apps to see a send CQ
> completion before making any statements about the state of the send Q
> or buffers handed over to the HCA. Tom and I have seen this in real
> systems under proper stress conditions.
>
> [Which is why I am so certain about this, because when I first hit it
> years ago I dug into the spec and figured out it was not a HW bug I
> was looking at]

To be clear, I saw the reply-completion-before-request-completion on
Windows, not Linux, but the principle is identical. It's simply a fact
of life on a multiprocessor, unless you want to throw in locks and
synchronization rules that drivers have to follow to enforce ordered
completions across queues. Which, trust me, you don't.

In Windows SMB Direct, we added reference counts around pretty much
every verb interaction associated with each upper layer operation, and
did not retire them until all refcounts went to zero. It is
excruciatingly correct yet performs incredibly well.

Tom.
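The refcounting discipline Tom describes can be sketched like this. This is an assumed shape for illustration, not the actual SMB Direct code: every verb interaction for an operation holds a reference, and the operation retires only when the last one (send completion, receive completion, upper-layer release) drops, regardless of the order completions arrive in.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* One outstanding upper-layer operation; refs counts the verb
 * interactions (posted send, expected reply, ...) still in flight.
 */
struct rdma_op {
	atomic_int refs;
	bool retired;
};

static void op_init(struct rdma_op *op, int initial_refs)
{
	atomic_init(&op->refs, initial_refs);
	op->retired = false;
}

/* Called from each completion path; the last put retires the op,
 * so buffer reuse is safe no matter which completion came first.
 */
static void op_put(struct rdma_op *op)
{
	if (atomic_fetch_sub(&op->refs, 1) == 1)
		op->retired = true;
}
```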
Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method
On 11/24/2015 5:59 AM, Sagi Grimberg wrote:
> As I see it, if we don't wait for local-invalidate to complete before
> unmap and IO completion (and no one does)

For the record, that is false. Windows quite strictly performs these
steps, and delivers millions of real 4K IOPS over SMB Direct.

> Waiting for local invalidate to complete would be a really big
> sacrifice for our storage ULPs.

Not waiting would also be a sacrifice, by compromising data integrity.
It's a tough tradeoff, but if you choose to go that way it will be
critical to be honest about the consequences to the data.

Tom.
Re: [PATCH v1 2/8] svcrdma: Define maximum number of backchannel requests
On 11/23/2015 5:20 PM, Chuck Lever wrote:
> Extra resources for handling backchannel requests have to be
> pre-allocated when a transport instance is created. Set a limit.
>
> Signed-off-by: Chuck Lever
> ---
>  include/linux/sunrpc/svc_rdma.h          |    5 +++++
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |    6 +++++-
>  2 files changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index f869807..478aa30 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -178,6 +178,11 @@ struct svcxprt_rdma {
>  #define RPCRDMA_SQ_DEPTH_MULT   8
>  #define RPCRDMA_MAX_REQUESTS    32
>  #define RPCRDMA_MAX_REQ_SIZE    4096
> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)

Why is this a config option? Why wouldn't you always want this? It's
needed for any post-1990 NFS dialect.

> +#define RPCRDMA_MAX_BC_REQUESTS	8

Why a constant 8? The forward channel value is apparently
configurable, just a few lines down.

> +#else
> +#define RPCRDMA_MAX_BC_REQUESTS	0
> +#endif
>  #define RPCSVC_MAXPAYLOAD_RDMA	RPCSVC_MAXPAYLOAD
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index b348b4a..01c7b36 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -923,8 +923,10 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>  					  (size_t)RPCSVC_MAXPAGES);
>  	newxprt->sc_max_sge_rd = min_t(size_t, devattr.max_sge_rd,
>  				       RPCSVC_MAXPAGES);
> +	/* XXX: what if HCA can't support enough WRs for bc operation? */
>  	newxprt->sc_max_requests = min((size_t)devattr.max_qp_wr,
> -				       (size_t)svcrdma_max_requests);
> +				       (size_t)(svcrdma_max_requests +
> +						RPCRDMA_MAX_BC_REQUESTS));
>  	newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_max_requests;
>
>  	/*
> @@ -964,7 +966,9 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>  	qp_attr.event_handler = qp_event_handler;
>  	qp_attr.qp_context = &newxprt->sc_xprt;
>  	qp_attr.cap.max_send_wr = newxprt->sc_sq_depth;
> +	qp_attr.cap.max_send_wr += RPCRDMA_MAX_BC_REQUESTS;
>  	qp_attr.cap.max_recv_wr = newxprt->sc_max_requests;
> +	qp_attr.cap.max_recv_wr += RPCRDMA_MAX_BC_REQUESTS;
>  	qp_attr.cap.max_send_sge = newxprt->sc_max_sge;
>  	qp_attr.cap.max_recv_sge = newxprt->sc_max_sge;
>  	qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR;
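The sizing the patch performs can be modeled in userspace like this, using the patch's own constants. This is a sketch of the arithmetic only: reserve extra send/recv WRs for backchannel traffic on top of the forward-channel maximum, clamped to what the device reports it supports.

```c
#include <assert.h>
#include <stddef.h>

#define RPCRDMA_SQ_DEPTH_MULT	8
#define RPCRDMA_MAX_BC_REQUESTS	8

struct qp_caps {
	size_t max_requests;
	size_t sq_depth;
	size_t max_send_wr;
	size_t max_recv_wr;
};

/* Mirror of the two hunks above: pad the request count for the
 * backchannel, clamp to devattr.max_qp_wr, then add the backchannel
 * reserve to both QP work-request limits.
 */
static struct qp_caps size_qp(size_t dev_max_qp_wr, size_t fwd_max_requests)
{
	struct qp_caps c;
	size_t want = fwd_max_requests + RPCRDMA_MAX_BC_REQUESTS;

	c.max_requests = want < dev_max_qp_wr ? want : dev_max_qp_wr;
	c.sq_depth = RPCRDMA_SQ_DEPTH_MULT * c.max_requests;
	c.max_send_wr = c.sq_depth + RPCRDMA_MAX_BC_REQUESTS;
	c.max_recv_wr = c.max_requests + RPCRDMA_MAX_BC_REQUESTS;
	return c;
}
```

Note this also illustrates the "XXX" comment in the hunk: if `dev_max_qp_wr` is small, the clamp silently eats the backchannel reserve.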
Re: [PATCH v1 2/8] svcrdma: Define maximum number of backchannel requests
On 11/23/2015 8:09 PM, Chuck Lever wrote:
>> On Nov 23, 2015, at 7:39 PM, Tom Talpey <t...@talpey.com> wrote:
>> On 11/23/2015 5:20 PM, Chuck Lever wrote:
>>> Extra resources for handling backchannel requests have to be
>>> pre-allocated when a transport instance is created. Set a limit.

(snip)

>>> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
>>
>> Why is this a config option? Why wouldn't you always want this? It's
>> needed for any post-1990 NFS dialect.
>
> I think some distros want to be able to compile out NFSv4.x on small
> systems, and take all the backchannel cruft with it.

So shouldn't it follow the NFSv4.x config options then?

>>> +#define RPCRDMA_MAX_BC_REQUESTS	8
>>
>> Why a constant 8? The forward channel value is apparently
>> configurable, just a few lines down.
>
> The client side backward direction credit limit, now in 4.4, is
> already a constant. The client side ULP uses a constant for the slot
> table size: NFS4_MAX_BACK_CHANNEL_OPS. I'm not 100% sure but the
> server seems to just echo that number back to the client.
>
> I'd rather not add an admin knob for this. Why would it be necessary?

Because no constant is ever correct. Why isn't it "1"? Do you allow
multiple credits? Why not that value? For instance.

(snip)

--
Chuck Lever
Re: [PATCH v1 6/8] xprtrdma: Add class for RDMA backwards direction transport
On 11/23/2015 5:21 PM, Chuck Lever wrote:
> To support the server-side of an NFSv4.1 backchannel on RDMA
> connections, add a transport class for backwards direction
> operation.

So, what's special here is that it re-uses an existing forward
channel's connection? If not, it would seem unnecessary to define a
new type/transport semantic. Say this?

> Signed-off-by: Chuck Lever
> ---
>  include/linux/sunrpc/xprt.h              |    1
>  net/sunrpc/xprt.c                        |    1
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   14 +-
>  net/sunrpc/xprtrdma/transport.c          |  243 ++
>  net/sunrpc/xprtrdma/xprt_rdma.h          |    2
>  5 files changed, 256 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> index 69ef5b3..7637ccd 100644
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -85,6 +85,7 @@ struct rpc_rqst {
>  	__u32 *			rq_buffer;	/* XDR encode buffer */
>  	size_t			rq_callsize,
>  				rq_rcvsize;
> +	void *			rq_privdata;	/* xprt-specific per-rqst data */
>  	size_t			rq_xmit_bytes_sent;	/* total bytes sent */
>  	size_t			rq_reply_bytes_recvd;	/* total reply bytes */
>  						/* received */
>
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index 2e98f4a..37edea6 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -1425,3 +1425,4 @@ void xprt_put(struct rpc_xprt *xprt)
>  	if (atomic_dec_and_test(&xprt->count))
>  		xprt_destroy(xprt);
>  }
> +EXPORT_SYMBOL_GPL(xprt_put);
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index 58ec362..3768a7f 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -1182,12 +1182,14 @@ static void __svc_rdma_free(struct work_struct *work)
>  {
>  	struct svcxprt_rdma *rdma =
>  		container_of(work, struct svcxprt_rdma, sc_work);
> -	dprintk("svcrdma: svc_rdma_free(%p)\n", rdma);
> +	struct svc_xprt *xprt = &rdma->sc_xprt;
> +
> +	dprintk("svcrdma: %s(%p)\n", __func__, rdma);
>
>  	/* We should only be called from kref_put */
> -	if (atomic_read(&rdma->sc_xprt.xpt_ref.refcount) != 0)
> +	if (atomic_read(&xprt->xpt_ref.refcount) != 0)
>  		pr_err("svcrdma: sc_xprt still in use? (%d)\n",
> -		       atomic_read(&rdma->sc_xprt.xpt_ref.refcount));
> +		       atomic_read(&xprt->xpt_ref.refcount));
>
>  	/*
>  	 * Destroy queued, but not processed read completions. Note
> @@ -1222,6 +1224,12 @@ static void __svc_rdma_free(struct work_struct *work)
>  		pr_err("svcrdma: dma still in use? (%d)\n",
>  		       atomic_read(&rdma->sc_dma_used));
>
> +	/* Final put of backchannel client transport */
> +	if (xprt->xpt_bc_xprt) {
> +		xprt_put(xprt->xpt_bc_xprt);
> +		xprt->xpt_bc_xprt = NULL;
> +	}
> +
>  	/* De-allocate fastreg mr */
>  	rdma_dealloc_frmr_q(rdma);
>
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index 8c545f7..fda7488 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -51,6 +51,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #include "xprt_rdma.h"
>
> @@ -148,7 +149,10 @@ static struct ctl_table sunrpc_table[] = {
>  #define RPCRDMA_MAX_REEST_TO	(30U * HZ)
>  #define RPCRDMA_IDLE_DISC_TO	(5U * 60 * HZ)
>
> -static struct rpc_xprt_ops xprt_rdma_procs;	/* forward reference */
> +static struct rpc_xprt_ops xprt_rdma_procs;
> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
> +static struct rpc_xprt_ops xprt_rdma_bc_procs;
> +#endif
>
>  static void
>  xprt_rdma_format_addresses4(struct rpc_xprt *xprt, struct sockaddr *sap)
> @@ -499,7 +503,7 @@ xprt_rdma_allocate(struct rpc_task *task, size_t size)
>  	if (req == NULL)
>  		return NULL;
>
> -	flags = GFP_NOIO | __GFP_NOWARN;
> +	flags = RPCRDMA_DEF_GFP;
>  	if (RPC_IS_SWAPPER(task))
>  		flags = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
>
> @@ -684,6 +688,199 @@ xprt_rdma_disable_swap(struct rpc_xprt *xprt)
>  {
>  }
>
> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
> +
> +/* Server-side transport endpoint wants a whole page for its send
> + * buffer. The client RPC code constructs the RPC header in this
> + * buffer before it invokes ->send_request.
> + */
> +static void *
> +xprt_rdma_bc_allocate(struct rpc_task *task, size_t size)
> +{
> +	struct rpc_rqst *rqst = task->tk_rqstp;
> +	struct svc_rdma_op_ctxt *ctxt;
> +	struct svcxprt_rdma *rdma;
> +	struct svc_xprt *sxprt;
> +	struct page *page;
> +
> +	if (size > PAGE_SIZE) {
> +		WARN_ONCE(1, "failed to handle buffer allocation (size %zu)\n",
> +			  size);
> +		return NULL;
> +	}
> +
> +	page = alloc_page(RPCRDMA_DEF_GFP);
> +	if
Re: [PATCH v1 1/9] xprtrdma: Add a safety margin for receive buffers
On 11/23/2015 5:13 PM, Chuck Lever wrote:
> Rarely, senders post a Send that is larger than the client's inline
> threshold. That can be due to a bug, or the client and server may
> not have communicated about their inline limits. RPC-over-RDMA
> currently doesn't specify any particular limit on inline size, so
> peers have to guess what it is.
>
> It is fatal to the connection if the size of a Send is larger than
> the client's receive buffer. The sender is likely to retry with the
> same message size, so the workload is stuck at that point.
>
> Follow Postel's robustness principle: Be conservative in what you
> do, be liberal in what you accept from others. Increase the size of
> client receive buffers by a safety margin, and add a warning when
> the inline threshold is exceeded during receive.

Safety is good, but how do we know the chosen value is enough? Isn't
it better to fail the badly-composed request and be done with it?
Even if the stupid sender loops, which it will do anyway.

> Note the Linux server's receive buffers are already page-sized.
> Signed-off-by: Chuck Lever
> ---
>  net/sunrpc/xprtrdma/rpc_rdma.c  |    7 +++++++
>  net/sunrpc/xprtrdma/verbs.c     |    6 +++++-
>  net/sunrpc/xprtrdma/xprt_rdma.h |    5 +++++
>  3 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index c10d969..a169252 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -776,6 +776,7 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>  	int rdmalen, status;
>  	unsigned long cwnd;
>  	u32 credits;
> +	RPC_IFDEBUG(struct rpcrdma_create_data_internal *cdata);
>
>  	dprintk("RPC:       %s: incoming rep %p\n", __func__, rep);
>
> @@ -783,6 +784,12 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>  		goto out_badstatus;
>  	if (rep->rr_len < RPCRDMA_HDRLEN_MIN)
>  		goto out_shortreply;
> +#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
> +	cdata = &r_xprt->rx_data;
> +	if (rep->rr_len > cdata->inline_rsize)
> +		pr_warn("RPC: %u byte reply exceeds inline threshold\n",
> +			rep->rr_len);
> +#endif
>
>  	headerp = rdmab_to_msg(rep->rr_rdmabuf);
>  	if (headerp->rm_vers != rpcrdma_version)
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index eadd1655..e3f12e2 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -924,7 +924,11 @@ rpcrdma_create_rep(struct rpcrdma_xprt *r_xprt)
>  	if (rep == NULL)
>  		goto out;
>
> -	rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize,
> +	/* The actual size of our receive buffers is increased slightly
> +	 * to prevent small receive overruns from killing our connection.
> +	 */
> +	rep->rr_rdmabuf = rpcrdma_alloc_regbuf(ia, cdata->inline_rsize +
> +						   RPCRDMA_RECV_MARGIN,
>  					       GFP_KERNEL);
>  	if (IS_ERR(rep->rr_rdmabuf)) {
>  		rc = PTR_ERR(rep->rr_rdmabuf);
>
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index ac7f8d4..1b72ab1 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -337,6 +337,11 @@ struct rpcrdma_create_data_internal {
>  #define RPCRDMA_INLINE_PAD_VALUE(rq)\
>  	rpcx_to_rdmad(rq->rq_xprt).padding
>
> +/* To help prevent spurious connection shutdown, allow senders to
> + * overrun our receive inline threshold by a small bit.
> + */
> +#define RPCRDMA_RECV_MARGIN	(128)
> +
>  /*
>   * Statistics for RPCRDMA
>   */
Re: [PATCH v1 1/9] xprtrdma: Add a safety margin for receive buffers
On 11/23/2015 8:16 PM, Chuck Lever wrote:
>> On Nov 23, 2015, at 7:55 PM, Tom Talpey <t...@talpey.com> wrote:
>> On 11/23/2015 5:13 PM, Chuck Lever wrote:

(snip)

>> Safety is good, but how do we know the chosen value is enough? Isn't
>> it better to fail the badly-composed request and be done with it?
>> Even if the stupid sender loops, which it will do anyway.
>
> It's good enough to compensate for the most common sender bug, which
> is that the sender did not account for the 28 bytes of the
> RPC-over-RDMA header when it built the send buffer. The additional
> 100 byte margin is gravy.

I think it's good to have sympathy and resilience to differing designs
on the other end of the wire, but I fail to have it for stupid bugs.
Unless this can take down the receiver, fail it fast. MHO.

> The loop occurs because the server gets a completion error. The
> client just sees a connection loss. There's no way for it to know it
> should fail the RPC, so it keeps trying.

Perhaps the server could remember that the reply failed, and when the
client retransmits, it can simply return that XID with an RDMA_ERROR.

> Note the Linux server's receive buffers are already page-sized.
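The margin policy debated in this thread can be modeled in a few lines. This is a sketch of the patch's arithmetic only (the helper names are made up): allocate the advertised inline size plus the margin, so a receive that modestly exceeds the threshold, such as a sender that forgot the 28-byte RPC-over-RDMA header, produces a warning instead of a fatal overrun.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RPCRDMA_RECV_MARGIN	128

/* Actual receive buffer size: the advertised threshold plus slack. */
static size_t recv_buf_size(size_t inline_rsize)
{
	return inline_rsize + RPCRDMA_RECV_MARGIN;
}

/* Whether a received message of rr_len bytes survives: anything that
 * fits in the padded buffer is accepted (possibly with a warning if
 * it exceeds inline_rsize); anything larger still kills the
 * connection.
 */
static bool recv_fits(size_t rr_len, size_t inline_rsize)
{
	return rr_len <= recv_buf_size(inline_rsize);
}
```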
Re: [PATCH 2/9] IB: add a proper completion queue abstraction
On 11/23/2015 7:00 PM, Jason Gunthorpe wrote:
> On Mon, Nov 23, 2015 at 03:30:42PM -0800, Caitlin Bestler wrote:
>> The receive completion can be safely assumed to indicate transmit
>> completion over a reliable connection unless your peer has gone
>> completely bonkers and is replying to a command that it did not
>> receive. Perhaps iWarp is different and does specify this ordering
>> but IB does not.

iWARP is not different. The situation you (Jason) describe has nothing
to do with the transport. It has everything to do with, as you point
out, the lack of causality between send and receive completions.

It is entirely possible for the reply to be received before the send
is fully processed. For example, the send might be issued on one core,
and that core scheduled away before the completion for the send is
ready. In the meantime, the request goes on the wire, the target
processes it and replies, and the reply is processed. Boom, the send
queue completion is still pending. Been there, seen that. Bluescreened
on it, mysteriously.

A really good way to see this is with software providers, btw. Try it
with soft{roce,iwarp}, under heavy load.

Tom.

> The issue with IB is how the ACK protocol is designed. There is not
> strong ordering between ACKs and data transfers. A HCA can send
> ACK,DATA and the network could drop the ACK. The receiver side does
> not know the ACK was lost and goes ahead to process DATA.
>
> Since only ACK advances the sendq and DATA advances the recvq it is
> trivial to get a case where the recvq is advanced with a reply while
> the sendq continues to wait for the ACK to be resent.
>
> Further IB allows ACK coalescing and has no rules for how an ACK is
> placed. It is entirely valid for a HCA to RECV,REPLY,ACK - for
> instance.
>
>> I actually had a bug in an early iWARP emulation where the simulated
>> peer, because it was simulated, responded instantly. The result was
>> a TCP segment that both acked the transmission *and* contained the
>> reply. The bug was that the code processed the reception before the
>> transmission ack, causing the receive completion to be placed on the
>> completion queue before transmit completion.
>
> I don't know if iWARP has the same lax ordering as IB, but certainly,
> what you describe is legal for IB verbs to do, and our kernel ULPs
> have to cope with it.
>
> Jason
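The discipline this thread converges on can be stated as a tiny invariant. The sketch below is illustrative (the names are made up, not a ULP's actual API): an RPC slot, and in particular its send buffer, may be retired only after BOTH the send completion and the reply have been observed, in either order.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-request completion state; both flags may be set from different
 * completion handlers, in either order.
 */
struct rpc_slot {
	bool send_done;		/* send CQE seen */
	bool reply_done;	/* receive CQE for the reply seen */
};

/* The buffer may be reused only once both events have been observed,
 * regardless of which arrived first.
 */
static bool slot_can_retire(const struct rpc_slot *s)
{
	return s->send_done && s->reply_done;
}
```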
Re: [PATCH v1 8/9] xprtrdma: Invalidate in the RPC reply handler
On 11/23/2015 5:14 PM, Chuck Lever wrote: There is a window between the time the RPC reply handler wakes the waiting RPC task and when xprt_release() invokes ops->buf_free. During this time, memory regions containing the data payload may still be accessed by a broken or malicious server, but the RPC application has already been allowed access to the memory containing the RPC request's data payloads. The server should be fenced from client memory containing RPC data payloads _before_ the RPC application is allowed to continue. No concern, just Hurray for implementing this fundamental integrity requirement. It's even more important when krb5i is in play, to avoid truly malicious injection-after-integrity. This change also more strongly enforces send queue accounting. There is a maximum number of RPC calls allowed to be outstanding. When an RPC/RDMA transport is set up, just enough send queue resources are allocated to handle registration, Send, and invalidation WRs for each of those RPCs at the same time. Before, additional RPC calls could be dispatched while invalidation WRs were still consuming send WQEs. When invalidation WRs backed up, dispatching additional RPCs resulted in a send queue overrun. Now, the reply handler prevents RPC dispatch until invalidation is complete. This prevents RPC call dispatch until there are enough send queue resources to proceed. Still to do: If an RPC exits early (say, ^C), the reply handler has no opportunity to perform invalidation. Currently, xprt_rdma_free() still frees remaining RDMA resources, which could deadlock. Additional changes are needed to handle invalidation properly in this case.
Reported-by: Jason Gunthorpe Signed-off-by: Chuck Lever --- net/sunrpc/xprtrdma/rpc_rdma.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c index d7b9156..4ad72ca 100644 --- a/net/sunrpc/xprtrdma/rpc_rdma.c +++ b/net/sunrpc/xprtrdma/rpc_rdma.c @@ -898,6 +898,16 @@ badheader: break; } + /* Invalidate and flush the data payloads before waking the +* waiting application. This guarantees the memory region is +* properly fenced from the server before the application +* accesses the data. It also ensures proper send flow +* control: waking the next RPC waits until this RPC has +* relinquished all its Send Queue entries. +*/ + if (req->rl_nchunks) + r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req); + credits = be32_to_cpu(headerp->rm_credit); if (credits == 0) credits = 1;/* don't deadlock */
Re: [PATCH v1 5/8] svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies
On 11/23/2015 5:20 PM, Chuck Lever wrote: To support the NFSv4.1 backchannel on RDMA connections, add a capability for receiving an RPC/RDMA reply on a connection established by a client. Signed-off-by: Chuck Lever --- net/sunrpc/xprtrdma/rpc_rdma.c | 76 +++ net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 60 net/sunrpc/xprtrdma/xprt_rdma.h | 4 ++ 3 files changed, 140 insertions(+) diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c index c10d969..fef0623 100644 --- a/net/sunrpc/xprtrdma/rpc_rdma.c +++ b/net/sunrpc/xprtrdma/rpc_rdma.c @@ -946,3 +946,79 @@ repost: if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, &r_xprt->rx_ep, rep)) rpcrdma_recv_buffer_put(rep); } + +#if defined(CONFIG_SUNRPC_BACKCHANNEL) + +int +rpcrdma_handle_bc_reply(struct rpc_xprt *xprt, struct rpcrdma_msg *rmsgp, + struct xdr_buf *rcvbuf) +{ + struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt); + struct kvec *dst, *src = &rcvbuf->head[0]; + struct rpc_rqst *req; + unsigned long cwnd; + u32 credits; + size_t len; + __be32 xid; + __be32 *p; + int ret; + + p = (__be32 *)src->iov_base; + len = src->iov_len; + xid = rmsgp->rm_xid; + + pr_info("%s: xid=%08x, length=%zu\n", + __func__, be32_to_cpu(xid), len); + pr_info("%s: RPC/RDMA: %*ph\n", + __func__, (int)RPCRDMA_HDRLEN_MIN, rmsgp); + pr_info("%s: RPC: %*ph\n", + __func__, (int)len, p); + + ret = -EAGAIN; + if (src->iov_len < 24) + goto out_shortreply; + + spin_lock_bh(&xprt->transport_lock); + req = xprt_lookup_rqst(xprt, xid); + if (!req) + goto out_notfound; + + dst = &req->rq_private_buf.head[0]; + memcpy(&req->rq_private_buf, &req->rq_rcv_buf, sizeof(struct xdr_buf)); + if (dst->iov_len < len) + goto out_unlock; + memcpy(dst->iov_base, p, len); + + credits = be32_to_cpu(rmsgp->rm_credit); + if (credits == 0) + credits = 1;/* don't deadlock */ + else if (credits > r_xprt->rx_buf.rb_bc_max_requests) + credits = r_xprt->rx_buf.rb_bc_max_requests; + + cwnd = xprt->cwnd; + xprt->cwnd = credits << RPC_CWNDSHIFT; + if (xprt->cwnd > cwnd) + 
xprt_release_rqst_cong(req->rq_task); + + ret = 0; + xprt_complete_rqst(req->rq_task, rcvbuf->len); + rcvbuf->len = 0; + +out_unlock: + spin_unlock_bh(&xprt->transport_lock); +out: + return ret; + +out_shortreply: + pr_info("svcrdma: short bc reply: xprt=%p, len=%zu\n", + xprt, src->iov_len); + goto out; + +out_notfound: + pr_info("svcrdma: unrecognized bc reply: xprt=%p, xid=%08x\n", + xprt, be32_to_cpu(xid)); + + goto out_unlock; +} + +#endif /* CONFIG_SUNRPC_BACKCHANNEL */ diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c index ff4f01e..2b762b5 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c @@ -47,6 +47,7 @@ #include #include #include +#include "xprt_rdma.h" #define RPCDBG_FACILITY RPCDBG_SVCXPRT @@ -567,6 +568,42 @@ static int rdma_read_complete(struct svc_rqst *rqstp, return ret; } +#if defined(CONFIG_SUNRPC_BACKCHANNEL) + +/* By convention, backchannel calls arrive via rdma_msg type + * messages, and never populate the chunk lists. This makes + * the RPC/RDMA header small and fixed in size, so it is + * straightforward to check the RPC header's direction field. 
+ + /* sanity */ + if (p[7] != rmsgp->rm_xid) + return false; + /* call direction */ + if (p[8] == cpu_to_be32(RPC_CALL)) + return false; + + return true; +} + +#endif /* CONFIG_SUNRPC_BACKCHANNEL */ + /* * Set up the rqstp thread context to point to the RQ buffer. If * necessary, pull additional data from the client with an RDMA_READ @@ -632,6 +669,17 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
Re: [PATCH v1 8/8] svcrdma: Remove svc_rdma_fastreg_mr::access_flags field
On 11/23/2015 5:21 PM, Chuck Lever wrote: Clean up: The access_flags field is not used outside of rdma_read_chunk_frmr() and is always set to the same value. Signed-off-by: Chuck Lever--- include/linux/sunrpc/svc_rdma.h |1 - net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |3 +-- 2 files changed, 1 insertion(+), 3 deletions(-) diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h index 243edf4..eee2a0d 100644 --- a/include/linux/sunrpc/svc_rdma.h +++ b/include/linux/sunrpc/svc_rdma.h @@ -107,7 +107,6 @@ struct svc_rdma_fastreg_mr { struct ib_mr *mr; struct scatterlist *sg; int sg_nents; - unsigned long access_flags; enum dma_data_direction direction; struct list_head frmr_list; }; diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c index 9480043..8ab1ab5 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c @@ -240,7 +240,6 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt, read = min_t(int, (nents << PAGE_SHIFT) - *page_offset, rs_length); frmr->direction = DMA_FROM_DEVICE; - frmr->access_flags = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE); frmr->sg_nents = nents; for (pno = 0; pno < nents; pno++) { @@ -308,7 +307,7 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt, reg_wr.wr.num_sge = 0; reg_wr.mr = frmr->mr; reg_wr.key = frmr->mr->lkey; - reg_wr.access = frmr->access_flags; + reg_wr.access = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE); Wait, the REMOTE_WRITE is there to support iWARP, but it isn't needed for IB or RoCE. Shouldn't this be updated to peek at those new attributes to decide, instead of remaining unconditional? 
reg_wr.wr.next = &read_wr.wr; /* Prepare RDMA_READ */
Re: [PATCH v1 5/9] xprtrdma: Add ro_unmap_sync method for FMR
On 11/23/2015 5:14 PM, Chuck Lever wrote: FMR's ro_unmap method is already synchronous because ib_unmap_fmr() is a synchronous verb. However, some improvements can be made here. I thought FMR support was about to be removed in the core. 1. Gather all the MRs for the RPC request onto a list, and invoke ib_unmap_fmr() once with that list. This reduces the number of doorbells when there is more than one MR to invalidate 2. Perform the DMA unmap _after_ the MRs are unmapped, not before. This is critical after invalidating a Write chunk. Signed-off-by: Chuck Lever--- net/sunrpc/xprtrdma/fmr_ops.c | 64 + 1 file changed, 64 insertions(+) diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c index f1e8daf..c14f3a4 100644 --- a/net/sunrpc/xprtrdma/fmr_ops.c +++ b/net/sunrpc/xprtrdma/fmr_ops.c @@ -179,6 +179,69 @@ out_maperr: return rc; } +static void +__fmr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg) +{ + struct ib_device *device = r_xprt->rx_ia.ri_device; + struct rpcrdma_mw *mw = seg->rl_mw; + int nsegs = seg->mr_nsegs; + + seg->rl_mw = NULL; + + while (nsegs--) + rpcrdma_unmap_one(device, seg++); + + rpcrdma_put_mw(r_xprt, mw); +} + +/* Invalidate all memory regions that were registered for "req". + * + * Sleeps until it is safe for the host CPU to access the + * previously mapped memory regions. + */ +static void +fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req) +{ + struct rpcrdma_mr_seg *seg; + unsigned int i, nchunks; + struct rpcrdma_mw *mw; + LIST_HEAD(unmap_list); + int rc; + + dprintk("RPC: %s: req %p\n", __func__, req); + + /* ORDER: Invalidate all of the req's MRs first +* +* ib_unmap_fmr() is slow, so use a single call instead +* of one call per mapped MR. 
+*/ + for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) { + seg = >rl_segments[i]; + mw = seg->rl_mw; + + list_add(>r.fmr.fmr->list, _list); + + i += seg->mr_nsegs; + } + rc = ib_unmap_fmr(_list); + if (rc) + pr_warn("%s: ib_unmap_fmr failed (%i)\n", __func__, rc); + + /* ORDER: Now DMA unmap all of the req's MRs, and return +* them to the free MW list. +*/ + for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) { + seg = >rl_segments[i]; + + __fmr_dma_unmap(r_xprt, seg); + + i += seg->mr_nsegs; + seg->mr_nsegs = 0; + } + + req->rl_nchunks = 0; +} + /* Use the ib_unmap_fmr() verb to prevent further remote * access via RDMA READ or RDMA WRITE. */ @@ -231,6 +294,7 @@ fmr_op_destroy(struct rpcrdma_buffer *buf) const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = { .ro_map = fmr_op_map, + .ro_unmap_sync = fmr_op_unmap_sync, .ro_unmap = fmr_op_unmap, .ro_open= fmr_op_open, .ro_maxpages= fmr_op_maxpages, -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 5/8] svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies
I'll have to think about whether I agree with that as a protocol statement. Chunks in a reply are there to account for the data that is handled in the chunk of a request. So it kind of comes down to whether RDMA is allowed (or used) on the backchannel. I still think that is fundamentally an upper layer question, not an RPC one. On 11/23/2015 8:47 PM, Chuck Lever wrote: On Nov 23, 2015, at 7:44 PM, Tom Talpey <t...@talpey.com> wrote: On 11/23/2015 5:20 PM, Chuck Lever wrote: To support the NFSv4.1 backchannel on RDMA connections, add a capability for receiving an RPC/RDMA reply on a connection established by a client. (snip) The above assertion is only true for the NFS behavior as spec'd today (no chunk-bearing bulk data on existing backchannel NFS protocol messages). That at least deserves a comment. Or, why not simply ignore the chunks? They're not the receiver's problem.
Re: [PATCH v1 2/8] svcrdma: Define maximum number of backchannel requests
On 11/23/2015 8:36 PM, Chuck Lever wrote: On Nov 23, 2015, at 8:19 PM, Tom Talpey <t...@talpey.com> wrote: On 11/23/2015 8:09 PM, Chuck Lever wrote: On Nov 23, 2015, at 7:39 PM, Tom Talpey <t...@talpey.com> wrote: On 11/23/2015 5:20 PM, Chuck Lever wrote: Extra resources for handling backchannel requests have to be pre-allocated when a transport instance is created. Set a limit. Signed-off-by: Chuck Lever <chuck.le...@oracle.com> --- include/linux/sunrpc/svc_rdma.h |5 + net/sunrpc/xprtrdma/svc_rdma_transport.c |6 +- 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h index f869807..478aa30 100644 --- a/include/linux/sunrpc/svc_rdma.h +++ b/include/linux/sunrpc/svc_rdma.h @@ -178,6 +178,11 @@ struct svcxprt_rdma { #define RPCRDMA_SQ_DEPTH_MULT 8 #define RPCRDMA_MAX_REQUESTS32 #define RPCRDMA_MAX_REQ_SIZE4096 +#if defined(CONFIG_SUNRPC_BACKCHANNEL) Why is this a config option? Why wouldn't you always want this? It's needed for any post-1990 NFS dialect. I think some distros want to be able to compile out NFSv4.x on small systems, and take all the backchannel cruft with it. So shouldn't it follow the NFSv4.x config options then? Setting CONFIG_NFS_V4_1 sets CONFIG_SUNRPC_BACKCHANNEL. Adding #ifdef CONFIG_NFS_V4_1 in net/sunrpc would be a layering violation. Ok, I guess. It seems a fairly small and abstruse detail to surface as a config option. But if it's already there, sure, use it. I see however that CONFIG_SUNRPC_BACKCHANNEL controls only the client backchannel capability. Perhaps it is out of place to use it to enable the server’s backchannel capability. Ok, now it's even smaller and more abstruse. :-) To say nothing of multiplying. ;-) +#define RPCRDMA_MAX_BC_REQUESTS8 Why a constant 8? The forward channel value is apparently configurable, just a few lines down. The client side backward direction credit limit, now in 4.4, is already a constant. 
The client side ULP uses a constant for the slot table size: NFS4_MAX_BACK_CHANNEL_OPS. I’m not 100% sure but the server seems to just echo that number back to the client. I’d rather not add an admin knob for this. Why would it be necessary? Because no constant is ever correct. Why isn't it "1"? Do you allow multiple credits? Why not that value? For instance. There’s no justification for the forward channel credit limit either. The code in Linux assumes one session slot in the NFSv4.1 backchannel. When we get around to it, this can be made more flexible. Ok so again, why choose "8" here? It’s much easier to add flexibility and admin control later than it is to take it away when the knob becomes useless or badly designed. For now, 8 works, and it doesn’t have to be permanent. I could add a comment that says /* Arbitrary: support up to eight backward credits. */ +#else +#define RPCRDMA_MAX_BC_REQUESTS0 +#endif #define RPCSVC_MAXPAYLOAD_RDMARPCSVC_MAXPAYLOAD diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index b348b4a..01c7b36 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -923,8 +923,10 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt) (size_t)RPCSVC_MAXPAGES); newxprt->sc_max_sge_rd = min_t(size_t, devattr.max_sge_rd, RPCSVC_MAXPAGES); + /* XXX: what if HCA can't support enough WRs for bc operation? 
*/ newxprt->sc_max_requests = min((size_t)devattr.max_qp_wr, - (size_t)svcrdma_max_requests); + (size_t)(svcrdma_max_requests + + RPCRDMA_MAX_BC_REQUESTS)); newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_max_requests; /* @@ -964,7 +966,9 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt) qp_attr.event_handler = qp_event_handler; qp_attr.qp_context = >sc_xprt; qp_attr.cap.max_send_wr = newxprt->sc_sq_depth; + qp_attr.cap.max_send_wr += RPCRDMA_MAX_BC_REQUESTS; qp_attr.cap.max_recv_wr = newxprt->sc_max_requests; + qp_attr.cap.max_recv_wr += RPCRDMA_MAX_BC_REQUESTS; qp_attr.cap.max_send_sge = newxprt->sc_max_sge; qp_attr.cap.max_recv_sge = newxprt->sc_max_sge; qp_attr.sq_sig_type = IB_SIGNAL_REQ_WR; -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger
Re: [PATCH libmlx5 0/7] Completion timestamping
On 11/15/2015 7:30 AM, Matan Barak wrote: This series adds support for completion timestamp. In order to support this feature, several extended verbs were implemented (as instructed in libibverbs). Can you describe what these timestamps are actually for? It's not clear at all from the comments. I'm assuming they are for some sort of fine-grained statistics? Are they purely for userspace consumers?
Re: [PATCH RFC 2/3] svcrdma: Use device rdma_read_access_flags
On 11/10/2015 4:17 PM, Jason Gunthorpe wrote: On Tue, Nov 10, 2015 at 04:00:43PM -0500, Tom Talpey wrote: Hmm, agreed, but it must still be acceptable to perform a registration instead of using the local_dma_lkey. As mentioned earlier, there are scatter limits and other implications for all-physical addressing that an upper layer may choose to avoid. It is always acceptable to use a lkey MR instead of the local dma lkey, but ULPs should prefer to use the local dma lkey if possible, for performance reasons. Sure, the key words are "if possible". For example, a single 1MB RDMA Read is unlikely to be possible with the dma lkey, assuming worst-case physical page scatter it would need 256 SGEs. But 1MB can be registered easily. In any case, my point was that the ULP gets to choose, so, it looks like we agree. Tom.
Re: [PATCH RFC 2/3] svcrdma: Use device rdma_read_access_flags
On 11/10/2015 1:25 PM, Jason Gunthorpe wrote: On Tue, Nov 10, 2015 at 04:04:32AM -0800, Christoph Hellwig wrote: On Tue, Nov 10, 2015 at 01:46:40PM +0200, Sagi Grimberg wrote: On 10/11/2015 13:41, Christoph Hellwig wrote: Oh, and while we're at it. Can someone explain why we're even using rdma_read_chunk_frmr for IB? It seems to work around the fact that iWARP only allows a single RDMA READ SGE, but it's used whenever the device has IB_DEVICE_MEM_MGT_EXTENSIONS, which seems wrong. I think Steve can answer it better than I can. I think that it is just to have a single code path for both IB and iWARP. I agree that the condition seems wrong and for small transfers rdma_read_chunk_frmr is really a performance loss. Well, the code path already exists, but is only used if IB_DEVICE_MEM_MGT_EXTENSIONS isn't set. Below is an untested patch that demonstrates how I think svcrdma should set up the reads. Note that this also allows us to entirely remove its allphys MR. Note that as a follow-on this would also allow to remove the non-READ_W_INV code path from rdma_read_chunk_frmr as a future step. I like this, my only comment is we should have an rdma_cap for this behavior, rdma_cap_needs_rdma_read_mr(pd) or something? Windows NDKPI has this, it's the oh-so-succinctly-named flag NDK_ADAPTER_FLAG_RDMA_READ_SINK_NOT_REQUIRED. The ULP is free to ignore it and pass the NDK_MR_FLAG_RDMA_READ_SINK flag anyway, the provider is expected to ignore it if not needed. + if (rdma_protocol_iwarp(dev, newxprt->sc_cm_id->port_num)) { Use here + /* +* iWARP requires remote write access for the data sink, and +* only supports a single SGE for RDMA_READ requests, so we'll +* have to use a memory registration for each RDMA_READ. +*/ + if (!(dev->device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS)) { Let's enforce this in the core: if rdma_cap_needs_rdma_read_mr is set, the device must also set IB_DEVICE_MEM_MGT_EXTENSIONS, check at device creation time.
+ } else if (rdma_ib_or_roce(dev, newxprt->sc_cm_id->port_num)) { + /* +* For IB or RoCE life is easy, no unsafe write access is +* required and multiple SGEs are supported, so we don't need +* to use MRs. +*/ + newxprt->sc_reader = rdma_read_chunk_lcl; + } else { + /* +* Neither iWarp nor IB-ish, we're out of luck. +*/ goto errout; No need for the else, !rdma_cap_needs_rdma_read_mr means pd->local_dma_lkey is okay to use. Hmm, agreed, but it must still be acceptable to perform a registration instead of using the local_dma_lkey. As mentioned earlier, there are scatter limits and other implications for all-physical addressing that an upper layer may choose to avoid. Tom. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 1/3] IB/core: Expose a device attribute for rdma_read access flags
On 11/10/2015 6:51 AM, Yann Droneaud wrote: On Tuesday, November 10, 2015 at 12:44 +0200, Sagi Grimberg wrote: Also, set rdma_read_access_flags in the relevant device drivers: mlx4, mlx5, qib, ocrdma, nes: IB_ACCESS_LOCAL_WRITE cxgb3, cxgb4, hfi: IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE Why were those hw providers not modified to enforce IB_ACCESS_REMOTE_WRITE when needed, instead of asking users to set it for them? Because the Linux verbs don't tell the driver whether the registration is for an RDMA Read sink buffer. Sagi, the Windows NDKPI has an NDK_MR_FLAG_RDMA_READ_SINK attribute which the upper layer can use to convey this information, I've mentioned it here before. https://msdn.microsoft.com/en-us/library/windows/hardware/hh439908(v=vs.85).aspx When this approach is used, the upper layer doesn't have to be aware at all of the lower layer's details. Tom.
Re: RDMA/CM and multiple QPs
On 9/6/2015 11:15 AM, Bart Van Assche wrote: On 09/06/15 00:50, Christoph Hellwig wrote: Note that the SRP driver already in tree is a good example for this, although it doesn't use RDMA/CM and thus already operates on a per-ib_device level. The challenges with regard to adding RDMA/CM support to the SRP initiator and target drivers are: - IANA has not yet assigned a port number to the SRP protocol (see e.g. http://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml). IANA doesn't do this automatically. Has anyone made the request? You might want to think through why it needs a dedicated port number though. iSER reuses the iSCSI port, by negotiating RDMA during login. - The login request (struct srp_login_req) is too large for the RDMA/CM. A format for the login parameters for the RDMA/CM has not yet been standardized. Are you suggesting that RDMA/CM perform the login? That seems like a layering issue. Tom. Bart.
Re: [PATCH for-next 00/10] Add RoCE support to the mlx5 driver
On 8/25/2015 4:35 AM, Achiad Shochat wrote: On 8/24/2015 11:40 PM, Tom Talpey wrote: On 8/20/2015 12:46 PM, Achiad Shochat wrote: Hi Doug, This patchset adds RoCE V1 and RoCE V2 support to the mlx5 device driver. Question - assuming I read them correctly, these patches add the RoCE v1 and RoCE v2 support on a per-port basis. That is, a port can be either IB, RoCE v1 or RoCE v2, but not a combination. Has any thought been put toward supporting these protocols on a per-QP basis, i.e. the caller of rdma_connect() can specify the desired protocol? Or to have some sort of discovery? I know that there may be implementation restrictions on today's devices, but it's my personal belief that future devices will support multiple protocols (perhaps beyond the three above), and laying the groundwork for this today will be important. Tom. RoCE v1 and RoCE v2 are supported on a per-GID basis, not sure what got you to understand it is per-port. Because the protocol capabilities were being marked at the ib_device level. Ok, it's good that the protocol is per-endpoint. But I don't understand how it will work per-GID. What if the target node is RoCEv2 and on another subnet? How will it discover the remote's capability and establish the right protocol? How does the initiator select the protocol, if there is a choice? Another way of asking this question is, why is all this stuff in the driver, at the bottom of the stack? I think it should be in the rdmacm layer. Tom.
Re: [PATCH for-next 00/10] Add RoCE support to the mlx5 driver
On 8/20/2015 12:46 PM, Achiad Shochat wrote: Hi Doug, This patchset adds RoCE V1 and RoCE V2 support to the mlx5 device driver. Question - assuming I read them correctly, these patches add the RoCE v1 and RoCE v2 support on a per-port basis. That is, a port can be either IB, RoCE v1 or RoCE v2, but not a combination. Has any thought been put toward supporting these protocols on a per-QP basis, i.e. the caller of rdma_connect() can specify the desired protocol? Or to have some sort of discovery? I know that there may be implementation restrictions on today's devices, but it's my personal belief that future devices will support multiple protocols (perhaps beyond the three above), and laying the groundwork for this today will be important. Tom. This patchset was applied and tested over patchset Add RoCE v2 support which was sent to the mailing list by Matan Barak. Achiad Shochat (10): IB/mlx5: Support IB device's callback for getting the link layer IB/mlx5: Support IB device's callback for getting its netdev net/mlx5_core: Break down the vport mac address query function net/mlx5_core: Introduce access functions to enable/disable RoCE net/mlx5_core: Introduce access functions to query vport RoCE fields IB/mlx5: Extend query_device/port to support RoCE IB/mlx5: Set network_hdr_type upon RoCE responder completion IB/mlx5: Support IB device's callbacks for adding/deleting GIDs IB/mlx5: Add RoCE fields to Address Vector IB/mlx5: Support RoCE drivers/infiniband/hw/mlx5/ah.c | 32 ++- drivers/infiniband/hw/mlx5/cq.c | 17 ++ drivers/infiniband/hw/mlx5/main.c | 318 ++-- drivers/infiniband/hw/mlx5/mlx5_ib.h| 15 +- drivers/infiniband/hw/mlx5/qp.c | 42 +++- drivers/net/ethernet/mellanox/mlx5/core/vport.c | 139 ++- include/linux/mlx5/device.h | 26 ++ include/linux/mlx5/driver.h | 7 - include/linux/mlx5/mlx5_ifc.h | 10 +- include/linux/mlx5/qp.h | 21 +- include/linux/mlx5/vport.h | 8 + 11 files changed, 578 insertions(+), 57 deletions(-) -- To unsubscribe from this list: send the 
line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site
On 7/21/2015 7:33 AM, Steve Wise wrote: -Original Message- From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Tom Talpey Sent: Monday, July 20, 2015 7:15 PM To: Steve Wise; 'Jason Gunthorpe' Cc: 'Chuck Lever'; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org Subject: Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site On 7/20/2015 3:41 PM, Steve Wise wrote: -Original Message- From: Tom Talpey [mailto:t...@talpey.com] Sent: Monday, July 20, 2015 5:04 PM To: Steve Wise; 'Jason Gunthorpe' Cc: 'Chuck Lever'; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org Subject: Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site On 7/20/2015 2:16 PM, Steve Wise wrote: -Original Message- From: linux-nfs-ow...@vger.kernel.org [mailto:linux-nfs-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe Sent: Monday, July 20, 2015 4:06 PM To: Tom Talpey; Steve Wise Cc: Chuck Lever; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org Subject: Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site On Mon, Jul 20, 2015 at 01:34:16PM -0700, Tom Talpey wrote: On 7/20/2015 12:03 PM, Chuck Lever wrote: All HCA providers have an ib_get_dma_mr() verb. Thus rpcrdma_ia_open() will either grab the device's local_dma_key if one is available, or it will call ib_get_dma_mr() which is a 100% guaranteed fallback. I recall that in the past, some providers did not support mapping all of the machine's potential physical memory with a single dma_mr. If an rnic did/does not support 44-ish bits of length per region, for example. Looks like you are right, but the standard in kernel is to require ib_get_dma_mr, if the HCA can't do that, then it cannot be used on a big memory machine with kernel ULPs. Looking deeper, both amso1100 and cxgb3 seem limited to 32 bits of physical memory, and silently break all kernel ULPs if they are used on a modern machine with 4G. Is that right Steve? Yes. 
Based on that, should we remove the cxgb3 driver as well? Or at least can you fix it up to at least fail get_dma_mr if there is too much ram? I would like to keep cxgb3 around. I can add code to fail if the memory is 32b. Do you know how I get the amount of available ram? A) are you sure it's an unsigned length, i.e. is it really 31 bits? yes. B) why bother to check? Are machines with 4GB interesting, and worth supporting a special optimization? No, but cxgb3 is still interesting to user applications, and perhaps NFSRDMA using FRMRs. I'm obviously not making myself clear. I am suggesting that cxgb3 fail the ib_get_dma_mr() verb, regardless of installed memory. I am not suggesting it fail to load, or fail other memreg requests. It should work normally in all other respects. Even with its limitation, doesn't it have utility for someone using cxgb3 in an embedded 32b environment? Sure, do you mean making it conditional on #if sizeof(physaddr) = 32? That would make sense I guess. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
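To Steve's question about detecting "too much ram": in-kernel this would amount to comparing the top of physical memory (something like `max_pfn << PAGE_SHIFT`, plus whatever hotplug headroom matters) against the range the device's MRs can address. The following is a hypothetical user-space sketch of just the decision logic — `dma_mr_covers_ram` and `get_dma_mr_or_fail` are invented names, not kernel APIs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: decide whether a device whose MRs can only
 * address `addr_bits` bits of physical memory can safely implement
 * ib_get_dma_mr(), which must cover all of physical RAM.  In-kernel,
 * `top_of_ram` would come from something like max_pfn << PAGE_SHIFT;
 * here it is just a parameter. */
static bool dma_mr_covers_ram(unsigned int addr_bits, uint64_t top_of_ram)
{
    if (addr_bits >= 64)
        return true;
    return top_of_ram <= (UINT64_C(1) << addr_bits);
}

/* A driver limited to 31 usable bits (as Steve confirms for cxgb3)
 * would fail the verb whenever the check is false, instead of
 * silently wrapping addresses. */
static int get_dma_mr_or_fail(unsigned int addr_bits, uint64_t top_of_ram)
{
    return dma_mr_covers_ram(addr_bits, top_of_ram) ? 0 : -12; /* -ENOMEM */
}
```

Note that Tom's later point in the thread goes further: rather than checking installed memory at all, a limited device could simply always fail `ib_get_dma_mr()`, which also survives RAM hotplug.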
Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site
On 7/20/2015 1:55 PM, Chuck Lever wrote: On Jul 20, 2015, at 4:34 PM, Tom Talpey t...@talpey.com wrote: On 7/20/2015 12:03 PM, Chuck Lever wrote: All HCA providers have an ib_get_dma_mr() verb. Thus rpcrdma_ia_open() will either grab the device's local_dma_key if one is available, or it will call ib_get_dma_mr() which is a 100% guaranteed fallback. I recall that in the past, some providers did not support mapping all of the machine's potential physical memory with a single dma_mr. If an rnic did/does not support 44-ish bits of length per region, for example. The buffers affected by this change are small, so I’m confident that restriction would not be a problem here. It's not about the buffer size, it's about the region. Because the get_dma_mr does not specify a base address and length, the rnic must basically attempt to map a base of zero and a length of the largest physical offset. This is not the case with the previous phys_reg_mr, which specified the exact phys page range. What might break with such restricted hardware is ALLPHYSICAL on large-memory systems. ALLPHYSICAL does rely on a single DMA MR that covers all of the NFS client’s memory. Yes, indeed it always did, but that mode was not intended for general use. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site
On 7/20/2015 2:16 PM, Steve Wise wrote: -Original Message- From: linux-nfs-ow...@vger.kernel.org [mailto:linux-nfs-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe Sent: Monday, July 20, 2015 4:06 PM To: Tom Talpey; Steve Wise Cc: Chuck Lever; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org Subject: Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site On Mon, Jul 20, 2015 at 01:34:16PM -0700, Tom Talpey wrote: On 7/20/2015 12:03 PM, Chuck Lever wrote: All HCA providers have an ib_get_dma_mr() verb. Thus rpcrdma_ia_open() will either grab the device's local_dma_key if one is available, or it will call ib_get_dma_mr() which is a 100% guaranteed fallback. I recall that in the past, some providers did not support mapping all of the machine's potential physical memory with a single dma_mr. If an rnic did/does not support 44-ish bits of length per region, for example. Looks like you are right, but the standard in kernel is to require ib_get_dma_mr, if the HCA can't do that, then it cannot be used on a big memory machine with kernel ULPs. Looking deeper, both amso1100 and cxgb3 seem limited to 32 bits of physical memory, and silently break all kernel ULPs if they are used on a modern machine with 4G. Is that right Steve? Yes. Based on that, should we remove the cxgb3 driver as well? Or at least can you fix it up to at least fail get_dma_mr if there is too much ram? I would like to keep cxgb3 around. I can add code to fail if the memory is 32b. Do you know how I get the amount of available ram? A) are you sure it's an unsigned length, i.e. is it really 31 bits? B) why bother to check? Are machines with 4GB interesting, and worth supporting a special optimization? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site
On 7/20/2015 12:03 PM, Chuck Lever wrote: All HCA providers have an ib_get_dma_mr() verb. Thus rpcrdma_ia_open() will either grab the device's local_dma_key if one is available, or it will call ib_get_dma_mr() which is a 100% guaranteed fallback. I recall that in the past, some providers did not support mapping all of the machine's potential physical memory with a single dma_mr. If an rnic did/does not support 44-ish bits of length per region, for example. Have you verified that all current providers do in fact support the necessary physical address range for this, and that the requirement is stated in the verbs going forward? There is never any need to use the ib_reg_phys_mr() code path in rpcrdma_register_internal(), so it can be removed. The remaining logic in rpcrdma_{de}register_internal() is folded into rpcrdma_{alloc,free}_regbuf(). This is clean up only. No behavior change is expected. Signed-off-by: Chuck Lever chuck.le...@oracle.com Reviewed-by: Devesh Sharma devesh.sha...@avagotech.com Reviewed-By: Sagi Grimberg sa...@mellanox.com Tested-by: Devesh Sharma devesh.sha...@avagotech.com --- net/sunrpc/xprtrdma/verbs.c | 102 --- net/sunrpc/xprtrdma/xprt_rdma.h |1 2 files changed, 21 insertions(+), 82 deletions(-) diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index 1065808..da184f9 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++ b/net/sunrpc/xprtrdma/verbs.c @@ -1229,75 +1229,6 @@ rpcrdma_mapping_error(struct rpcrdma_mr_seg *seg) (unsigned long long)seg-mr_dma, seg-mr_dmalen); } -static int -rpcrdma_register_internal(struct rpcrdma_ia *ia, void *va, int len, - struct ib_mr **mrp, struct ib_sge *iov) -{ - struct ib_phys_buf ipb; - struct ib_mr *mr; - int rc; - - /* -* All memory passed here was kmalloc'ed, therefore phys-contiguous. 
-*/ - iov-addr = ib_dma_map_single(ia-ri_device, - va, len, DMA_BIDIRECTIONAL); - if (ib_dma_mapping_error(ia-ri_device, iov-addr)) - return -ENOMEM; - - iov-length = len; - - if (ia-ri_have_dma_lkey) { - *mrp = NULL; - iov-lkey = ia-ri_dma_lkey; - return 0; - } else if (ia-ri_bind_mem != NULL) { - *mrp = NULL; - iov-lkey = ia-ri_bind_mem-lkey; - return 0; - } - - ipb.addr = iov-addr; - ipb.size = iov-length; - mr = ib_reg_phys_mr(ia-ri_pd, ipb, 1, - IB_ACCESS_LOCAL_WRITE, iov-addr); - - dprintk(RPC: %s: phys convert: 0x%llx - registered 0x%llx length %d\n, - __func__, (unsigned long long)ipb.addr, - (unsigned long long)iov-addr, len); - - if (IS_ERR(mr)) { - *mrp = NULL; - rc = PTR_ERR(mr); - dprintk(RPC: %s: failed with %i\n, __func__, rc); - } else { - *mrp = mr; - iov-lkey = mr-lkey; - rc = 0; - } - - return rc; -} - -static int -rpcrdma_deregister_internal(struct rpcrdma_ia *ia, - struct ib_mr *mr, struct ib_sge *iov) -{ - int rc; - - ib_dma_unmap_single(ia-ri_device, - iov-addr, iov-length, DMA_BIDIRECTIONAL); - - if (NULL == mr) - return 0; - - rc = ib_dereg_mr(mr); - if (rc) - dprintk(RPC: %s: ib_dereg_mr failed %i\n, __func__, rc); - return rc; -} - /** * rpcrdma_alloc_regbuf - kmalloc and register memory for SEND/RECV buffers * @ia: controlling rpcrdma_ia @@ -1317,26 +1248,30 @@ struct rpcrdma_regbuf * rpcrdma_alloc_regbuf(struct rpcrdma_ia *ia, size_t size, gfp_t flags) { struct rpcrdma_regbuf *rb; - int rc; + struct ib_sge *iov; - rc = -ENOMEM; rb = kmalloc(sizeof(*rb) + size, flags); if (rb == NULL) goto out; - rb-rg_size = size; - rb-rg_owner = NULL; - rc = rpcrdma_register_internal(ia, rb-rg_base, size, - rb-rg_mr, rb-rg_iov); - if (rc) + iov = rb-rg_iov; + iov-addr = ib_dma_map_single(ia-ri_device, + (void *)rb-rg_base, size, + DMA_BIDIRECTIONAL); + if (ib_dma_mapping_error(ia-ri_device, iov-addr)) goto out_free; + iov-length = size; + iov-lkey = ia-ri_have_dma_lkey ? 
+ ia-ri_dma_lkey : ia-ri_bind_mem-lkey; + rb-rg_size = size; + rb-rg_owner = NULL; return rb; out_free: kfree(rb); out: - return ERR_PTR(rc); + return ERR_PTR(-ENOMEM); } /** @@ -1347,10 +1282,15 @@ out: void rpcrdma_free_regbuf(struct rpcrdma_ia *ia, struct rpcrdma_regbuf *rb) { -
Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site
On 7/20/2015 3:17 PM, Jason Gunthorpe wrote: On Mon, Jul 20, 2015 at 03:04:21PM -0700, Tom Talpey wrote: B) why bother to check? Are machines with 4GB interesting, and worth supporting a special optimization? mainline drivers shouldn't silently malfunction. I meant why bother to check, just always return failure. It's more honest, and robust. Someone might test with 2GB, conclude it worked, then deploy with 8GB and boom. Also, what about hotplug RAM? Same situation.
Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site
On 7/20/2015 5:34 PM, Chuck Lever wrote: On Jul 20, 2015, at 8:11 PM, Tom Talpey t...@talpey.com wrote: On 7/20/2015 4:36 PM, Chuck Lever wrote: On Jul 20, 2015, at 6:41 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: On Mon, Jul 20, 2015 at 06:31:11PM -0400, Chuck Lever wrote: On Jul 20, 2015, at 6:26 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: On Mon, Jul 20, 2015 at 03:03:11PM -0400, Chuck Lever wrote: + iov-length = size; + iov-lkey = ia-ri_have_dma_lkey ? + ia-ri_dma_lkey : ia-ri_bind_mem-lkey; + rb-rg_size = size; + rb-rg_owner = NULL; return rb; There is something odd looking about this.. ri_bind_mem is only setup in the RPCRDMA_ALLPHYSICAL and RPCRDMA_MTHCAFMR cases. RPCRDMA_FRMR doesn't set it up. So this code in rpcrdma_alloc_regbuf is never called for the FRMR case? If yes, then, how is FRMR working? There is absolutely no reason to use FRMR to register local send buffers, just use the global all memory lkey... If no, then that is an oops? I’ve tested this code, no oops. FRWR always uses the DMA lkey. xprtrdma does not use FRWR if IB_DEVICE_LOCAL_DMA_LKEY is not asserted. Ah, I see. Ok. Is there a reason to link FRWR and LOCAL_DMA_LKEY together? Tom would know. Tom Talpey or Tom Tucker? I don't think I used a dma_lkey at all in the original work, but there were some changes later. The commit which adds FRWR support to xprtrdma introduced a check in rpcrdma_ia_open: + case RPCRDMA_FRMR: + /* Requires both frmr reg and local dma lkey */ + if ((devattr.device_cap_flags +(IB_DEVICE_MEM_MGT_EXTENSIONS|IB_DEVICE_LOCAL_DMA_LKEY)) != + (IB_DEVICE_MEM_MGT_EXTENSIONS|IB_DEVICE_LOCAL_DMA_LKEY)) { This seems to be the only place the local DMA lkey requirement is expressed or enforced. But yeah, the local DMA lkey doesn’t seem to be used anywhere in the FRWR-specific parts of the code. I now vaguely remember someone telling me that both attributes were required, but a) I may have misunderstood and b) I'm sure something has changed. 
FRMR was brand new at the time.
Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site
On 7/20/2015 3:21 PM, Chuck Lever wrote: On Jul 20, 2015, at 5:55 PM, Tom Talpey t...@talpey.com wrote: On 7/20/2015 1:55 PM, Chuck Lever wrote: On Jul 20, 2015, at 4:34 PM, Tom Talpey t...@talpey.com wrote: On 7/20/2015 12:03 PM, Chuck Lever wrote: All HCA providers have an ib_get_dma_mr() verb. Thus rpcrdma_ia_open() will either grab the device's local_dma_key if one is available, or it will call ib_get_dma_mr() which is a 100% guaranteed fallback. I recall that in the past, some providers did not support mapping all of the machine's potential physical memory with a single dma_mr. If an rnic did/does not support 44-ish bits of length per region, for example. The buffers affected by this change are small, so I’m confident that restriction would not be a problem here. It's not about the buffer size, it's about the region. Because the get_dma_mr does not specify a base address and length, the rnic must basically attempt to map a base of zero and a length of the largest physical offset. Understood now, but: This is not the case with the previous phys_reg_mr, which specified the exact phys page range. rpcrdma_ia_open() fails immediately if IB_DEVICE_LOCAL_DMA_LKEY is not asserted _and_ ib_get_dma_mr() fails. We never get to the logic in rpcrdma_regbuf_alloc() in that case, so ib_reg_phys_mr() would still never be invoked. That really is dead code. If you prefer, I can adjust the patch description to remove the “100% guaranteed fallback” line. Ok, good and yes 100% sounds like a risky statement. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site
On 7/20/2015 4:36 PM, Chuck Lever wrote: On Jul 20, 2015, at 6:41 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: On Mon, Jul 20, 2015 at 06:31:11PM -0400, Chuck Lever wrote: On Jul 20, 2015, at 6:26 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote: On Mon, Jul 20, 2015 at 03:03:11PM -0400, Chuck Lever wrote: + iov-length = size; + iov-lkey = ia-ri_have_dma_lkey ? + ia-ri_dma_lkey : ia-ri_bind_mem-lkey; + rb-rg_size = size; + rb-rg_owner = NULL; return rb; There is something odd looking about this.. ri_bind_mem is only setup in the RPCRDMA_ALLPHYSICAL and RPCRDMA_MTHCAFMR cases. RPCRDMA_FRMR doesn't set it up. So this code in rpcrdma_alloc_regbuf is never called for the FRMR case? If yes, then, how is FRMR working? There is absolutely no reason to use FRMR to register local send buffers, just use the global all memory lkey... If no, then that is an oops? I’ve tested this code, no oops. FRWR always uses the DMA lkey. xprtrdma does not use FRWR if IB_DEVICE_LOCAL_DMA_LKEY is not asserted. Ah, I see. Ok. Is there a reason to link FRWR and LOCAL_DMA_LKEY together? Tom would know. Tom Talpey or Tom Tucker? I don't think I used a dma_lkey at all in the original work, but there were some changes later. I thought not all rnic's supported a dma_lkey. So requiring it seems like a bad idea, to me. Just use the code you have for fmr with frwr to get the lkey, probably move it to rpcrdma_ia_open .. Physical MR should create a 2nd MR dedicated for rkey use. But FMR is not supported by modern mlx5 and cxgb4? -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site
On 7/20/2015 3:41 PM, Steve Wise wrote: -Original Message- From: Tom Talpey [mailto:t...@talpey.com] Sent: Monday, July 20, 2015 5:04 PM To: Steve Wise; 'Jason Gunthorpe' Cc: 'Chuck Lever'; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org Subject: Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site On 7/20/2015 2:16 PM, Steve Wise wrote: -Original Message- From: linux-nfs-ow...@vger.kernel.org [mailto:linux-nfs-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe Sent: Monday, July 20, 2015 4:06 PM To: Tom Talpey; Steve Wise Cc: Chuck Lever; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org Subject: Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site On Mon, Jul 20, 2015 at 01:34:16PM -0700, Tom Talpey wrote: On 7/20/2015 12:03 PM, Chuck Lever wrote: All HCA providers have an ib_get_dma_mr() verb. Thus rpcrdma_ia_open() will either grab the device's local_dma_key if one is available, or it will call ib_get_dma_mr() which is a 100% guaranteed fallback. I recall that in the past, some providers did not support mapping all of the machine's potential physical memory with a single dma_mr. If an rnic did/does not support 44-ish bits of length per region, for example. Looks like you are right, but the standard in kernel is to require ib_get_dma_mr, if the HCA can't do that, then it cannot be used on a big memory machine with kernel ULPs. Looking deeper, both amso1100 and cxgb3 seem limited to 32 bits of physical memory, and silently break all kernel ULPs if they are used on a modern machine with 4G. Is that right Steve? Yes. Based on that, should we remove the cxgb3 driver as well? Or at least can you fix it up to at least fail get_dma_mr if there is too much ram? I would like to keep cxgb3 around. I can add code to fail if the memory is 32b. Do you know how I get the amount of available ram? A) are you sure it's an unsigned length, i.e. is it really 31 bits? yes. B) why bother to check? 
Are machines with 4GB interesting, and worth supporting a special optimization? No, but cxgb3 is still interesting to user applications, and perhaps NFSRDMA using FRMRs. I'm obviously not making myself clear. I am suggesting that cxgb3 fail the ib_get_dma_mr() verb, regardless of installed memory. I am not suggesting it fail to load, or fail other memreg requests. It should work normally in all other respects.
Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
On 7/14/2015 5:22 AM, Sagi Grimberg wrote: On 7/14/2015 10:37 AM, 'Christoph Hellwig' wrote: On Mon, Jul 13, 2015 at 03:36:44PM -0400, Tom Talpey wrote: On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote: I think what we need to support for now is FRMR as the primary target, and FMR as a secondar[y]. FMR is a *very* bad choice, for several reasons. 1) It's not supported by very many devices, in fact it might even be thought to be obsolete. It's supported by the Mellanox adapters, *Older* Mellanox adapters. mlx5 (and future drivers I assume) will no longer support FMRs. Right, but drop the word will. Mlx5 (the current ConnectX-4) doesn't support FMR. It's only supported by the mlx4 and mthca drivers. Based on looking at the consumers and the above table I think FMRs are still the better fallback for devices that don't support FR, It's better if you want it fast. Do you guys think FMR is actually fast? Because it's not. Measure it. It might have been marginally faster than other schemes in like, 2007, and only when the mempool was there to leave the registrations open. Don't go back there. Tom.
Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
On 7/14/2015 11:36 AM, 'Christoph Hellwig' wrote: On Tue, Jul 14, 2015 at 12:10:36PM +0300, Sagi Grimberg wrote: Having an API that does FRMR/FMR/PHYS_MR is even worse from the ULP PoV. If you expose an API that might schedule (PHYS_MR) it limits the context that the caller is allowed to call in. I'm 100% against a registration API that *might* schedule. Oh, I had missed that PHYS_MR might sleep. That might be the reason why everyone is avoiding them despite Tom preferring them over FMR. Any synchronous verb might sleep. They issue commands to the adapter across the PCIe bus, and need to wait for the result. FMR does, too. At least the work-request-based ones are explicit about waiting for the completion. Are you saying you want an API that does memory registration without blocking AND without sending a work request to hardware, ever?
Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
On 7/14/2015 3:32 PM, Steve Wise wrote: On Tue, Jul 14, 2015 at 02:25:50PM -0500, Steve Wise wrote: The benefit is that we don't have to check for iWARP protocol in the ULP. IB should have to pay the cost of FRMR for lkey on RDMA READ, I'm pretty sure you have to check for iWarp at somepoint.. You mean should not, yea? Ok. I'll check for iWARP. But don't tell me to remove the transport-specific hacks in this series when I post it! ;) FYI, in the Windows NDKPI (verbs-like kernel provider interface), there is a device attribute defined as follows: https://msdn.microsoft.com/en-us/library/windows/hardware/hh439851(v=vs.85).aspx NDK_ADAPTER_FLAG_RDMA_READ_SINK_NOT_REQUIRED 0x0002 Set if the provider does not require special access rights on the sink buffer for an RDMA read request. When this flag is set, the consumer is not required to use the NDK_MR_FLAG_RDMA_READ_SINK or NDK_OP_FLAG_RDMA_READ_SINK flags when it registers sink buffers for RDMA read requests. The consumer can also use logical address mappings directly (with a token obtained with the NDK_FN_GET_PRIVILEGED_MEMORY_REGION_TOKEN function) as RDMA read sink buffers. This is similar to access to local buffers for RDMA write, send, and receive operations. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 07/14] xprtrdma: Remove logic that constructs RDMA_MSGP type calls
On 7/13/2015 12:30 PM, Chuck Lever wrote: RDMA_MSGP type calls insert a zero pad in the middle of the RPC message to align the RPC request's data payload to the server's alignment preferences. A server can then page flip the payload into place to avoid a data copy in certain circumstances. However: ... Clean up the marshaling code by removing the logic that constructs RDMA_MSGP type calls. This also reduces the maximum send iovec size from four to just two elements. diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h index 8219011..0b50103 100644 --- a/net/sunrpc/xprtrdma/xprt_rdma.h +++ b/net/sunrpc/xprtrdma/xprt_rdma.h ... +#define RPCRDMA_MAX_IOVS (4) + So, shouldn't this constant be 2? The extra 2 iov's were used only for constructing the pad.
Re: Kernel fast memory registration API proposal [RFC]
On 7/14/2015 12:16 PM, Steve Wise wrote: -Original Message- From: Sagi Grimberg [mailto:sa...@dev.mellanox.co.il] Sent: Tuesday, July 14, 2015 11:12 AM To: Christoph Hellwig Cc: Jason Gunthorpe; linux-rdma@vger.kernel.org; Steve Wise; Or Gerlitz; Oren Duer; Chuck Lever; Bart Van Assche; Liran Liss; Hefty, Sean; Doug Ledford; Tom Talpey Subject: Re: Kernel fast memory registration API proposal [RFC] On 7/14/2015 6:33 PM, Christoph Hellwig wrote: On Tue, Jul 14, 2015 at 11:39:24AM +0300, Sagi Grimberg wrote: This is exactly what I don't want to do. I don't think that implicit posting is a good idea for reasons that I mentioned earlier: This is where I have a problem. Providing an API that may or may not post a work request on my QP is confusing, and I don't understand its semantics at all. Do I need to reserve slots on my QP? should I ask for a completion? If we suppress the completion will I see an error completion? What should I expect to find in the wr_id? We're much better off with keeping the post interface in place but have it much simpler. The ULP doesn't care if it needs to reserver the slot, and it generally doesn't care about the notification either unless it needs to handle an error. That's generally correct. But the ULP may want to suppress notifications of other posts as well (for example a storage initiator may want to suppress send completions since it needs to reserve the request context until it receives the status anyway - which is a receive completion). It needs to take what was posted and what not into account to avoid wrapping. If I'm not mistaken iWARP requires to ask for a completion once every send-queue length and empirically, IB drivers requires it too. So again, Correct. Indeed, correct. But this has nothing to do with the protocol. It's necessary because without a completion, the driver can't keep the work queue and completion queue pointers properly in sync with the hardware. 
If the tail pointer catches up and crosses the head pointer, boom. This is a pure verbs issue, completely local behavior, and applies to all providers. I don't think implicit posting is a very good idea. Specifically, it will work only if completions are never suppressed, if the upper layer is aware of the additional work requests and their ordering implications, and the overhead of the interrupts is acceptable. None of these preconditions are desirable. The completions are another mad nightmare in the RDMA stack API. Every other subsystem would just allow the submitter to attach a function pointer that handles the completion to the posted item. This is actually a good idea. And it can be easily added I think (at least to the mlx drivers, which I know better). It can help remove a switch statement for the ULPs when handling the completions. But no, the RDMA stack instead requires ID allocators and gigantic boilerplate code in the consumer to untangle that gigantic mess again. I don't think the wr_id was ever intended to be an allocated tag. It's the ULP context, usually a pointer to a transaction structure. Even with passing a function pointer you need to get back your original context, don't you?
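The signaling cadence discussed above — ask for a completion at least once per send-queue depth, or the provider can never retire work and the queue wraps into itself — can be modeled in a few lines. This is an illustrative user-space sketch, not driver code; `send_ring`, `ring_post`, and `ring_post_auto` are invented names, and it simplifies by retiring work immediately when a signaled completion would arrive:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a send queue with suppressed completions.  A signaled
 * completion retires the WR that carried it and every unsignaled WR
 * posted before it; without one, nothing is ever retired. */
struct send_ring {
    unsigned int depth;        /* send-queue depth                    */
    unsigned int posted;       /* WRs handed to the hardware          */
    unsigned int reaped;       /* WRs retired via a completion        */
    unsigned int since_signal; /* unsignaled posts since last signal  */
};

static bool ring_post(struct send_ring *r, bool signaled)
{
    if (r->posted - r->reaped == r->depth)
        return false;          /* queue full: the post fails */
    r->posted++;
    if (signaled) {
        /* One signaled completion retires everything up to and
         * including itself; head and tail resynchronize here. */
        r->reaped = r->posted;
        r->since_signal = 0;
    } else {
        r->since_signal++;
    }
    return true;
}

/* ULP-side policy: suppress most completions, but signal at least
 * one WR per queue depth so the ring can always drain. */
static bool ring_post_auto(struct send_ring *r)
{
    return ring_post(r, r->since_signal + 1 >= r->depth);
}
```

In a real ULP the signaled WR's `wr_id` would carry the context pointer Tom describes, so the completion handler can recover its transaction structure directly rather than through an ID allocator.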
Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
On 7/11/2015 6:25 AM, 'Christoph Hellwig' wrote: I think what we need to support for now is FRMR as the primary target, and FMR as a secondar[y]. FMR is a *very* bad choice, for several reasons. 1) It's not supported by very many devices, in fact it might even be thought to be obsolete. 2) It does not protect on byte boundaries, therefore it is unsafe to use for anything but page-sized, page-aligned RDMA operations. 3) It is a synchronous API, i.e. it is not work-request based and therefore is not very high performance. 4) It sometimes is used with a pool, which defers deregistration in the hopes of amortizing overhead. However, this deferral further increases the risk of remote access, including altering of memory contents after the fact. Personally, I'd recommend ib_get_phys_mr() over FMR. It at least doesn't suffer from issues 1, 2 and 4.
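Point 2 above can be made concrete with a little arithmetic: a page-granular registration scheme such as FMR exposes every byte from the start of the first page to the end of the last page the buffer touches, not just the buffer itself. A hypothetical sketch, assuming 4 KiB pages (the helper name is invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096u /* assumed page size for illustration */

/* Bytes a page-granular registration exposes *beyond* the intended
 * buffer: the distance from the start of the first touched page to
 * the end of the last touched page, minus the buffer length. */
static uint64_t page_granular_overexposure(uint64_t addr, uint64_t len)
{
    uint64_t start = addr & ~(uint64_t)(PAGE_SIZE_SKETCH - 1);
    uint64_t end   = (addr + len + PAGE_SIZE_SKETCH - 1) &
                     ~(uint64_t)(PAGE_SIZE_SKETCH - 1);
    return (end - start) - len;
}
```

A 100-byte buffer in the middle of a page leaves the other 3996 bytes of that page remotely accessible for the lifetime of the mapping — which is exactly why the pool deferral in point 4 compounds the problem.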
Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
On 7/13/2015 1:18 PM, Jason Gunthorpe wrote: On Fri, Jul 10, 2015 at 11:10:23PM -0400, Doug Ledford wrote: Black hat server is beyond the scope of this discussion. We cannot assume an all-trusted model here, there are many configurations to deploy NFS/iSCSI that don't assume that. Even if you assume it for the RDMA cases (which I stronlgy disagree with), it still must be proven to not weaken the existing TCP/IP case. So, a black hat server is on the table, attacking a client that the admin is not intending to use with RDMA, by forcing it to switch to RDMA before auth and exploiting the RDMA side. This is where the iwarp guys have to analyze and come back to say it is OK. Maybe iwarp can't get to rdma without auth or something... Two observations. One, auth is an Upper Layer matter. It's not the job of the transport to authenticate the peer, the user, etc. Upper layers do this, and iSCSI performs a login, NFSv4.1+ creates a session, SMB3 creates sessions on multiple connections, etc. All this happens after the connection is established. Two, the iSCSI and NFSv4.1 protocols have explicit support for iWARP step-up mode, which supports an initial connection in TCP (i.e. with RDMA disabled), and switching to RDMA mode dynamically. The IB and RoCE protocols do not support step-up mode, so in fact one could argue that iWARP is *better* positioned to support this. Tom. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 09/12] xprtrdma: Prepare rpcrdma_ep_post() for RDMA_NOMSG calls
On 7/10/2015 7:29 AM, Devesh Sharma wrote: we need to honor the max limits of device by checking dev_attr.max_sge? a vendor may not support 4 sges. iWARP requires a minimum of 4 send SGEs (draft-hilland-verbs 8.1.3.2) An RI MUST support at least four Scatter/Gather Elements per Scatter/Gather List when the Scatter/Gather List refers to the Data Source of a Send Operation Type or the Data Sink of a Receive Operation. An RI is NOT REQUIRED to support more than one Scatter/Gather Element per Scatter/Gather List when the Scatter/Gather List refers to the Data Source of an RDMA Write. I'm not certain if IB and RoCE state a similar minimum requirement, but it seems a very bad idea to have fewer.
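A defensive transport could reconcile Devesh's concern with the quoted iWARP minimum by validating the device attributes once at setup, before committing to a send-iovec layout. The helper names below are invented for illustration; only `dev_attr.max_sge` corresponds to a real verbs field:

```c
#include <assert.h>
#include <stdbool.h>

/* Per draft-hilland-verbs 8.1.3.2 (quoted above), an iWARP RI must
 * support at least four SGEs for sends and receives. */
#define IWARP_MIN_SEND_SGES 4

/* Can a device that reports `dev_max_sge` carry a transport that
 * plans on `needed` elements per send WR? */
static bool device_sge_ok(int dev_max_sge, int needed)
{
    return dev_max_sge >= needed;
}

/* If the transport can degrade (e.g. coalesce buffers), clamp its
 * plan to what the device actually reports rather than failing. */
static int choose_send_sges(int dev_max_sge, int wanted)
{
    return wanted < dev_max_sge ? wanted : dev_max_sge;
}
```

A ULP that cannot coalesce would instead refuse the device outright when `device_sge_ok` fails, which is the conservative reading of Devesh's question.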
Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
On 7/9/2015 6:53 PM, Jason Gunthorpe wrote: On Thu, Jul 09, 2015 at 06:18:26PM -0400, Doug Ledford wrote: A lot of this discussion is interesting and has gone off in an area that I think is worth pursuing in the long run. However, in the short run, this patch was a minor cleanup/simplification patch. These discussions are moving into complete redesigns and rearchitecting. Steve, I'm OK with the cleanup and would prefer that it stay separate from these larger issues. So, I was originally of the view the flags change was fine. But when I realized that it basically hides an ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) behind an opaque API: rdma_get_dma_mr(RDMA_MRR_READ_DEST) I changed my mind. Yes, I agree. Creating a writable privileged rmr for the RDMA Read sink buffer exposes all of physical memory to a simple arithmetic mistake. This should never be allowed in a production config. That said, for the use by the NFS/RDMA Server, the risk is significantly mitigated. The sink MR is never sent to the peer by the NFS upper layer, it is protected by the connection's PD, and it is enabled only when the RDMA Read is active. However, the very real risk remains. This should definitely not be the standard API. For iWarp the above creates an rkey that can RDMA WRITE to any place in system memory. If a remote guesses that rkey, your system is totally compromised. So it is insecure, and contrary to the advice in RFC5042. Seeing rdma_get_dma_mr(RDMA_MRR_READ_DEST) added all over the code base terrifies me, because it is utterly wrong, and looks harmless. The misstep is that enabling iSER for iWarp is not just this trivial change:

- device->mr = ib_get_dma_mr(device->pd, IB_ACCESS_LOCAL_WRITE);
+ mr_roles = RDMA_MRR_RECV | RDMA_MRR_SEND | RDMA_MRR_WRITE_SOURCE |
+            RDMA_MRR_READ_DEST;
+ device->mr = rdma_get_dma_mr(device->pd, mr_roles, 0);

Yep. It's the only safe, correct solution. But, it must also follow the path of NFS and use FRMR on iWarp for RDMA READ lkeys. 
This is the real quirk of iWarp, not that the MR flags are slightly different. From there, it is a logical wish: If Steve is going to FRMR'ize iSER to be acceptable security wise, I'd rather see that work done on in a general way. Hence the API discussion. Well, it's important to realize that NFS already does the Right Thing. So it isn't actually an urgent issue. There is time to discuss. Tom. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
On 7/10/2015 1:56 PM, Doug Ledford wrote: On 07/10/2015 12:11 PM, Jason Gunthorpe wrote: On Fri, Jul 10, 2015 at 09:22:24AM -0400, Tom Talpey wrote: and it is enabled only when the RDMA Read is active. ??? How is that done? ib_get_dma_mr is defined to return a remote usable rkey that is valid from when it returns until the MR is destroyed. NFS creates one MR early on, it does not seem to make one per RDMA READ? Actually you're right - the NFS server never tears down the MR. So, the old code is just as vulnerable as the new code. For the proposed iSER patch the problem is very acute; iSER makes a single PD and phys MR at boot time for each device. This means there is a single machine-wide unchanging rkey that allows remote physical memory access. An attacker only has to repeatedly open QPs to iSER and guess rkey values until they find it. Add in likely non-crypto randomness for rkeys, and I bet it isn't even that hard to do. The rkeys have a PD, which cannot be forged, so it's not a matter of attacking, but it is most definitely a system integrity risk; as I mentioned earlier, a simple arithmetic offset mistake can overwrite anything. So this seems to be a catastrophic security failing. s/security/integrity/ and I agree. As I recall, didn't the NFSoRDMA stack do even worse in the beginning (meaning it previously had a memory model that was simply "register all memory on the client and pass the rkey to the server and then tell the server when/where to read/write", which was subsequently removed in Chuck's NFSoRDMA cleanups over the last 5 or so kernel versions)? Yes, Doug, shamefully it did, but a) That was the client, and Jason and I are talking about the server b) When that code was written (2007), FRMR did not exist, and other memory registration methods had abysmal performance c) It was heavily deprecated with console warnings emitted d) Chuck did away with it since then. 
I'm not bringing this up to downplay the current iSER issue, but to demonstrate that we have long had a different trust model than the one in RFC5042. More below on that. I'll say something after requoting you. From there, it is a logical wish: If Steve is going to FRMR'ize iSER to be acceptable security wise, I'd rather see that work done in a general way. Hence the API discussion. Well, it's important to realize that NFS already does the Right Thing. So it isn't actually an urgent issue. There is time to discuss. It depends on what is going on with iWarp iSER. If the iWarp community thinks it should go ahead insecurely then fine, put a giant warning in dmesg and go ahead; otherwise iWarp needs to address this difference with a proper generic API, which, it appears, must wrap ib_post_send. Harder job :\ I'm sorry Steve for leading you down the wrong path with these flags, I did not fully realize exactly what the iWarp difference was at the start :( Are there security issues? Yes. Are we going to solve them in this patch set? No. Especially since those security issues extend beyond iSER + iWARP. There are currently 4 RDMA capable link types/protocols. Of those 4, 3 enforce locality (IB, OPA, RoCE all require lossless fabrics and cannot be sent out into the wild internet). RFC5042 is written with iWARP in mind because it *can* be sent out over the Internet, and therefore it is possible for someone to place an RNIC card directly on the Internet, and it must therefore deal with all of the attack vectors this entails. The Linux RDMA stack, as a whole, is nowhere near RFC5042 compliant. And it isn't just the kernel space at fault here. Early versions of opensm used to require that the master and slave machines have passwordless root ssh access to each other to copy around the guid2lid file, so that when the master failed over, the slave would have an up-to-date list. 
Because 3 of the 4 RDMA technology types enforce localized clusters, the majority of the stack and the implementations have utilized the cluster as the point of trust. In truth, I think it's simply been the easy way to do things. RFC5042 is very clearly a non-trust model where each machine, and each connection, is assumed to be a threat. So, long story short(ish): While this discussion is taking place in the cleanup thread, part of it involves the preceding patch set: iSER support for iWARP. Patch 4/5 in the iSER/iWARP series enables remote write on the device->mr registration, which constitutes the security issue we've been discussing. This cleanup patch set, then, is merely guilty of hiding that security issue, not introducing it. I think Steve and the other iSER folks should consider the wisdom of including the iWARP support as it stands. If they think that the common use case is in a sequestered cluster, and that the current trust domain is already not RFC5042 compliant, then I'm OK with enabling iSER support for iWARP, but I would want there to be a very clear warning in the dmesg about the trust domain issue. I might
Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
On 7/9/2015 1:01 PM, Jason Gunthorpe wrote: Laid out like this, I think it even means we can nuke the IB DMA API for these cases. rdma_post_read and rdma_post_complete_read are the two points that need dma api calls (cache flushes), and they can just do them internally. This also tells me that the above call sites must already exist in every ULP, so we, again, are not really substantially changing core control flow for the ULP. Are there more details that wreck the world? Two things come to mind - PD's, and virtualization. If there's no ib_get_dma_mr() call, what PD does the region get? One could argue it inherits the QP's (Emulex proposed such a PD-less MR in this year's OFS Developer's Workshop). But this could impose new conditions on ULP's; they would have to be aware of this affinity and it could affect their QP use. More importantly, if a guest can post FRWR work requests with physical addresses, what enforces their validity? The dma_mr provides a PD but it also allows the host partition to interpose on the call, setting up an IOMMU mapping, creating a new NIC TPT mapping, etc. Without this, it may be possible for a hostile guest to forge FRMR's and attack the host, or other guests. I didn't explore how errors work, but, I think, errors are just a labeling exercise:

if (wc is error && wc.wrid == read_wrid)
	rdma_error_complete_read(..., read_wrid, wc);

Error recovery blows up the QP, so we just need to bookkeep and get the MRs accounted for. The driver could do a synchronous clean up of whatever mess is left during the next create_qp, or on the PD destroy. This is a subtle area. If the driver posts silenced completions as you describe, there may not be a wc to reap. So either the driver or the ULP will need to post a sentinel, the completion of which indicates any prior silenced operations have actually done so. This can be hard to get right. And if the driver posts everything signaled, well, performance at high IOPS will be a challenge. 
The ULP is much better positioned to manage that. I'm with you on the flow control, btw. It's a new rule for the ULP to obey, but probably not too onerous. Remember though, verbs today return EAGAIN when the queue you're posting is full (a terrible choice IMO). So upper layers don't actually need to count WR's, unless they want to. Tom.
Re: [PATCH V3 1/5] RDMA/core: Transport-independent access flags
On 7/8/2015 3:08 PM, Jason Gunthorpe wrote: The MR stuff was never really designed, the HW people provided some capability and the SW side just raw exposed it, thoughtlessly. Jason, I don't disagree that the API can be improved. I have some responses to your statements below though. Why is code using iWarp calling ib_get_dma_mr with RDMA_MRR_READ_DEST/IB_ACCESS_REMOTE_WRITE ? That is insecure. Because the iWARP protocol requires it, which was very much an intentional decision. It actually is not insecure, as discussed in detail in RFC5042. However, it is different from Infiniband. Why on earth is NFS using frmr to create RDMA READ lkeys? Because NFS desires to have a single code path that works for all transports. In addition, using the IB dma_mr as an lkey means that large I/O (common with NFS) would require multiple RDMA Read operations, when the page list exceeded the local scatter/gather limit of the device. I think when you do that, it quickly becomes clear that iWarp's problem is not a seemingly minor issue with different flag bits, but that iWarp *cannot* use local_dma_lkey as a RDMA READ local buffer. Using ib_get_dma_mr(IB_ACCESS_REMOTE_WRITE) is an insecure work around. So iWarp (and only iWarp) should take the pain of spinning up temporary local MRs for RDMA READ. That is entirely the wrong layer to address this. It would prevent reuse of MRs, and require upper layers to be aware that this was the case - which is exactly the opposite of what you are trying to achieve. This should all be hidden under a common API and it shouldn't be sprinkled all over the place in NFS, iSER, etc. Are you arguing that all upper layer storage protocols take a single common approach to memory registration? That is a very different discussion. Tom. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 01/26] IB/Verbs: Implement new callback query_transport()
On 4/27/2015 2:52 PM, ira.weiny wrote: On Mon, Apr 27, 2015 at 09:39:05AM +0200, Michael Wang wrote: On 04/24/2015 05:12 PM, Liran Liss wrote: From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- [snip]

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 65994a1..d54f91e 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -75,10 +75,13 @@ enum rdma_node_type {
 };

 enum rdma_transport_type {
+	/* legacy for users */
 	RDMA_TRANSPORT_IB,
 	RDMA_TRANSPORT_IWARP,
 	RDMA_TRANSPORT_USNIC,
-	RDMA_TRANSPORT_USNIC_UDP
+	RDMA_TRANSPORT_USNIC_UDP,
+	/* new transport */
+	RDMA_TRANSPORT_IBOE,

Remove RDMA_TRANSPORT_IBOE - it is not a transport. ROCE uses IBTA transport. If any code should test for ROCE, it should invoke a specific helper, e.g., rdma_protocol_iboe(). This is what you currently call rdma_tech_iboe in patch 02/26. I think that pretty much everybody agrees that rdma_protocol_*() is a better name than rdma_tech_*(), right? So, let's change this. Sure, sounds reasonable now, about the IBOE, we still need it to separate the port support IB/ETH without the check on link-layer, So what about a new enum on protocol type? Like:

enum rdma_protocol {
	RDMA_PROTOCOL_IB,
	RDMA_PROTOCOL_IBOE,
	RDMA_PROTOCOL_IWARP,
	RDMA_PROTOCOL_USNIC_UDP
};

So we could use query_protocol() to ask the device to provide the protocol type, and there will be no mixing with the legacy transport type anymore :-) I'm ok with that. I like introducing a unique namespace which is clearly different from the previous transport one. I agree the word transport takes things into the weeds. But on the topic of naming protocols, I've been wondering, is there some reason that IBOE is being used instead of RoCE? The IBOE protocol used to exist and is not the same as the currently standardized RoCE, right? Also wondering, why add UDP to USNIC, is there a different USNIC? Naming multiple layers together seems confusing and maybe in the end will create more code to deal with the differences. 
For example, what token will RoCEv2 take? RoCE_UDP, RoCE_v2 or ... ?
Re: [PATCH v6 01/26] IB/Verbs: Implement new callback query_transport()
On 4/27/2015 5:36 PM, Doug Ledford wrote: On Mon, 2015-04-27 at 17:16 -0700, Tom Talpey wrote: On 4/27/2015 2:52 PM, ira.weiny wrote: On Mon, Apr 27, 2015 at 09:39:05AM +0200, Michael Wang wrote: On 04/24/2015 05:12 PM, Liran Liss wrote: [snip] Like: enum rdma_protocol { RDMA_PROTOCOL_IB, RDMA_PROTOCOL_IBOE, RDMA_PROTOCOL_IWARP, RDMA_PROTOCOL_USNIC_UDP }; So we could use query_protocol() to ask device provide the protocol type, and there will be no mixing with the legacy transport type anymore :-) I'm ok with that. I like introducing a unique namespace which is clearly different from the previous transport one. I agree the word transport takes things into the weeds. But on the topic of naming protocols, I've been wondering, is there some reason that IBOE is being used instead of RoCE? Because back in the day, when RoCE was accepted into the kernel, I'm pretty sure it was prior to the IBTA's final stamp of approval and before the name was set on RoCE, so IBoE was chosen upstream as the more correct name because it properly denoted what it was deemed to truly be: IB Verbs over Ethernet. Well history is all well and good, but it seems weird to not use the current, standard name in new code. It confuses me, anyway, because it seems like IBOE could easily mean something else. Also wondering, why add UDP to USNIC, is there a different USNIC? Yes, there are two transports, one a distinct ethertype and one that encapsulates USNIC in UDP. But this new enum isn't about transport, it's about protocol. So is there one USNIC protocol, with a raw layering and a separate one with UDP? Or is it one USNIC protocol with two different framings? Seems there should be at least the USNIC protocol, without the _UDP decoration, and I don't see it in the enum. Naming multiple layers together seems confusing and maybe in the end will create more code to deal with the differences. For example, what token will RoCEv2 take? RoCE_UDP, RoCE_v2 or ... ? Uncertain as of now. 
Ok, but it's imminent, right? What's the preference/guidance?
Re: [PATCH v6 01/26] IB/Verbs: Implement new callback query_transport()
On 4/27/2015 6:24 PM, Doug Ledford wrote: On Mon, 2015-04-27 at 17:53 -0700, Tom Talpey wrote: On 4/27/2015 5:36 PM, Doug Ledford wrote: On Mon, 2015-04-27 at 17:16 -0700, Tom Talpey wrote: On 4/27/2015 2:52 PM, ira.weiny wrote: On Mon, Apr 27, 2015 at 09:39:05AM +0200, Michael Wang wrote: On 04/24/2015 05:12 PM, Liran Liss wrote: [snip] Like: enum rdma_protocol { RDMA_PROTOCOL_IB, RDMA_PROTOCOL_IBOE, RDMA_PROTOCOL_IWARP, RDMA_PROTOCOL_USNIC_UDP }; So we could use query_protocol() to ask device provide the protocol type, and there will be no mixing with the legacy transport type anymore :-) I'm ok with that. I like introducing a unique namespace which is clearly different from the previous transport one. I agree the word transport takes things into the weeds. But on the topic of naming protocols, I've been wondering, is there some reason that IBOE is being used instead of RoCE? Because back in the day, when RoCE was accepted into the kernel, I'm pretty sure it was prior to the IBTA's final stamp of approval and before the name was set on RoCE, so IBoE was chosen upstream as the more correct name because it properly denoted what it was deemed to truly be: IB Verbs over Ethernet. Well history is all well and good, but it seems weird to not use the current, standard name in new code. It confuses me, anyway, because it seems like IBOE could easily mean something else. Having some of it refer to things as IBOE and some as ROCE would be similarly confusing, and switching existing IBOE usage to ROCE would cause pain to people with out of tree drivers (Lustre is the main one I know of). There's not a good answer here. There's only less sucky ones. Hrm. Well, avoiding churn is good but legacies can wear ya down. MHO it is worth doing since these are new enums/new patches. Also wondering, why add UDP to USNIC, is there a different USNIC? Yes, there are two transports, one a distinct ethertype and one that encapsulates USNIC in UDP. 
But this new enum isn't about transport, it's about protocol. So is there one USNIC protocol, with a raw layering and a separate one with UDP? Or is it one USNIC protocol with two different framings? Seems there should be at least the USNIC protocol, without the _UDP decoration, and I don't see it in the enum. Keep in mind that this enum was Liran's response to Michael's original patch. In the enum in Michael's patch, there was both USNIC and USNIC_UDP. Right! That's why I'm confused. Seems wrong to drop it, right? Naming multiple layers together seems confusing and maybe in the end will create more code to deal with the differences. For example, what token will RoCEv2 take? RoCE_UDP, RoCE_v2 or ... ? Uncertain as of now. Ok, but it's imminent, right? What's the preference/guidance? There is a patchset from Devesh Sharma at Emulex. It added the RoCEv2 capability. As I recall, it used a new flag added to the existing port capabilities bitmask and notably did not modify either the node type or link layer that are currently used to differentiate between the different protocols. That's from memory though, so I could be mistaken. But that patchset was not written with this patchset in mind, and merging the two may well change that. In any case, there is a proposed spec to follow, so for now that's the preference/guidance (unless this rework means that we need to depart from the spec on internals for implementation reasons). Well, if RoCEv2 uses the same protocol enum, that may introduce new confusion, for example there will be some new CM handling for UDP encap, source port selection, and of course vlan/tag assignment, etc. But if there is support under way, and everyone is clear, then, ok. Thanks. -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6 01/26] IB/Verbs: Implement new callback query_transport()
On 4/24/2015 8:23 AM, Michael Wang wrote: Add new callback query_transport() and implement for each HW. Mapping List:

        node-type  link-layer  old-transport  new-transport
...
mlx4    IB_CA      IB/ETH      IB             IB/IBOE
mlx5    IB_CA      IB          IB             IB
...

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 57c9809..b6f2f58 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -262,6 +262,12 @@ out:
 	return err;
 }

+static enum rdma_transport_type
+mlx5_ib_query_transport(struct ib_device *device, u8 port_num)
+{
+	return RDMA_TRANSPORT_IB;
+}
+

Just noticed that mlx5 is not being coded as RoCE-capable like mlx4. The mlx5 driver is for the new ConnectX-4, which implements all three of IB, RoCE and RoCEv2, right? Are those last two not supported?
Re: [PATCH v6 01/26] IB/Verbs: Implement new callback query_transport()
On 4/24/2015 10:35 AM, Michael Wang wrote: On 04/24/2015 04:29 PM, Tom Talpey wrote: On 4/24/2015 8:23 AM, Michael Wang wrote: [snip] +static enum rdma_transport_type +mlx5_ib_query_transport(struct ib_device *device, u8 port_num) +{ +return RDMA_TRANSPORT_IB; +} + Just noticed that mlx5 is not being coded as RoCE-capable like mlx4. The mlx5 driver is for the new ConnectX-4, which implements all three of IB, RoCE and RoCEv2, right? Are those last two not supported? I'm not sure about the details of mlx5, but according to the current implementation, it's transport is IB without a link-layer callback, which means it doesn't support IBoE... And there is no method to change the port link-layer type as mlx4 did. Hal, is that correct? From the Mellanox web: http://www.mellanox.com/related-docs/products/IB_Adapter_card_brochure_c_2_3.pdf ConnectX-4 ConnectX-4 adapter cards with Virtual Protocol Interconnect (VPI), supporting EDR 100Gb/s InfiniBand and 100Gb/s Ethernet connectivity, provide the... Virtual Protocol Interconnect VPI® flexibility enables any standard networking, clustering, storage, and management protocol to seamlessly operate over any converged network leveraging a consolidated software stack. Each port can operate on InfiniBand, Ethernet, or Data Center Bridging (DCB) fabrics, and supports IP over InfiniBand (IPoIB), Ethernet over InfiniBand (EoIB) and RDMA over Converged Ethernet (RoCE and RoCEv2). -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 10/27] IB/Verbs: Reform cm related part in IB-core cma/ucm
On 4/16/2015 11:22 AM, Michael Wang wrote: On 04/16/2015 04:31 PM, Hefty, Sean wrote: This is equivalent to today where the checks are per node rather than per port. Should all checks here be port 1 based or only certain ones like listen? For example, in connect/reject/disconnect, don't we already have port? Guess this can be dealt with later as this is not a regression from the current implementation. Yeah, these parts of cma may need more carving in future, like some new callback for different CM type as Sean suggested. Maybe directly using 1 could help to highlight the problem ;-) Only a few checks need to be per device. I think I pointed those out previously. Testing should show anywhere that we miss fairly quickly, since port would still be 0. For the checks that can be updated to be per port, I would rather go ahead and convert them. Got it, will be changed in next version :-) To be confirmed:

                    PORT ASSIGNED
rdma_init_qp_attr   Y
rdma_destroy_id     unknown
cma_listen_on_dev   N
cma_bind_loopback   N
rdma_listen         N

Why N? rdma_listen() can be constrained to a single port, right? And even if wildcarded, it needs to act on multiple ports, which is to say, it will fail only if no ports are eligible. Tom.

rdma_connect        Y
rdma_accept         Y
rdma_reject         Y
rdma_disconnect     Y
ib_ucm_add_one      N

Is this list correct? Regards, Michael Wang - Sean
Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW
On 4/10/2015 5:06 PM, ira.weiny wrote: On Fri, Apr 10, 2015 at 01:17:23PM -0600, Jason Gunthorpe wrote: ... So trying to sum up. Have we settled on the following capabilities? Helper function names aside.

/* legacy to communicate to userspace */
RDMA_LINK_LAYER_IB       = 0x0001,
RDMA_LINK_LAYER_ETH      = 0x0002,
RDMA_LINK_LAYER_MASK     = 0x000f, /* more bits? */
/* I'm hoping we don't need more bits here */

/* legacy to communicate to userspace */
RDMA_TRANSPORT_IB        = 0x0010,
RDMA_TRANSPORT_IWARP     = 0x0020,
RDMA_TRANSPORT_USNIC     = 0x0040,
RDMA_TRANSPORT_USNIC_UDP = 0x0080,
RDMA_TRANSPORT_MASK      = 0x00f0, /* more bits? */
/* I'm hoping we don't need more bits here */

/* New flags */
RDMA_MGMT_IB_MAD         = 0x0100, /* ib_mad module support */
RDMA_MGMT_QP0            = 0x0200, /* ib_mad QP0 support */
RDMA_MGMT_IB_SA          = 0x0400, /* ib_sa module support */
                                   /* NOTE includes IB Mcast */
RDMA_MGMT_IB_CM          = 0x0800, /* ib_cm module support */
RDMA_MGMT_OPA_MAD        = 0x1000, /* ib_mad OPA MAD support */
RDMA_MGMT_MASK           = 0x000fff00,

You explicitly say userspace - why would an upper layer need to know the link, transport and management details? These seem to be mid-layer matters.

RDMA_ADDR_IB             = 0x0010, /* Port does IB AH, PR, Pkey */
RDMA_ADDR_IBoE           = 0x0020, /* Port does IBoE AH, PR, Pkey */
/* Do we need iWarp (TCP) here? */
RDMA_ADDR_IB_MASK        = 0x0ff0,

I do see a ULP needing to know the address family needed to pass to rdma_connect and rdma_listen, so I would add IP, but not iWARP.

RDMA_SEPARATE_READ_SGE   = 0x1000,
RDMA_QUIRKS_MASK         = 0x00fff000

This is good, but it also needs an attribute to signal the need for a remote-writable RDMA Read sink buffer, for today's iWARP. Tom.
Re: [PATCH v2 00/17] IB/Verbs: IB Management Helpers
On 4/7/2015 8:25 AM, Michael Wang wrote: Mapping List:

            node-type  link-layer  old-transport  new-transport
nes         RNIC       ETH         IWARP          IWARP
amso1100    RNIC       ETH         IWARP          IWARP
cxgb3       RNIC       ETH         IWARP          IWARP
cxgb4       RNIC       ETH         IWARP          IWARP
usnic       USNIC_UDP  ETH         USNIC_UDP      USNIC_UDP
ocrdma      IB_CA      ETH         IB             IBOE
mlx4        IB_CA      IB/ETH      IB             IB/IBOE
mlx5        IB_CA      IB          IB             IB
ehca        IB_CA      IB          IB             IB
ipath       IB_CA      IB          IB             IB
mthca       IB_CA      IB          IB             IB
qib         IB_CA      IB          IB             IB

Can I rewind to ask a high-level question - what's the testing plan for all of this? Do you have folks lined up for verifying each of these adapters/networks, and what tests will they run?
Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW
On 4/8/2015 4:10 PM, Jason Gunthorpe wrote: On Wed, Apr 08, 2015 at 02:29:46PM -0400, Doug Ledford wrote: ...

rdma_port_get_read_sge(dev, port)
{
	if (rdma_transport_is_iwarp)
		return 1;
	return dev->port[port]->max_sge;
}

Hum, that is nice, but it doesn't quite fit with how the ULP needs to work. The max limit when creating a WR is the value passed into the qp_cap, not the device maximum limit. Agreed, and I will again say that not all devices necessarily support the same max_sge for all WR types. The current one-size-fits-all API may make the upper layer think so, but it's possibly being lied to. To do this properly we need to extend the qp_cap, and that is just too big a change. A one-bit iWarp quirk is OK for now. Yes, it would be a large-ish change, and I like Doug's choice of the word "quirk" to capture these as exceptions, until the means for addressing them is decided. Overall, I like Doug's proposals, especially from an upper layer perspective. I might suggest further refining them into categories: perhaps management, primarily of interest to the kernel and the plumbing of connections; and actual RDMA semantics, of interest to RDMA consumers.
Re: [PATCH v2 09/17] IB/Verbs: Use helper cap_read_multi_sge() and reform svc_rdma_accept()
On 4/7/2015 8:34 AM, Michael Wang wrote:

+/**
+ * cap_read_multi_sge - Check if the port of device has the capability
+ * RDMA Read Multiple Scatter-Gather Entries.
+ *
+ * @device: Device to be checked
+ * @port_num: Port number of the device
+ *
+ * Return 0 when port of the device don't support
+ * RDMA Read Multiple Scatter-Gather Entries.
+ */
+static inline int cap_read_multi_sge(struct ib_device *device, u8 port_num)
+{
+	return !rdma_transport_iwarp(device, port_num);
+}

This just papers over the issue we discussed earlier. How *many* entries does the device support? If a device supports one, or two, is that enough? How does the upper layer know the limit? This needs an explicit device attribute, to be fixed properly.

+
+/**
 * cap_ipoib - Check if the port of device has the capability
 * IP over Infiniband.
 *
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index e011027..604d035 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -118,8 +118,8 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
 static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count)
 {
-	if (rdma_node_get_transport(xprt->sc_cm_id->device->node_type) ==
-	    RDMA_TRANSPORT_IWARP)
+	if (!cap_read_multi_sge(xprt->sc_cm_id->device,
+				xprt->sc_cm_id->port_num))
 		return 1;
 	else
 		return min_t(int, sge_count, xprt->sc_max_sge);

This is incorrect. The RDMA Read max is not at all the same as the max_sge. It is a different operation, with a different set of work request parameters. In other words, the above same comment applies. 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 4e61880..e75175d 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -979,8 +979,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 	/*
 	 * Determine if a DMA MR is required and if so, what privs are required
 	 */
-	switch (rdma_node_get_transport(newxprt->sc_cm_id->device->node_type)) {
-	case RDMA_TRANSPORT_IWARP:
+	if (rdma_transport_iwarp(newxprt->sc_cm_id->device,
+				 newxprt->sc_cm_id->port_num)) {
 		newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV;

Do I read this correctly that it is forcing the read-with-invalidate capability to on for all iWARP devices? I don't think that is correct, for the legacy devices you're also supporting.

@@ -992,8 +992,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 		dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
 	} else
 		need_dma_mr = 0;
-	break;
-	case RDMA_TRANSPORT_IB:
+	} else if (rdma_ib_mgmt(newxprt->sc_cm_id->device,
+				newxprt->sc_cm_id->port_num)) {
 		if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
 			need_dma_mr = 1;
 			dma_mr_acc = IB_ACCESS_LOCAL_WRITE;

Now I'm even more confused. How is the presence of IB management related to needing a privileged lmr?
Re: [RFC PATCH 08/11] IB/Verbs: Use management helper has_iwarp() for iwarp-check
On 3/31/2015 3:39 AM, Michael Wang wrote: On 03/31/2015 12:35 AM, Jason Gunthorpe wrote: On Mon, Mar 30, 2015 at 05:10:12PM +0200, Michael Wang wrote: I found that actually we don't have to touch this one, which is only used by HW drivers currently. I'm having a hard time understanding this, the code in question was in net/sunrpc/xprtrdma/svc_rdma_recvfrom.c Which is the NFS ULP, not a device driver. I'm not familiar with this part too :-P but yes, it looks like a ULP to support NFS. It's the NFS server itself. Actually I'm thinking about Doug's idea to use rdma_transport_is_XX() instead of the current basic helper, thus maybe using rdma_transport_is_iwarp() here could be better, since it's actually a feature of iWARP tech that RDMA Read only supports one scatter-gather entry. No, you should expose an attribute to surface the maximum length of the remote gather list, which varies by adapter as well as protocol. The fact that iWARP is different from IB is not relevant, and conflates unrelated properties.
Re: [RFC PATCH 08/11] IB/Verbs: Use management helper has_iwarp() for iwarp-check
On 3/31/2015 7:41 AM, Michael Wang wrote: Hi, Tom Thanks for the comments :-) On 03/31/2015 01:19 PM, Tom Talpey wrote: [oops - repeating my reply with full cc's] [snip] Actually I'm thinking about Doug's idea to use rdma_transport_is_XX() instead of the current basic helper, thus maybe using rdma_transport_is_iwarp() here could be better, since it's actually a feature of iWARP tech that RDMA Read only supports one scatter-gather entry. No, you should expose an attribute to surface the maximum length of the remote gather list, which varies by adapter as well as protocol. The fact that iWARP is different from IB is not relevant, and conflates unrelated properties. To be confirmed, so your point is that the max-read-sges will be different even when the transport is the same iWARP, and that depends on the capability of the adapter, correct? Yes, in fact the iWARP protocol does not preclude multiple read SGEs, even though most iWARP implementations have chosen to support just one. Even for multi-SGE-capable adapters, there is a limit of SGL size, based on the adapter's work request format and other factors. So my argument is that upper layers can and should query that, not make a blanket decision based on protocol type. I currently only find this one place where we infer max-read-sges from transport type, so it looks more like a special case to me rather than a generic method we could expose... and also not very related to the IB management helper concept IMHO. It is most certainly not a special case, but you could decide to introduce it in many ways. I'm not commenting on that. My main concern is that you do not introduce a new and clumsy "is iWARP" rule as an adapter-specific API requirement to expose the RDMA Read SGE behavior. That's what your initial message seemed to imply?
Re: Proposal for simplifying NFS/RDMA client memory registration
On 2/26/2014 8:44 AM, Chuck Lever wrote: Hi- Shirley Ma and I are reviving work on the NFS/RDMA client code base in the Linux kernel. So far we’ve built and run functional tests to determine what is working and what is broken. One complication is the number of memory registration modes supported by the RPC/RDMA transport: there are seven. These were added over the years to support particular HCAs or as proof-of-concept. The transport chooses a registration mode at mount time based on what the link HCA supports. Not all HCAs support all memory registration modes, so our test matrix is quite large. I’d like to propose removing support for one or more of these memory registration modes in the name of making it easier to change this code and test it without breaking something that we can’t test. BOUNCEBUFFERS - All HCAs support this mode. Does not use RDMA READ and WRITE, and the client end copies data into place. RDMA is offloaded, but data copy is not. I’m told it was never intended for production use. REGISTER - Safe but relatively slow. Uses reg_phys_mr verb which is not supported in mlx4/mlx5, but all other HCAs/providers can use this mode. MEM_WINDOWS - Uses bind_mr verb. Safe, but supports only a narrow range of HCAs. MEM_WINDOWS_ASYNC - Not always safe, and only a narrow range of HCAs is supported. MTHCA_FMR - Uses alloc_fmr verb. Safe, reasonably fast, but only a narrow range of older HCAs is supported. The MTHCA FMR is not completely safe - it protects only on page boundaries, therefore the neighboring bytes are vulnerable to silent corruption (reads) and exposure (writes). It is quite correct that they are supported on only a specific set of legacy Mellanox HCAs. You should consider removing the code that looked for this PCI ID and attempted to alter the device's wire MTU, to overcome another of its limitations. FRMR - Safe, generally fast. Currently the preferred registration mode, but is not supported with some older HCAs/providers.
This should be, by far, the preferred mode. Also, if I recall correctly, the server depends on this mode being available/supported. However, it may not be supported by Soft iWARP. ALLPHYSICAL - Usually fast, but not safe as it exposes client memory. Physical addressing is used. All HCAs support this mode. "Not safe" is an understatement. It exposes all of client physical memory to the peer, for both read and write. A simple pointer error on the server will silently corrupt the client. This mode was intended only for testing, and in experimental deployments. Tom. I propose removing BOUNCEBUFFERS since it is not intended for production use. I propose removing ALLPHYSICAL and MEM_WINDOWS_ASYNC as they are not generally safe. RFC 5666 suggests that unsafe memory registration modes be avoided. I propose removing MEM_WINDOWS as it adds complexity without adding a lot of HCA compatibility. I propose removing MTHCA_FMR as I’m told it is hard to obtain HCAs we would need for testing this registration mode, and these are all old adapters anyway. This leaves NFS/RDMA client support for REGISTER and FRMR, which should cover all existing HCAs, and it is easy to test both of these memory registration modes with just one or two well-picked HCAs. We would contribute these changes to the client code base. The NFS/RDMA server code could use similar attention, but we are not volunteering to change it at this time. Thoughts/comments? -- Chuck Lever chuck[dot]lever[at]oracle[dot]com
Re: Helps to Decode rpc_debug Output
On 8/21/2013 11:55 AM, Wendy Cheng wrote: On Thu, Aug 15, 2013 at 11:08 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote: On Thu, Aug 15, 2013 at 5:46 AM, Tom Talpey t...@talpey.com wrote: On 8/14/2013 8:14 PM, Wendy Cheng wrote: Longer version of the question: I'm trying to enable NFS-RDMA on an embedded system (based on 2.6.38 kernel) as a client. The IB stacks are taken from OFED 1.5.4. NFS server is a RHEL 6.3 Xeon box. The connection uses the mlx4 driver. Memory registration is RPCRDMA_ALLPHYSICAL. There are many issues so far but I do manage to get nfs mount working. Simple file operations (such as ls, file read/write, scp, etc) seem to work as well. Yay ... got this up .. amazingly on a uOS that does not have much of the conventional kernel debug facilities. Congrats! One thing I'm still scratching my head over is that ... by looking at the raw IOPS, I don't see a dramatic difference between NFS-RDMA vs. NFS over IPOIB (TCP). Sounds like your bottleneck lies in some other component. What's the storage, for example? RDMA won't do a thing to improve a slow disk. Or, what kind of IOPS rate are you seeing? If these systems aren't generating enough load to push a CPU limit, then shifting the protocol on the same link might not yield much. However, the total run time differs greatly. NFS over RDMA seems to take a much longer time to finish (vs. NFS over IPOIB). Not sure why that is. Maybe caused by the constant connect/disconnect triggered by reestablish_timeout? The connection re-establish is known to be expensive on this uOS. Um, yes, of course. Fix that before drawing any conclusions. Tom.
Re: Helps to Decode rpc_debug Output
On 8/14/2013 8:14 PM, Wendy Cheng wrote: Longer version of the question: I'm trying to enable NFS-RDMA on an embedded system (based on 2.6.38 kernel) as a client. The IB stacks are taken from OFED 1.5.4. NFS server is a RHEL 6.3 Xeon box. The connection uses the mlx4 driver. Memory registration is RPCRDMA_ALLPHYSICAL. There are many issues so far but I do manage to get nfs mount working. Simple file operations (such as ls, file read/write, scp, etc) seem to work as well. You're probably seeing connection loss from bad RDMA handles, which come into play when you send large r/w traffic. The fact that small i/o such as simple ls, and non-NFS traffic such as scp, work means the network itself is ok. While trying to run iozone to see whether the performance gain can justify the development effort, the program runs until it reaches 2MB file size - at that point, RDMA CM sends out a TIMEWAIT_EXIT event, the xprt is disconnected, and all IOs on that share hang. IPOIB still works though. Not sure what would be the best way to debug this. I would suggest enabling RPC transport debugging, and any tracing in the IB stack itself, looking to see if you can find any patterns. You may want to look at the server side, too. Unfortunately, because you are using Infiniband, packet capture is going to be next to impossible. You might try using an iWARP adapter, or one which can be sniffed, if you suspect a traffic issue. Why did you replace the Linux IB stack with OFED? Did you also take the NFS/RDMA from that package, and if so are you sure that it all is working properly? Doesn't 2.6.38 already have all this? Tom.
Re: NFS over RDMA benchmark
On 4/30/2013 1:09 AM, Yan Burman wrote: -Original Message- From: J. Bruce Fields [mailto:bfie...@fieldses.org] Sent: Sunday, April 28, 2013 17:43 To: Yan Burman Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org; linux-...@vger.kernel.org; Or Gerlitz Subject: Re: NFS over RDMA benchmark On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote: On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com I've been trying to do some benchmarks for NFS over RDMA and I seem to only get about half of the bandwidth that the HW can give me. My setup consists of 2 servers each with 16 cores, 32Gb of memory, and Mellanox ConnectX3 QDR card over PCI-e gen3. These servers are connected to a QDR IB switch. The backing storage on the server is tmpfs mounted with noatime. I am running kernel 3.5.7. When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4- 512K. When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the same block sizes (4-512K). running over IPoIB-CM, I get 200- 980MB/sec. ... I am trying to get maximum performance from a single server - I used 2 processes in fio test - more than 2 did not show any performance boost. I tried running fio from 2 different PCs on 2 different files, but the sum of the two is more or less the same as running from single client PC. I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is also way higher now). For some reason when I had intel IOMMU enabled, the performance dropped significantly. I now get up to ~95K IOPS and 4.1GB/sec bandwidth. Excellent, but is that 95K IOPS a typo? At 4KB, that's less than 400MBps. What is the client CPU percentage you see under this workload, and how different are the NFS/RDMA and NFS/IPoIB overheads? Now I will take care of the issue that I am running only at 40Gbit/s instead of 56Gbit/s, but that is another unrelated problem (I suspect I have a cable issue). 
This is still strange, since ib_send_bw with intel iommu enabled did get up to 4.5GB/sec, so why did intel iommu affect only nfs code? You'll need to do more profiling to track that down. I would suspect that ib_send_bw is using some sort of direct hardware access, bypassing the IOMMU management and possibly performing no dynamic memory registration. The NFS/RDMA code goes via the standard kernel DMA API, and correctly registers/deregisters memory on a per-i/o basis in order to provide storage data integrity. Perhaps there are overheads in the IOMMU management which can be addressed.
Re: NFS over RDMA benchmark
On 4/30/2013 1:09 AM, Yan Burman wrote: I now get up to ~95K IOPS and 4.1GB/sec bandwidth. ... ib_send_bw with intel iommu enabled did get up to 4.5GB/sec BTW, you may want to verify that these are the same GB. Many benchmarks say KB/MB/GB when they really mean KiB/MiB/GiB. At GB/GiB, the difference is about 7.5%, very close to the difference between 4.1 and 4.5. Just a thought.
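The GB-versus-GiB point above is easy to check with a little arithmetic; the helpers below only illustrate the unit math.

```c
#include <assert.h>

/* 1 GiB = 2^30 bytes; 1 GB = 10^9 bytes.
 * The gap between the two units is about 7.4%. */
double unit_gap_percent(void)
{
	return ((1024.0 * 1024.0 * 1024.0) / 1e9 - 1.0) * 100.0;
}

/* Convert a rate reported in GB/s to GiB/s. */
double gb_to_gib(double gb)
{
	return gb * 1e9 / (1024.0 * 1024.0 * 1024.0);
}
```

For example, 4.5 "GB/s" is about 4.19 GiB/s, so if one tool reports GiB while the other reports GB, much of the 4.1-vs-4.5 gap disappears.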
Re: NFS over RDMA benchmark
On 4/30/2013 10:23 AM, Yan Burman wrote: -Original Message- From: Tom Talpey [mailto:t...@talpey.com] On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote: I finally got up to 4.1GB/sec bandwidth with RDMA (ipoib-CM bandwidth is also way higher now). For some reason when I had intel IOMMU enabled, the performance dropped significantly. I now get up to ~95K IOPS and 4.1GB/sec bandwidth. Excellent, but is that 95K IOPS a typo? At 4KB, that's less than 400MBps. That is not a typo. I get 95K IOPS with randrw test with block size 4K. I get 4.1GBps with block size 256K randread test. Well, then I suggest you focus on whether you are satisfied with a high bandwidth goal or a high IOPS goal. They are two very different things, and clearly there are still significant issues to track down in the server. What is the client CPU percentage you see under this workload, and how different are the NFS/RDMA and NFS/IPoIB overheads? NFS/RDMA has about 20-30% more CPU usage than NFS/IPoIB, but RDMA has almost twice the bandwidth of IPoIB. So, for 125% of the CPU, RDMA is delivering 200% of the bandwidth. A common reporting approach is to calculate cycles per Byte (roughly, CPU/MB/sec), and you'll find this can be a great tool for comparison when overhead is a consideration. Overall, CPU usage gets up to about 20% for randread and 50% for randwrite. This is *client* CPU? Writes require the server to take additional overhead to make RDMA Read requests, but the client side is doing practically the same thing for the read vs write path. Again, you may want to profile more deeply to track that difference down.
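The cycles-per-Byte metric Tom mentions can be computed directly. The clock rate and utilization numbers in the comment are hypothetical, chosen only to show the shape of the calculation, not measurements from this thread.

```c
#include <assert.h>

/* cycles/Byte = busy_fraction * clock_hz * num_cores / bytes_per_sec.
 * Example (assumed numbers): a 16-core box at 2.5 GHz running 25% busy
 * while moving 4.1 GB/s spends about 2.4 cycles on every byte. */
double cycles_per_byte(double busy_fraction, double clock_hz,
		       int num_cores, double bytes_per_sec)
{
	return busy_fraction * clock_hz * (double)num_cores / bytes_per_sec;
}
```

A transport that delivers twice the bandwidth for 1.25x the CPU, as in the numbers above, comes out well ahead on this metric even though its raw CPU percentage is higher.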
Re: NFS over RDMA benchmark
On 4/25/2013 1:18 PM, Wendy Cheng wrote: On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey t...@talpey.com wrote: On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote: So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1 tar ball) ... Here is a random thought (not related to the rb tree comment). The inflight packet count seems to be controlled by xprt_rdma_slot_table_entries that is currently hard-coded as RPCRDMA_DEF_SLOT_TABLE (32) (?). I'm wondering whether it could help with the bandwidth number if we pump it up, say 64 instead ? Not sure whether FMR pool size needs to get adjusted accordingly though. 1) The client slot count is not hard-coded, it can easily be changed by writing a value to /proc and initiating a new mount. But I doubt that increasing the slot table will improve performance much, unless this is a small-random-read, and spindle-limited workload. Hi Tom ! It was a shot in the dark :) .. as our test bed has not been set up yet. However, since I'll be working on (very) slow clients, increasing this buffer is still interesting (to me). I don't see where it is controlled by a /proc value (?) - but that is not a concern at this The entries show up in /proc/sys/sunrpc (IIRC). The one you're looking for is called rdma_slot_table_entries. moment as a /proc entry is easy to add. More questions on the server though (see below) ... 2) The observation appears to be that the bandwidth is server CPU limited. Increasing the load offered by the client probably won't move the needle, until that's addressed. Could you give more hints on which part of the path is CPU limited ? Sorry, I don't. The profile showing 25% of the 16-core, 2-socket server spinning on locks is a smoking, flaming gun though. Maybe Tom Tucker has some ideas on the srv rdma code, but it could also be in the sunrpc or infiniband driver layers, can't really tell without the call stacks. Is there a known Linux-based filesystem that is reasonably tuned for NFS-RDMA ?
Are there any specific filesystem features that would work well with NFS-RDMA? I'm wondering, when disk+FS are added into the configuration, how much advantage would NFS-RDMA get when compared with a plain TCP/IP transport, say IPoIB on CM? NFS-RDMA is not really filesystem dependent, but certainly there are considerations for filesystems to support NFS, and of course the goal in general is performance. NFS-RDMA is a network transport, applicable to both client and server. Filesystem choice is a server consideration. I don't have a simple answer to your question about how much better NFS-RDMA is over other transports. Architecturally, a lot. In practice, there are many, many variables. Have you seen RFC5532, which I cowrote with the late Chet Juszczak? You may find it's still quite relevant. http://tools.ietf.org/html/rfc5532
Re: NFS over RDMA benchmark
On 4/25/2013 3:01 PM, Phil Pishioneri wrote: On 4/25/13 1:18 PM, Wendy Cheng wrote: On Wed, Apr 24, 2013 at 11:26 AM, Tom Talpey t...@talpey.com wrote: 1) The client slot count is not hard-coded, it can easily be changed by writing a value to /proc and initiating a new mount. But I doubt that increasing the slot table will improve performance much, unless this is a small-random-read, and spindle-limited workload. It was a shot in the dark :) .. as our test bed has not been set up yet. However, since I'll be working on (very) slow clients, increasing this buffer is still interesting (to me). I don't see where it is controlled by a /proc value (?) - but that is not a concern at this moment as a /proc entry is easy to add. More questions on the server though (see below) ... Might there be confusion between the RDMA slot table and the TCP/UDP ones (which have proc entries under /proc/sys/sunrpc)? No, the xprtrdma.ko creates similar slot table controls when it loads. See the names below, prefixed with rdma: tmt@Home:~$ ls /proc/sys/sunrpc max_resvport nfsd_debug nlm_debug tcp_fin_timeout tcp_slot_table_entries udp_slot_table_entries min_resvport nfs_debug rpc_debug tcp_max_slot_table_entries transports tmt@Home:~$ sudo insmod xprtrdma tmt@Home:~$ ls /proc/sys/sunrpc max_resvport nlm_debug rdma_memreg_strategy tcp_fin_timeout udp_slot_table_entries min_resvport rdma_inline_write_padding rdma_pad_optimize tcp_max_slot_table_entries nfsd_debug rdma_max_inline_read rdma_slot_table_entries tcp_slot_table_entries nfs_debug rdma_max_inline_write rpc_debug transports tmt@Home:~$
Re: NFS over RDMA benchmark
On 4/24/2013 2:04 PM, Wendy Cheng wrote: On Wed, Apr 24, 2013 at 9:27 AM, Wendy Cheng s.wendy.ch...@gmail.com wrote: On Wed, Apr 24, 2013 at 8:26 AM, J. Bruce Fields bfie...@fieldses.org wrote: On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote: On Wed, Apr 24, 2013 at 12:35:03PM +, Yan Burman wrote: Perf top for the CPU with high tasklet count gives: samples pcnt RIP function DSO 2787.00 24.1% 81062a00 mutex_spin_on_owner /root/vmlinux I guess that means lots of contention on some mutex? If only we knew which one. perf should also be able to collect stack statistics, I forget how. Googling around I think we want: perf record -a --call-graph (give it a chance to collect some samples, then ^C) perf report --call-graph --stdio I have not looked at NFS RDMA (and 3.x kernel) source yet. But see that rb_prev up in the #7 spot ? Do we have a Red Black tree somewhere in the paths ? Trees like that require extensive locking. So I did a quick read on sunrpc/xprtrdma source (based on OFA 1.5.4.1 tar ball) ... Here is a random thought (not related to the rb tree comment). The inflight packet count seems to be controlled by xprt_rdma_slot_table_entries that is currently hard-coded as RPCRDMA_DEF_SLOT_TABLE (32) (?). I'm wondering whether it could help with the bandwidth number if we pump it up, say 64 instead ? Not sure whether FMR pool size needs to get adjusted accordingly though. 1) The client slot count is not hard-coded, it can easily be changed by writing a value to /proc and initiating a new mount. But I doubt that increasing the slot table will improve performance much, unless this is a small-random-read, and spindle-limited workload. 2) The observation appears to be that the bandwidth is server CPU limited. Increasing the load offered by the client probably won't move the needle, until that's addressed.
In short, if anyone has benchmark setup handy, bumping up the slot table size as the following might be interesting: --- ofa_kernel-1.5.4.1.orig/include/linux/sunrpc/xprtrdma.h 2013-03-21 09:19:36.233006570 -0700 +++ ofa_kernel-1.5.4.1/include/linux/sunrpc/xprtrdma.h 2013-04-24 10:52:20.934781304 -0700 @@ -59,7 +59,7 @@ * a single chunk type per message is supported currently. */ #define RPCRDMA_MIN_SLOT_TABLE (2U) -#define RPCRDMA_DEF_SLOT_TABLE (32U) +#define RPCRDMA_DEF_SLOT_TABLE (64U) #define RPCRDMA_MAX_SLOT_TABLE (256U) #define RPCRDMA_DEF_INLINE (1024) /* default inline max */ -- Wendy
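As Tom notes upthread, the slot table is a runtime tunable (via /proc/sys/sunrpc/rdma_slot_table_entries), not just this compile-time default. A minimal standalone model of the bounds involved, reusing the MIN/DEF/MAX constants from the patch, looks like this; the clamp_slot_table() helper is an illustration, not the actual sunrpc sysctl code.

```c
#include <assert.h>

#define RPCRDMA_MIN_SLOT_TABLE (2U)
#define RPCRDMA_DEF_SLOT_TABLE (32U)
#define RPCRDMA_MAX_SLOT_TABLE (256U)

/* Model of how a sysctl-style tunable is bounded: a requested value is
 * clamped into [MIN, MAX], so bumping DEF only changes the starting
 * point, not the reachable range. */
unsigned int clamp_slot_table(unsigned int requested)
{
	if (requested < RPCRDMA_MIN_SLOT_TABLE)
		return RPCRDMA_MIN_SLOT_TABLE;
	if (requested > RPCRDMA_MAX_SLOT_TABLE)
		return RPCRDMA_MAX_SLOT_TABLE;
	return requested;
}
```

In other words, Wendy's 64-slot experiment is already reachable without recompiling, since 64 sits inside the [2, 256] range the header allows.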