Re: [PATCH 3/3] IB/mlx4: Report checksum offload cap when query device

2015-09-21 Thread Or Gerlitz
On Tue, Sep 22, 2015 at 12:59 AM, Doug Ledford  wrote:

> Here's the only matrix of IP checksumming that makes sense:
>
> 1) UD over IB (because it is one of the supported IPoIB types)
> 2) RC over IB (same as above)
> 3) Raw ETH over Eth (because IP over Eth makes sense and is a common
> type of packet to send on Raw Eth, but Raw Eth will never be sent on IB
> as it isn't supported there at all)
>
> Anything else would require adding more Raw ETH QPs elsewhere, or
> expanding the IPoIB spec to include more connection types.

OK, I'm good with that. So you want two new bits of device caps, and
not the overly fancy matrix of checksum-caps-per-QP-type-and-link-type.

>> Really, there's enough spare bits in ib_device_cap_flags that
>> you could do away with the new caps entirely.

Yes, Bodong, please go ahead and use bits 26,27 of
include/rdma/ib_verbs.h :: enum ib_device_cap_flags
for the two new caps Doug asked for.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 00/24] New fast registration API

2015-09-21 Thread santosh.shilim...@oracle.com

Hi Sagi,

On 9/20/15 2:36 AM, Sagi Grimberg wrote:

Hi Santosh,


Nice to see this consolidation happening. I too don't have access to
iWARP hardware for RDS testing, but I will use this series, convert our
WIP IB fastreg code, and see how it goes.


I'm very pleased to hear about this WIP. Please feel free to share
anything you have (code and questions/dilemmas) with the list. Also, if
you have more suggestions on how we can do better from your PoV we'd
love to hear about it.


So as promised, I tried to test your series. Although your github
branch [1] 'reg_api.3' mostly has 4.3-rc1 contents, it isn't based on
4.3-rc1, so I cherry-picked the patches and created
'rdma/sagi/reg_api.3_cherrypick' [2]. I hit a conflict with the iser
patch, so I just dropped that one.

As mentioned earlier, I have a WIP RDS fastreg branch [3]
which is functional (at least I can RDMA messages across
nodes ;-)). Merging [2] and [3], I created [4] and applied
a delta change based on your other patches. I saw an
ib_post_send() failure, with my HCA driver returning '-EINVAL'.
I didn't debug it further, but the opcode and num_sge were set
correctly, so I shouldn't have seen it. Doing a memset() on
reg_wr seems to have fixed the ib_post_send() failure.

But then I ran into remote access errors, which tells me that I
have messed up the setup (rkey, SGE setup, or access flags)
or am missing some other patch(es) in my test tree [4]. The
delta patch is the top commit on [4].

Please let me know if you spot something which I missed.

Regards,
Santosh

[1] https://github.com/sagigrimberg/linux/tree/reg_api.3
[2] https://git.kernel.org/cgit/linux/kernel/git/ssantosh/linux.git/log/?h=rdma/sagi/reg_api.3_cherrypick
[3] https://git.kernel.org/cgit/linux/kernel/git/ssantosh/linux.git/log/?h=net/rds/4.3-fr-wip
[4] https://git.kernel.org/cgit/linux/kernel/git/ssantosh/linux.git/commit/?h=test/reg_api.3/rds


Re: [PATCH v1 05/18] xprtrdma: Replace send and receive arrays

2015-09-21 Thread Chuck Lever

> On Sep 20, 2015, at 3:52 AM, Sagi Grimberg  wrote:
> 
> On 9/17/2015 11:44 PM, Chuck Lever wrote:
>> The rb_send_bufs and rb_recv_bufs arrays are used to implement a
>> pair of stacks for keeping track of free rpcrdma_req and rpcrdma_rep
>> structs. Replace those arrays with free lists.
>> 
>> To allow more than 512 RPCs in-flight at once, each of these arrays
>> would be larger than a page (assuming 8-byte addresses and 4KB
>> pages). Allowing up to 64K in-flight RPCs (as TCP now does), each
>> buffer array would have to be 128 pages. That's an order-6
>> allocation. (Not that we're going there.)
>> 
>> A list is easier to expand dynamically. Instead of allocating a
>> larger array of pointers and copying the existing pointers to the
>> new array, simply append more buffers to each list.
>> 
>> This also makes it simpler to manage receive buffers that might
>> catch backwards-direction calls, or to post receive buffers in
>> bulk to amortize the overhead of ib_post_recv.
>> 
>> Signed-off-by: Chuck Lever 
> 
> Hi Chuck,
> 
> I get the idea of this patch, but it is a bit confusing (to a
> non-educated reader).

OK, let’s see if there’s room for additional improvement.


> Can you explain why sometimes you call put/get_locked routines
> and sometimes you open-code them?

Are you talking about the later patch that adds support for
receiving backwards calls? That probably should use the
existing helpers, shouldn't it?


> And is it mandatory to have
> the callers lock before calling get/put? Perhaps the code would
> be simpler if the get/put routines would take care of locking
> since rb_lock looks dedicated to them.

Not sure I understand this comment; I thought that the helpers
were already doing the locking.


> 
>> ---
>>  net/sunrpc/xprtrdma/verbs.c |  141 +--
>>  net/sunrpc/xprtrdma/xprt_rdma.h |9 +-
>>  2 files changed, 66 insertions(+), 84 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index ac1345b..8d99214 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -962,44 +962,18 @@ rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt)
>>  {
>>  struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
>>  struct rpcrdma_ia *ia = &r_xprt->rx_ia;
>> -struct rpcrdma_create_data_internal *cdata = &r_xprt->rx_data;
>> -char *p;
>> -size_t len;
>>  int i, rc;
>> 
>> -buf->rb_max_requests = cdata->max_requests;
>> +buf->rb_max_requests = r_xprt->rx_data.max_requests;
>>  spin_lock_init(&buf->rb_lock);
>> 
>> -/* Need to allocate:
>> - *   1.  arrays for send and recv pointers
>> - *   2.  arrays of struct rpcrdma_req to fill in pointers
>> - *   3.  array of struct rpcrdma_rep for replies
>> - * Send/recv buffers in req/rep need to be registered
>> - */
>> -len = buf->rb_max_requests *
>> -(sizeof(struct rpcrdma_req *) + sizeof(struct rpcrdma_rep *));
>> -
>> -p = kzalloc(len, GFP_KERNEL);
>> -if (p == NULL) {
>> -dprintk("RPC:   %s: req_t/rep_t/pad kzalloc(%zd) failed\n",
>> -__func__, len);
>> -rc = -ENOMEM;
>> -goto out;
>> -}
>> -buf->rb_pool = p;   /* for freeing it later */
>> -
>> -buf->rb_send_bufs = (struct rpcrdma_req **) p;
>> -p = (char *) &buf->rb_send_bufs[buf->rb_max_requests];
>> -buf->rb_recv_bufs = (struct rpcrdma_rep **) p;
>> -p = (char *) &buf->rb_recv_bufs[buf->rb_max_requests];
>> -
>>  rc = ia->ri_ops->ro_init(r_xprt);
>>  if (rc)
>>  goto out;
>> 
>> +INIT_LIST_HEAD(&buf->rb_send_bufs);
>>  for (i = 0; i < buf->rb_max_requests; i++) {
>>  struct rpcrdma_req *req;
>> -struct rpcrdma_rep *rep;
>> 
>>  req = rpcrdma_create_req(r_xprt);
>>  if (IS_ERR(req)) {
>> @@ -1008,7 +982,12 @@ rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt)
>>  rc = PTR_ERR(req);
>>  goto out;
>>  }
>> -buf->rb_send_bufs[i] = req;
>> +list_add(&req->rl_free, &buf->rb_send_bufs);
>> +}
>> +
>> +INIT_LIST_HEAD(&buf->rb_recv_bufs);
>> +for (i = 0; i < buf->rb_max_requests + 2; i++) {
>> +struct rpcrdma_rep *rep;
>> 
>>  rep = rpcrdma_create_rep(r_xprt);
>>  if (IS_ERR(rep)) {
>> @@ -1017,7 +996,7 @@ rpcrdma_buffer_create(struct rpcrdma_xprt *r_xprt)
>>  rc = PTR_ERR(rep);
>>  goto out;
>>  }
>> -buf->rb_recv_bufs[i] = rep;
>> +list_add(&rep->rr_list, &buf->rb_recv_bufs);
>>  }
>> 
>>  return 0;
>> @@ -1051,25 +1030,26 @@ void
>>  rpcrdma_buffer_destroy(struct rpcrdma_buffer *buf)
>>  {
>>  struct rpcrdma_ia *ia = rdmab_to_ia(buf);
>> -int i;
>> 
>> -/* clean up in reverse order from create
>> - *   1.  recv mr memory (mr free, then kfree)

Re: [PATCH v1 08/18] xprtrdma: Pre-allocate Work Requests for backchannel

2015-09-21 Thread Chuck Lever

> On Sep 21, 2015, at 3:33 AM, Devesh Sharma  
> wrote:
> 
> On Fri, Sep 18, 2015 at 2:15 AM, Chuck Lever  wrote:
>> Pre-allocate extra send and receive Work Requests needed to handle
>> backchannel receives and sends.
>> 
>> The transport doesn't know how many extra WRs to pre-allocate until
>> the xprt_setup_backchannel() call, but that's long after the WRs are
>> allocated during forechannel setup.
>> 
>> So, use a fixed value for now.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>> net/sunrpc/xprtrdma/backchannel.c |4 
>> net/sunrpc/xprtrdma/verbs.c   |   14 --
>> net/sunrpc/xprtrdma/xprt_rdma.h   |   10 ++
>> 3 files changed, 26 insertions(+), 2 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/backchannel.c 
>> b/net/sunrpc/xprtrdma/backchannel.c
>> index c0a42ad..f5c7122 100644
>> --- a/net/sunrpc/xprtrdma/backchannel.c
>> +++ b/net/sunrpc/xprtrdma/backchannel.c
>> @@ -123,6 +123,9 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned 
>> int reqs)
>> * Twice as many rpc_rqsts are prepared to ensure there is
>> * always an rpc_rqst available as soon as a reply is sent.
>> */
>> +   if (reqs > RPCRDMA_BACKWARD_WRS >> 1)
>> +   goto out_err;
>> +
>>for (i = 0; i < (reqs << 1); i++) {
>>rqst = kzalloc(sizeof(*rqst), GFP_KERNEL);
>>if (!rqst) {
>> @@ -159,6 +162,7 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned 
>> int reqs)
>> out_free:
>>xprt_rdma_bc_destroy(xprt, reqs);
>> 
>> +out_err:
>>pr_err("RPC:   %s: setup backchannel transport failed\n", 
>> __func__);
>>return -ENOMEM;
>> }
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index 1e4a948..133c720 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -614,6 +614,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct 
>> rpcrdma_ia *ia,
>>struct ib_device_attr *devattr = &ia->ri_devattr;
>>struct ib_cq *sendcq, *recvcq;
>>struct ib_cq_init_attr cq_attr = {};
>> +   unsigned int max_qp_wr;
>>int rc, err;
>> 
>>if (devattr->max_sge < RPCRDMA_MAX_IOVS) {
>> @@ -622,18 +623,27 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct 
>> rpcrdma_ia *ia,
>>return -ENOMEM;
>>}
>> 
>> +   if (devattr->max_qp_wr <= RPCRDMA_BACKWARD_WRS) {
>> +   dprintk("RPC:   %s: insufficient wqe's available\n",
>> +   __func__);
>> +   return -ENOMEM;
>> +   }
>> +   max_qp_wr = devattr->max_qp_wr - RPCRDMA_BACKWARD_WRS;
>> +
>>/* check provider's send/recv wr limits */
>> -   if (cdata->max_requests > devattr->max_qp_wr)
>> -   cdata->max_requests = devattr->max_qp_wr;
>> +   if (cdata->max_requests > max_qp_wr)
>> +   cdata->max_requests = max_qp_wr;
> 
> should we
> cdata->max_request = max_qp_wr - RPCRDMA_BACKWARD_WRS?

cdata->max_requests is an input parameter to rpcrdma_ep_create().
We can’t simply overwrite it here with a new larger value.


>>ep->rep_attr.event_handler = rpcrdma_qp_async_error_upcall;
>>ep->rep_attr.qp_context = ep;
>>ep->rep_attr.srq = NULL;
>>ep->rep_attr.cap.max_send_wr = cdata->max_requests;
>> +   ep->rep_attr.cap.max_send_wr += RPCRDMA_BACKWARD_WRS;
> 
> Looks like this will cause a qp-create failure if any hypothetical
> device has devattr->max_qp_wr == cdata->max_requests

We’ve already capped cdata->max_requests at
“devattr->max_qp_wr - RPCRDMA_BACKWARD_WRS” above. So, the logic
should prevent that, unless I’ve made a mistake.


>>rc = ia->ri_ops->ro_open(ia, ep, cdata);
>>if (rc)
>>return rc;
>>ep->rep_attr.cap.max_recv_wr = cdata->max_requests;
>> +   ep->rep_attr.cap.max_recv_wr += RPCRDMA_BACKWARD_WRS;
>>ep->rep_attr.cap.max_send_sge = RPCRDMA_MAX_IOVS;
>>ep->rep_attr.cap.max_recv_sge = 1;
>>ep->rep_attr.cap.max_inline_data = 0;
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h 
>> b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index 2ca0567..37d0d7f 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -101,6 +101,16 @@ struct rpcrdma_ep {
>>  */
>> #define RPCRDMA_IGNORE_COMPLETION  (0ULL)
>> 
>> +/* Pre-allocate extra Work Requests for handling backward receives
>> + * and sends. This is a fixed value because the Work Queues are
>> + * allocated when the forward channel is set up.
>> + */
>> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
>> +#define RPCRDMA_BACKWARD_WRS   (8)
>> +#else
>> +#define RPCRDMA_BACKWARD_WRS   (0)
>> +#endif
>> +
>> /* Registered buffer -- registered kmalloc'd memory for RDMA SEND/RECV
>>  *
>>  * The below structure appears at the front of a large region of kmalloc'd
>> 

Re: [PATCH v1 15/18] svcrdma: Add svc_rdma_get_context() API that is allowed to fail

2015-09-21 Thread Chuck Lever

> On Sep 20, 2015, at 5:40 AM, Sagi Grimberg  wrote:
> 
> On 9/17/2015 11:46 PM, Chuck Lever wrote:
>> To support backward direction calls, I'm going to add an
>> svc_rdma_get_context() call in the client RDMA transport.
>> 
>> Called from ->buf_alloc(), we can't sleep waiting for memory.
>> So add an API that can get a server op_ctxt but won't sleep.
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>>  include/linux/sunrpc/svc_rdma.h  |2 ++
>>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   28 +++-
>>  2 files changed, 25 insertions(+), 5 deletions(-)
>> 
>> diff --git a/include/linux/sunrpc/svc_rdma.h 
>> b/include/linux/sunrpc/svc_rdma.h
>> index 6ce7495..2500dd1 100644
>> --- a/include/linux/sunrpc/svc_rdma.h
>> +++ b/include/linux/sunrpc/svc_rdma.h
>> @@ -224,6 +224,8 @@ extern void svc_rdma_send_error(struct svcxprt_rdma *, 
>> struct rpcrdma_msg *,
>>  extern int svc_rdma_post_recv(struct svcxprt_rdma *);
>>  extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr 
>> *);
>>  extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
>> +extern struct svc_rdma_op_ctxt *svc_rdma_get_context_gfp(struct 
>> svcxprt_rdma *,
>> + gfp_t);
>>  extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
>>  extern void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt);
>>  extern struct svc_rdma_req_map *svc_rdma_get_req_map(void);
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
>> b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> index 23aba30..c4083a3 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> @@ -153,17 +153,35 @@ static void svc_rdma_bc_free(struct svc_xprt *xprt)
>>  }
>>  #endif  /* CONFIG_SUNRPC_BACKCHANNEL */
>> 
>> -struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
>> +static void svc_rdma_init_context(struct svcxprt_rdma *xprt,
>> +  struct svc_rdma_op_ctxt *ctxt)
>>  {
>> -struct svc_rdma_op_ctxt *ctxt;
>> -
>> -ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep,
>> -GFP_KERNEL | __GFP_NOFAIL);
>>  ctxt->xprt = xprt;
>>  INIT_LIST_HEAD(&ctxt->dto_q);
>>  ctxt->count = 0;
>>  ctxt->frmr = NULL;
>>  atomic_inc(&xprt->sc_ctxt_used);
>> +}
>> +
>> +struct svc_rdma_op_ctxt *svc_rdma_get_context_gfp(struct svcxprt_rdma *xprt,
>> +  gfp_t flags)
>> +{
>> +struct svc_rdma_op_ctxt *ctxt;
>> +
>> +ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep, flags);
>> +if (!ctxt)
>> +return NULL;
>> +svc_rdma_init_context(xprt, ctxt);
>> +return ctxt;
>> +}
>> +
>> +struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
>> +{
> 
> Why not:
>   return svc_rdma_get_context_gfp(xprt, GFP_KERNEL | __GFP_NOFAIL);

The “if (!ctxt) return NULL;” is unneeded if __GFP_NOFAIL is
specified.

I’ll wait for additional comments on this one; I could go
either way.


> ?
> 
>> +struct svc_rdma_op_ctxt *ctxt;
>> +
>> +ctxt = kmem_cache_alloc(svc_rdma_ctxt_cachep,
>> +GFP_KERNEL | __GFP_NOFAIL);
>> +svc_rdma_init_context(xprt, ctxt);
>>  return ctxt;
>>  }
>> 
>> 

—
Chuck Lever





Re: [patch] IB/hfi1: checking for NULL instead of IS_ERR

2015-09-21 Thread Greg Kroah-Hartman
On Mon, Sep 21, 2015 at 01:03:52PM -0400, Doug Ledford wrote:
> On 09/21/2015 12:48 PM, Greg Kroah-Hartman wrote:
> 
> > But that's already not happening, as is obvious from my inbox.
> 
> Yeah, I see that.
> 
> > So, how about you forward on what you have so far to me, and I'll keep
> > these.  Otherwise you will end up with nasty merge issues very quickly
> > as people will continue to send stuff to me.
> 
> As you wish.  I pulled in a number of patches related to hfi1, but I've
> already sent the pull request to Linus and they made it as of 4.1-rc2.
> So, just start from there and you will have all the patches I've taken.

Sounds good, thanks.

greg k-h


Re: [PATCH 3/3] IB/mlx4: Report checksum offload cap when query device

2015-09-21 Thread Doug Ledford
On 09/21/2015 05:41 PM, Or Gerlitz wrote:
> On Fri, Sep 18, 2015 at 5:46 AM, Doug Ledford  wrote:
>> On 09/16/2015 11:56 AM, Bodong Wang wrote:
>>> Signed-off-by: Bodong Wang 
>>> ---
>>>  drivers/infiniband/hw/mlx4/main.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/infiniband/hw/mlx4/main.c 
>>> b/drivers/infiniband/hw/mlx4/main.c
>>> index 8be6db8..a70ca6a 100644
>>> --- a/drivers/infiniband/hw/mlx4/main.c
>>> +++ b/drivers/infiniband/hw/mlx4/main.c
>>> @@ -217,6 +217,9 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
>>>   props->device_cap_flags |= IB_DEVICE_MANAGED_FLOW_STEERING;
>>>   }
>>>
>>> + props->csum_cap.eth_csum_cap |= IB_CSUM_SUPPORT_RAW;
>>> + props->csum_cap.ib_csum_cap |= IB_CSUM_SUPPORT_UD;
>>> +
> 
>> This patch highlights something I didn't think about on the previous
>> patch.  Why separate eth/ib if you have per QP flags?  The QP denotes
>> the ib/eth relationship without the need to separate it into two
>> different caps.  In other words, you can never have an IB qp type on eth
>> because the only eth QP types we support other than RAW are all RDMA and
>> not IP.  Really, there's enough spare bits in ib_device_cap_flags that
>> you could do away with the new caps entirely.  Right now, we support UD
>> (which we already have a flag for), we can add two flags (for RAW and
>> RC) and that should cover all of the foreseeable options as that would
>> allow us to extend IP CSUM support to cover connected mode and cover all
>> of the current options.  I don't see us doing IP traffic in any other
>> situation, so I think that should suffice.  Bits 25 and 26 could be used
>> for the two new bits.  Then you just need to extend the bits to user space.
> 
> Doug,
> 
> The vendor may support the offload for a certain QP type only over a
> certain link type.

Exactly my point.

> E.g., mlx4 supports checksum for UD QPs only over IB but
> not over Eth,

As it should be.  We are, after all, talking about IP embedded in UD
RDMA.  Over IB that makes sense.  Over Eth it makes no sense.  If you
are going to do IP on Eth, then just do IP, don't do IP in UD.  I see no
reason to support this construct.

>  checksum for RC QPs isn't supported,

But could be for IB.  The fact that it isn't is why there is an ongoing
effort to work around this issue.

> and RAW_PACKET QPs
> are available anyway only for Eth links.

Correct.  And this will never be available on IB.

> But if this is what you think
> needs to be done, I guess we can do that.

Here's the only matrix of IP checksumming that makes sense:

1) UD over IB (because it is one of the supported IPoIB types)
2) RC over IB (same as above)
3) Raw ETH over Eth (because IP over Eth makes sense and is a common
type of packet to send on Raw Eth, but Raw Eth will never be sent on IB
as it isn't supported there at all)

Anything else would require adding more Raw ETH QPs elsewhere, or
expanding the IPoIB spec to include more connection types.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD






Re: [PATCH 3/3] IB/mlx4: Report checksum offload cap when query device

2015-09-21 Thread Or Gerlitz
On Fri, Sep 18, 2015 at 5:46 AM, Doug Ledford  wrote:
> On 09/16/2015 11:56 AM, Bodong Wang wrote:
>> Signed-off-by: Bodong Wang 
>> ---
>>  drivers/infiniband/hw/mlx4/main.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/infiniband/hw/mlx4/main.c 
>> b/drivers/infiniband/hw/mlx4/main.c
>> index 8be6db8..a70ca6a 100644
>> --- a/drivers/infiniband/hw/mlx4/main.c
>> +++ b/drivers/infiniband/hw/mlx4/main.c
>> @@ -217,6 +217,9 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
>>   props->device_cap_flags |= IB_DEVICE_MANAGED_FLOW_STEERING;
>>   }
>>
>> + props->csum_cap.eth_csum_cap |= IB_CSUM_SUPPORT_RAW;
>> + props->csum_cap.ib_csum_cap |= IB_CSUM_SUPPORT_UD;
>> +

> This patch highlights something I didn't think about on the previous
> patch.  Why separate eth/ib if you have per QP flags?  The QP denotes
> the ib/eth relationship without the need to separate it into two
> different caps.  In other words, you can never have an IB qp type on eth
> because the only eth QP types we support other than RAW are all RDMA and
> not IP.  Really, there's enough spare bits in ib_device_cap_flags that
> you could do away with the new caps entirely.  Right now, we support UD
> (which we already have a flag for), we can add two flags (for RAW and
> RC) and that should cover all of the foreseeable options as that would
> allow us to extend IP CSUM support to cover connected mode and cover all
> of the current options.  I don't see us doing IP traffic in any other
> situation, so I think that should suffice.  Bits 25 and 26 could be used
> for the two new bits.  Then you just need to extend the bits to user space.

Doug,

The vendor may support the offload for a certain QP type only over a
certain link type. E.g., mlx4 supports checksum for UD QPs only over IB
but not over Eth, checksum for RC QPs isn't supported, and RAW_PACKET
QPs are available only for Eth links anyway. But if this is what you
think needs to be done, I guess we can do that.

Or.


merge struct ib_device_attr into struct ib_device

2015-09-21 Thread Christoph Hellwig
This patch gets rid of struct ib_device_attr and cleans up drivers nicely.

It goes on top of my send_wr cleanups and the memory registration
updates from Sagi.



Re: [PATCH rdma-rc v1 0/4] Remove mlx5 support for IB_DEVICE_LOCAL_DMA_LKEY

2015-09-21 Thread Sagi Grimberg

On 9/21/2015 11:03 PM, Or Gerlitz wrote:

On Mon, Sep 21, 2015 at 11:03 PM, Or Gerlitz  wrote:

On Mon, Sep 21, 2015 at 7:41 PM, Sagi Grimberg  wrote:

Changes from v0:
- Replaced the xprtrdma patch with Chuck's one
- Fixed typo in iser modparam description


The change log of patch #3 isn't phrased clearly. This is quick
feedback I provided you for v0 and you didn't apply here; please do.


Again, the change log of patch #3 isn't phrased clearly. This is quick
feedback I provided you for v0 and you didn't apply here; please do.


You're right. Sorry.

I'll send out v2 tomorrow morning.


Re: [PATCH rdma-rc v1 1/4] xprtrdma: Replace global lkey with lkey local to PD

2015-09-21 Thread Sagi Grimberg

On 9/21/2015 8:03 PM, Chuck Lever wrote:



On Sep 21, 2015, at 9:41 AM, Sagi Grimberg  wrote:

From: Chuck Lever 

The core API has changed so that devices that do not have a global
DMA lkey automatically create an mr, per-PD, and make that lkey
available. The global DMA lkey interface is going away in favor of
the per-PD DMA lkey.

The per-PD DMA lkey is always available. Convert xprtrdma to use the
device's per-PD DMA lkey for regbufs, no matter which memory
registration scheme is in use.

Signed-off-by: Chuck Lever 
Signed-off-by: Sagi Grimberg 


You’ll probably need an Acked-by from Anna or Trond.


Yeah, I have a v2, as I forgot to address a comment from Or.

I'll CC linux-nfs on this patch. Sorry for forgetting.

Thanks Chuck.


Re: [PATCH rdma-rc v1 0/4] Remove mlx5 support for IB_DEVICE_LOCAL_DMA_LKEY

2015-09-21 Thread Or Gerlitz
On Mon, Sep 21, 2015 at 11:03 PM, Or Gerlitz  wrote:
> On Mon, Sep 21, 2015 at 7:41 PM, Sagi Grimberg  wrote:
>> Changes from v0:
>> - Replaced the xprtrdma patch with Chuck's one
>> - Fixed typo in iser modparam description
>
> The change log of patch #3 isn't phrased clearly. This is quick
> feedback I provided you for v0 and you didn't apply here; please do.

Again, the change log of patch #3 isn't phrased clearly. This is quick
feedback I provided you for v0 and you didn't apply here; please do.


Re: [PATCH rdma-rc v1 0/4] Remove mlx5 support for IB_DEVICE_LOCAL_DMA_LKEY

2015-09-21 Thread Or Gerlitz
On Mon, Sep 21, 2015 at 7:41 PM, Sagi Grimberg  wrote:
> Changes from v0:
> - Replaced the xprtrdma patch with Chuck's one
> - Fixed typo in iser modparam description

The change log of patch #3 isn't phrased clearly. This is quick
feedback I provided you for v0 and you didn't apply here; please do.


[PATCH 3/3] xprtrdma: don't log warnings for flushed completions

2015-09-21 Thread Steve Wise
Unsignaled send WRs can get flushed as part of normal unmount, so don't
log them as warnings.

Signed-off-by: Steve Wise 
---

 net/sunrpc/xprtrdma/frwr_ops.c |7 +--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index d6653f5..f1868ba 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -257,8 +257,11 @@ frwr_sendcompletion(struct ib_wc *wc)
 
/* WARNING: Only wr_id and status are reliable at this point */
r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
-   pr_warn("RPC:   %s: frmr %p flushed, status %s (%d)\n",
-   __func__, r, ib_wc_status_msg(wc->status), wc->status);
+   if (wc->status == IB_WC_WR_FLUSH_ERR)
+   dprintk("RPC:   %s: frmr %p flushed\n", __func__, r);
+   else
+   pr_warn("RPC:   %s: frmr %p error, status %s (%d)\n",
+   __func__, r, ib_wc_status_msg(wc->status), wc->status);
r->r.frmr.fr_state = FRMR_IS_STALE;
 }
 



[PATCH 1/3] xprtrdma: disconnect and flush cqs before freeing buffers

2015-09-21 Thread Steve Wise
Otherwise a FRMR completion can cause a touch-after-free crash.

In xprt_rdma_destroy(), call rpcrdma_buffer_destroy() only after calling
rpcrdma_ep_destroy().

In rpcrdma_ep_destroy(), disconnect the cm_id first which should flush the
qp, then drain the cqs, then destroy the qp, and finally destroy the cqs.

Signed-off-by: Steve Wise 
Tested-by: Chuck Lever 
---

 net/sunrpc/xprtrdma/transport.c |2 +-
 net/sunrpc/xprtrdma/verbs.c |9 ++---
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 64443eb..41e452b 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -270,8 +270,8 @@ xprt_rdma_destroy(struct rpc_xprt *xprt)
 
xprt_clear_connected(xprt);
 
-   rpcrdma_buffer_destroy(&r_xprt->rx_buf);
rpcrdma_ep_destroy(&r_xprt->rx_ep, &r_xprt->rx_ia);
+   rpcrdma_buffer_destroy(&r_xprt->rx_buf);
rpcrdma_ia_close(&r_xprt->rx_ia);
 
xprt_rdma_free_addresses(xprt);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 6829967..01a314a 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -755,19 +755,22 @@ rpcrdma_ep_destroy(struct rpcrdma_ep *ep, struct 
rpcrdma_ia *ia)
 
cancel_delayed_work_sync(&ep->rep_connect_worker);
 
-   if (ia->ri_id->qp) {
+   if (ia->ri_id->qp)
rpcrdma_ep_disconnect(ep, ia);
+
+   rpcrdma_clean_cq(ep->rep_attr.recv_cq);
+   rpcrdma_clean_cq(ep->rep_attr.send_cq);
+
+   if (ia->ri_id->qp) {
rdma_destroy_qp(ia->ri_id);
ia->ri_id->qp = NULL;
}
 
-   rpcrdma_clean_cq(ep->rep_attr.recv_cq);
rc = ib_destroy_cq(ep->rep_attr.recv_cq);
if (rc)
dprintk("RPC:   %s: ib_destroy_cq returned %i\n",
__func__, rc);
 
-   rpcrdma_clean_cq(ep->rep_attr.send_cq);
rc = ib_destroy_cq(ep->rep_attr.send_cq);
if (rc)
dprintk("RPC:   %s: ib_destroy_cq returned %i\n",



[PATCH 2/3] svcrdma: handle rdma read with a non-zero initial page offset

2015-09-21 Thread Steve Wise
The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions
were not taking into account the initial page_offset when determining
the RDMA read length.  This resulted in a read whose starting address
and length exceeded the base/bounds of the FRMR.

Most workloads apparently don't tickle this bug, but one test hit it
every time: building the Linux kernel on a 16-core node with 'make -j
16 O=/mnt/0', where /mnt/0 is a ramdisk mounted via NFSRDMA.

This bug seems to be tripped only with devices having small fastreg
page list depths.  I didn't see it with mlx4, for instance.

Fixes: 0bf4828983df ('svcrdma: refactor marshalling logic')
Signed-off-by: Steve Wise 
Tested-by: Chuck Lever 
---

 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index cb51742..5f6ca47 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -136,7 +136,8 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
ctxt->direction = DMA_FROM_DEVICE;
ctxt->read_hdr = head;
pages_needed = min_t(int, pages_needed, xprt->sc_max_sge_rd);
-   read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
+   read = min_t(int, (pages_needed << PAGE_SHIFT) - *page_offset,
+rs_length);
 
for (pno = 0; pno < pages_needed; pno++) {
int len = min_t(int, rs_length, PAGE_SIZE - pg_off);
@@ -235,7 +236,8 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
ctxt->direction = DMA_FROM_DEVICE;
ctxt->frmr = frmr;
pages_needed = min_t(int, pages_needed, xprt->sc_frmr_pg_list_len);
-   read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
+   read = min_t(int, (pages_needed << PAGE_SHIFT) - *page_offset,
+rs_length);
 
frmr->kva = page_address(rqstp->rq_arg.pages[pg_no]);
frmr->direction = DMA_FROM_DEVICE;



Re: [PATCH rdma-rc v1 1/4] xprtrdma: Replace global lkey with lkey local to PD

2015-09-21 Thread Chuck Lever

> On Sep 21, 2015, at 9:41 AM, Sagi Grimberg  wrote:
> 
> From: Chuck Lever 
> 
> The core API has changed so that devices that do not have a global
> DMA lkey automatically create an mr, per-PD, and make that lkey
> available. The global DMA lkey interface is going away in favor of
> the per-PD DMA lkey.
> 
> The per-PD DMA lkey is always available. Convert xprtrdma to use the
> device's per-PD DMA lkey for regbufs, no matter which memory
> registration scheme is in use.
> 
> Signed-off-by: Chuck Lever 
> Signed-off-by: Sagi Grimberg 

You’ll probably need an Acked-by from Anna or Trond.


> ---
> net/sunrpc/xprtrdma/fmr_ops.c  | 19 ---
> net/sunrpc/xprtrdma/frwr_ops.c |  5 -
> net/sunrpc/xprtrdma/physical_ops.c | 10 +-
> net/sunrpc/xprtrdma/verbs.c|  2 +-
> net/sunrpc/xprtrdma/xprt_rdma.h|  1 -
> 5 files changed, 2 insertions(+), 35 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
> index cb25c89da623..f1e8dafbd507 100644
> --- a/net/sunrpc/xprtrdma/fmr_ops.c
> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
> @@ -39,25 +39,6 @@ static int
> fmr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
>   struct rpcrdma_create_data_internal *cdata)
> {
> - struct ib_device_attr *devattr = &ia->ri_devattr;
> - struct ib_mr *mr;
> -
> - /* Obtain an lkey to use for the regbufs, which are
> -  * protected from remote access.
> -  */
> - if (devattr->device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY) {
> - ia->ri_dma_lkey = ia->ri_device->local_dma_lkey;
> - } else {
> - mr = ib_get_dma_mr(ia->ri_pd, IB_ACCESS_LOCAL_WRITE);
> - if (IS_ERR(mr)) {
> - pr_err("%s: ib_get_dma_mr for failed with %lX\n",
> -__func__, PTR_ERR(mr));
> - return -ENOMEM;
> - }
> - ia->ri_dma_lkey = ia->ri_dma_mr->lkey;
> - ia->ri_dma_mr = mr;
> - }
> -
>   return 0;
> }
> 
> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
> index d6653f5d0830..5318951b3b53 100644
> --- a/net/sunrpc/xprtrdma/frwr_ops.c
> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
> @@ -189,11 +189,6 @@ frwr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep 
> *ep,
>   struct ib_device_attr *devattr = &ia->ri_devattr;
>   int depth, delta;
> 
> - /* Obtain an lkey to use for the regbufs, which are
> -  * protected from remote access.
> -  */
> - ia->ri_dma_lkey = ia->ri_device->local_dma_lkey;
> -
>   ia->ri_max_frmr_depth =
>   min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
> devattr->max_fast_reg_page_list_len);
> diff --git a/net/sunrpc/xprtrdma/physical_ops.c 
> b/net/sunrpc/xprtrdma/physical_ops.c
> index 72cf8b15bbb4..617b76f22154 100644
> --- a/net/sunrpc/xprtrdma/physical_ops.c
> +++ b/net/sunrpc/xprtrdma/physical_ops.c
> @@ -23,7 +23,6 @@ static int
> physical_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
>struct rpcrdma_create_data_internal *cdata)
> {
> - struct ib_device_attr *devattr = &ia->ri_devattr;
>   struct ib_mr *mr;
> 
>   /* Obtain an rkey to use for RPC data payloads.
> @@ -37,15 +36,8 @@ physical_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep 
> *ep,
>  __func__, PTR_ERR(mr));
>   return -ENOMEM;
>   }
> - ia->ri_dma_mr = mr;
> -
> - /* Obtain an lkey to use for regbufs.
> -  */
> - if (devattr->device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
> - ia->ri_dma_lkey = ia->ri_device->local_dma_lkey;
> - else
> - ia->ri_dma_lkey = ia->ri_dma_mr->lkey;
> 
> + ia->ri_dma_mr = mr;
>   return 0;
> }
> 
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 682996779970..eb081ad05e33 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -1252,7 +1252,7 @@ rpcrdma_alloc_regbuf(struct rpcrdma_ia *ia, size_t 
> size, gfp_t flags)
>   goto out_free;
> 
>   iov->length = size;
> - iov->lkey = ia->ri_dma_lkey;
> + iov->lkey = ia->ri_pd->local_dma_lkey;
>   rb->rg_size = size;
>   rb->rg_owner = NULL;
>   return rb;
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index d252457ff21a..df5ad4e15702 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -66,7 +66,6 @@ struct rpcrdma_ia {
>   struct rdma_cm_id   *ri_id;
>   struct ib_pd*ri_pd;
>   struct ib_mr*ri_dma_mr;
> - u32 ri_dma_lkey;
>   struct completion   ri_done;
>   int ri_async_rc;
>   unsigned intri_max_frmr_depth;
> -- 
> 1.8.4.3
> 

—
Chuck Lever




Re: [patch] IB/hfi1: checking for NULL instead of IS_ERR

2015-09-21 Thread Doug Ledford
On 09/21/2015 12:48 PM, Greg Kroah-Hartman wrote:

> But, that's already not happening, as is obvious by my inbox.

Yeah, I see that.

> So, how about you forward on what you have so far to me, and I'll keep
> these.  Otherwise you will end up with nasty merge issues very quickly
> as people will continue to send stuff to me.

As you wish.  I pulled in a number of patches related to hfi1, but I've
already sent the pull request to Linus and they made it as of 4.3-rc2.
So, just start from there and you will have all the patches I've taken.


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD






Re: [patch] IB/hfi1: checking for NULL instead of IS_ERR

2015-09-21 Thread Greg Kroah-Hartman
On Mon, Sep 21, 2015 at 11:42:28AM -0400, Doug Ledford wrote:
> On 09/18/2015 11:01 PM, Greg Kroah-Hartman wrote:
> > On Fri, Sep 18, 2015 at 11:51:09AM -0400, Doug Ledford wrote:
> >> On 09/16/2015 02:22 AM, Dan Carpenter wrote:
> >>> __get_txreq() returns an ERR_PTR() but this checks for NULL so it would
> >>> oops on failure.
> >>>
> >>> Signed-off-by: Dan Carpenter 
> >>
> >> Thanks, applied.
> > 
> > Applied to what?  Should I just ignore these types of patches and not
> > take them in my tree and you will send them on later on?  I don't
> > remember what we agreed to do, sorry.
> 
> My understanding was that I would handle everything in the staging/rdma
> area.  To that end, I tried to make it explicit so that people would
> know that via the following things:
> 
> From MAINTAINERS:
> 
> INFINIBAND SUBSYSTEM
> M:  Doug Ledford 
> M:  Sean Hefty 
> M:  Hal Rosenstock 
> L:  linux-rdma@vger.kernel.org
> W:  http://www.openfabrics.org/
> Q:  http://patchwork.kernel.org/project/linux-rdma/list/
> T:  git git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git
> S:  Supported
> F:  Documentation/infiniband/
> F:  drivers/infiniband/
> F:  drivers/staging/rdma/
> F:  include/uapi/linux/if_infiniband.h
> F:  include/uapi/rdma/
> F:  include/rdma/
> 
> 
> And from drivers/staging/rdma/Kconfig:
> 
> menuconfig STAGING_RDMA
> bool "RDMA staging drivers"
>   depends on INFINIBAND
>   depends on PCI || BROKEN
>   depends on HAS_IOMEM
>   depends on NET
>   depends on INET
> default n
> ---help---
>   This option allows you to select a number of RDMA drivers that
> fall into one of two categories: deprecated drivers being held
> here before finally being removed or new drivers that still need
> some work before being moved to the normal RDMA driver area.
> 
>   If you wish to work on these drivers, to help improve them, or
>   to report problems you have with them, please use the
> linux-rdma@vger.kernel.org mailing list.
> 
>   If in doubt, say N here.
> 
> I was hoping those two items would be sufficient to keep people from
> flooding devel@ and yourself personally with fixups for these items and
> instead they would send them through linux-rdma@.

But, that's already not happening, as is obvious by my inbox.

So, how about you forward on what you have so far to me, and I'll keep
these.  Otherwise you will end up with nasty merge issues very quickly
as people will continue to send stuff to me.

thanks,

greg k-h


Re: [PATCH rdma-rc 2/4] IB/iser: Add module parameter for always register memory

2015-09-21 Thread Sagi Grimberg

On 9/21/2015 7:39 PM, Jason Gunthorpe wrote:

On Sun, Sep 20, 2015 at 12:52:45PM +0300, Sagi Grimberg wrote:

This module parameter forces memory registration even for
a continuous memory region. It is true by default as sending
an all-physical rkey with remote permissions might be insecure.

Signed-off-by: Sagi Grimberg 
  drivers/infiniband/ulp/iser/iscsi_iser.c  |  5 +
  drivers/infiniband/ulp/iser/iscsi_iser.h  |  1 +
  drivers/infiniband/ulp/iser/iser_memory.c | 18 --
  drivers/infiniband/ulp/iser/iser_verbs.c  | 21 +
  4 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c 
b/drivers/infiniband/ulp/iser/iscsi_iser.c
index 1ace5d83a4d7..bad9dd701d3c 100644
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -97,6 +97,11 @@ unsigned int iser_max_sectors = ISER_DEF_MAX_SECTORS;
  module_param_named(max_sectors, iser_max_sectors, uint, S_IRUGO | S_IWUSR);
  MODULE_PARM_DESC(max_sectors, "Max number of sectors in a single scsi command 
(default:1024");

+bool iser_always_reg = true;
+module_param_named(always_register, iser_always_reg, bool, S_IRUGO);
+MODULE_PARM_DESC(always_register,
+"Always register memory, even for continuous memory regions 
(default:false)");

  
^^^
Description doesn't match implementation?


Yea... I noticed that too...

v1 is already on the list.

Thanks for commenting.


[PATCH rdma-rc v1 2/4] IB/iser: Add module parameter for always register memory

2015-09-21 Thread Sagi Grimberg
This module parameter forces memory registration even for
a continuous memory region. It is true by default as sending
an all-physical rkey with remote permissions might be insecure.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/ulp/iser/iscsi_iser.c  |  5 +
 drivers/infiniband/ulp/iser/iscsi_iser.h  |  1 +
 drivers/infiniband/ulp/iser/iser_memory.c | 18 --
 drivers/infiniband/ulp/iser/iser_verbs.c  | 21 +
 4 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c 
b/drivers/infiniband/ulp/iser/iscsi_iser.c
index 1ace5d83a4d7..f58ff96b6cbb 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -97,6 +97,11 @@ unsigned int iser_max_sectors = ISER_DEF_MAX_SECTORS;
 module_param_named(max_sectors, iser_max_sectors, uint, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(max_sectors, "Max number of sectors in a single scsi command 
(default:1024");
 
+bool iser_always_reg = true;
+module_param_named(always_register, iser_always_reg, bool, S_IRUGO);
+MODULE_PARM_DESC(always_register,
+"Always register memory, even for continuous memory regions 
(default:true)");
+
 bool iser_pi_enable = false;
 module_param_named(pi_enable, iser_pi_enable, bool, S_IRUGO);
 MODULE_PARM_DESC(pi_enable, "Enable T10-PI offload support 
(default:disabled)");
diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h 
b/drivers/infiniband/ulp/iser/iscsi_iser.h
index 86f6583485ef..a5edd6ede692 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -611,6 +611,7 @@ extern int iser_debug_level;
 extern bool iser_pi_enable;
 extern int iser_pi_guard;
 extern unsigned int iser_max_sectors;
+extern bool iser_always_reg;
 
 int iser_assign_reg_ops(struct iser_device *device);
 
diff --git a/drivers/infiniband/ulp/iser/iser_memory.c 
b/drivers/infiniband/ulp/iser/iser_memory.c
index 2493cc748db8..4c46d67d37a1 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -803,11 +803,12 @@ static int
 iser_reg_prot_sg(struct iscsi_iser_task *task,
 struct iser_data_buf *mem,
 struct iser_fr_desc *desc,
+bool use_dma_key,
 struct iser_mem_reg *reg)
 {
struct iser_device *device = task->iser_conn->ib_conn.device;
 
-   if (mem->dma_nents == 1)
+   if (use_dma_key)
return iser_reg_dma(device, mem, reg);
 
return device->reg_ops->reg_mem(task, mem, &desc->pi_ctx->rsc, reg);
@@ -817,11 +818,12 @@ static int
 iser_reg_data_sg(struct iscsi_iser_task *task,
 struct iser_data_buf *mem,
 struct iser_fr_desc *desc,
+bool use_dma_key,
 struct iser_mem_reg *reg)
 {
struct iser_device *device = task->iser_conn->ib_conn.device;
 
-   if (mem->dma_nents == 1)
+   if (use_dma_key)
return iser_reg_dma(device, mem, reg);
 
return device->reg_ops->reg_mem(task, mem, &desc->rsc, reg);
@@ -836,14 +838,17 @@ int iser_reg_rdma_mem(struct iscsi_iser_task *task,
struct iser_mem_reg *reg = &task->rdma_reg[dir];
struct iser_mem_reg *data_reg;
struct iser_fr_desc *desc = NULL;
+   bool use_dma_key;
int err;
 
err = iser_handle_unaligned_buf(task, mem, dir);
if (unlikely(err))
return err;
 
-   if (mem->dma_nents != 1 ||
-   scsi_get_prot_op(task->sc) != SCSI_PROT_NORMAL) {
+   use_dma_key = (mem->dma_nents == 1 && !iser_always_reg &&
+  scsi_get_prot_op(task->sc) == SCSI_PROT_NORMAL);
+
+   if (!use_dma_key) {
desc = device->reg_ops->reg_desc_get(ib_conn);
reg->mem_h = desc;
}
@@ -853,7 +858,7 @@ int iser_reg_rdma_mem(struct iscsi_iser_task *task,
else
data_reg = &task->desc.data_reg;
 
-   err = iser_reg_data_sg(task, mem, desc, data_reg);
+   err = iser_reg_data_sg(task, mem, desc, use_dma_key, data_reg);
if (unlikely(err))
goto err_reg;
 
@@ -866,7 +871,8 @@ int iser_reg_rdma_mem(struct iscsi_iser_task *task,
if (unlikely(err))
goto err_reg;
 
-   err = iser_reg_prot_sg(task, mem, desc, prot_reg);
+   err = iser_reg_prot_sg(task, mem, desc,
+  use_dma_key, prot_reg);
if (unlikely(err))
goto err_reg;
}
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c 
b/drivers/infiniband/ulp/iser/iser_verbs.c
index ae70cc1463ac..85132d867bc8 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -133,11 +133,15 @@ static int iser_create_device_ib_res(struct iser_device 
*device)
   

[PATCH rdma-rc v1 1/4] xprtrdma: Replace global lkey with lkey local to PD

2015-09-21 Thread Sagi Grimberg
From: Chuck Lever 

The core API has changed so that devices that do not have a global
DMA lkey automatically create an mr, per-PD, and make that lkey
available. The global DMA lkey interface is going away in favor of
the per-PD DMA lkey.

The per-PD DMA lkey is always available. Convert xprtrdma to use the
device's per-PD DMA lkey for regbufs, no matter which memory
registration scheme is in use.

Signed-off-by: Chuck Lever 
Signed-off-by: Sagi Grimberg 
---
 net/sunrpc/xprtrdma/fmr_ops.c  | 19 ---
 net/sunrpc/xprtrdma/frwr_ops.c |  5 -
 net/sunrpc/xprtrdma/physical_ops.c | 10 +-
 net/sunrpc/xprtrdma/verbs.c|  2 +-
 net/sunrpc/xprtrdma/xprt_rdma.h|  1 -
 5 files changed, 2 insertions(+), 35 deletions(-)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index cb25c89da623..f1e8dafbd507 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -39,25 +39,6 @@ static int
 fmr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
struct rpcrdma_create_data_internal *cdata)
 {
-   struct ib_device_attr *devattr = &ia->ri_devattr;
-   struct ib_mr *mr;
-
-   /* Obtain an lkey to use for the regbufs, which are
-* protected from remote access.
-*/
-   if (devattr->device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY) {
-   ia->ri_dma_lkey = ia->ri_device->local_dma_lkey;
-   } else {
-   mr = ib_get_dma_mr(ia->ri_pd, IB_ACCESS_LOCAL_WRITE);
-   if (IS_ERR(mr)) {
-   pr_err("%s: ib_get_dma_mr for failed with %lX\n",
-  __func__, PTR_ERR(mr));
-   return -ENOMEM;
-   }
-   ia->ri_dma_lkey = ia->ri_dma_mr->lkey;
-   ia->ri_dma_mr = mr;
-   }
-
return 0;
 }
 
diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index d6653f5d0830..5318951b3b53 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -189,11 +189,6 @@ frwr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
struct ib_device_attr *devattr = &ia->ri_devattr;
int depth, delta;
 
-   /* Obtain an lkey to use for the regbufs, which are
-* protected from remote access.
-*/
-   ia->ri_dma_lkey = ia->ri_device->local_dma_lkey;
-
ia->ri_max_frmr_depth =
min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
  devattr->max_fast_reg_page_list_len);
diff --git a/net/sunrpc/xprtrdma/physical_ops.c 
b/net/sunrpc/xprtrdma/physical_ops.c
index 72cf8b15bbb4..617b76f22154 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -23,7 +23,6 @@ static int
 physical_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
 struct rpcrdma_create_data_internal *cdata)
 {
-   struct ib_device_attr *devattr = &ia->ri_devattr;
struct ib_mr *mr;
 
/* Obtain an rkey to use for RPC data payloads.
@@ -37,15 +36,8 @@ physical_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep 
*ep,
   __func__, PTR_ERR(mr));
return -ENOMEM;
}
-   ia->ri_dma_mr = mr;
-
-   /* Obtain an lkey to use for regbufs.
-*/
-   if (devattr->device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
-   ia->ri_dma_lkey = ia->ri_device->local_dma_lkey;
-   else
-   ia->ri_dma_lkey = ia->ri_dma_mr->lkey;
 
+   ia->ri_dma_mr = mr;
return 0;
 }
 
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 682996779970..eb081ad05e33 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1252,7 +1252,7 @@ rpcrdma_alloc_regbuf(struct rpcrdma_ia *ia, size_t size, 
gfp_t flags)
goto out_free;
 
iov->length = size;
-   iov->lkey = ia->ri_dma_lkey;
+   iov->lkey = ia->ri_pd->local_dma_lkey;
rb->rg_size = size;
rb->rg_owner = NULL;
return rb;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index d252457ff21a..df5ad4e15702 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -66,7 +66,6 @@ struct rpcrdma_ia {
struct rdma_cm_id   *ri_id;
struct ib_pd*ri_pd;
struct ib_mr*ri_dma_mr;
-   u32 ri_dma_lkey;
struct completion   ri_done;
int ri_async_rc;
unsigned intri_max_frmr_depth;
-- 
1.8.4.3



[PATCH rdma-rc v1 0/4] Remove mlx5 support for IB_DEVICE_LOCAL_DMA_LKEY

2015-09-21 Thread Sagi Grimberg
The Connect-IB device has a specific issue with memory registration using
the reserved lkey (device global_dma_lkey). This caused user-space memory
registration, which usually uses cached pre-registered memory keys, to fail
due to a device access error during registration. Kernel-space memory
registrations used an internal instance of a physical memory key allocated
with the private pd context, so this error didn't happen there.

Since the reserved lkey is not fully functional, remove the support for
it altogether. Jason's patch, commit 96249d70dd70 ("IB/core: Guarantee
that a local_dma_lkey is available"), addresses consumers by allocating a
physical lkey per PD in the core layer. It also allows the mlx5 driver to
get rid of its private physical lkey (patch #2).

The ConnectX-4 device should have this issue fixed so the capability will be
restored depending on a FW query information.

Also, fix the NFS client to use the PD local_dma_lkey instead of the device
local_dma_lkey (which requires a check of the device capability that was
missing in frwr mode). And fix the iser initiator, which encountered some
issues when registering a signature-capable memory region with an indirect
dma_lkey area. The fix covers broader ground as it no longer allows using a
global MR with remote access (a long-standing issue), but on the way makes
the specific registration issue go away.

Thanks to Haggai for catching this early enough.

Changes from v0:
- Replace xprtrdma patch to Chuck's one
- Fixed typo in iser modparam description

Chuck Lever (1):
  xprtrdma: Replace global lkey with lkey local to PD

Sagi Grimberg (3):
  IB/iser: Add module parameter for always register memory
  IB/mlx5: Remove support for IB_DEVICE_LOCAL_DMA_LKEY
  IB/mlx5: Remove pa_lkey usages

 drivers/infiniband/hw/mlx5/main.c| 67 +---
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  2 -
 drivers/infiniband/hw/mlx5/qp.c  |  4 +-
 drivers/infiniband/ulp/iser/iscsi_iser.c |  5 +++
 drivers/infiniband/ulp/iser/iscsi_iser.h |  1 +
 drivers/infiniband/ulp/iser/iser_memory.c| 18 +---
 drivers/infiniband/ulp/iser/iser_verbs.c | 21 +
 drivers/net/ethernet/mellanox/mlx5/core/fw.c | 22 -
 include/linux/mlx5/device.h  | 11 -
 include/linux/mlx5/driver.h  |  1 -
 net/sunrpc/xprtrdma/fmr_ops.c| 19 
 net/sunrpc/xprtrdma/frwr_ops.c   |  5 ---
 net/sunrpc/xprtrdma/physical_ops.c   | 10 +
 net/sunrpc/xprtrdma/verbs.c  |  2 +-
 net/sunrpc/xprtrdma/xprt_rdma.h  |  1 -
 15 files changed, 35 insertions(+), 154 deletions(-)

-- 
1.8.4.3



[PATCH rdma-rc v1 4/4] IB/mlx5: Remove pa_lkey usages

2015-09-21 Thread Sagi Grimberg
Since the mlx5 driver cannot rely on registration using the
reserved lkey (global_dma_lkey), it used to allocate a private
physical-address lkey for each allocated pd.
Commit 96249d70dd70 ("IB/core: Guarantee that a local_dma_lkey
is available") now does this in the core layer, so we can go ahead
and use that.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/mlx5/main.c| 57 
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  2 --
 drivers/infiniband/hw/mlx5/qp.c  |  4 +--
 3 files changed, 1 insertion(+), 62 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 0ab9625911a1..f1ccd40beae9 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -794,53 +794,6 @@ static int mlx5_ib_mmap(struct ib_ucontext *ibcontext, 
struct vm_area_struct *vm
return 0;
 }
 
-static int alloc_pa_mkey(struct mlx5_ib_dev *dev, u32 *key, u32 pdn)
-{
-   struct mlx5_create_mkey_mbox_in *in;
-   struct mlx5_mkey_seg *seg;
-   struct mlx5_core_mr mr;
-   int err;
-
-   in = kzalloc(sizeof(*in), GFP_KERNEL);
-   if (!in)
-   return -ENOMEM;
-
-   seg = &in->seg;
-   seg->flags = MLX5_PERM_LOCAL_READ | MLX5_ACCESS_MODE_PA;
-   seg->flags_pd = cpu_to_be32(pdn | MLX5_MKEY_LEN64);
-   seg->qpn_mkey7_0 = cpu_to_be32(0xff << 8);
-   seg->start_addr = 0;
-
-   err = mlx5_core_create_mkey(dev->mdev, &mr, in, sizeof(*in),
-   NULL, NULL, NULL);
-   if (err) {
-   mlx5_ib_warn(dev, "failed to create mkey, %d\n", err);
-   goto err_in;
-   }
-
-   kfree(in);
-   *key = mr.key;
-
-   return 0;
-
-err_in:
-   kfree(in);
-
-   return err;
-}
-
-static void free_pa_mkey(struct mlx5_ib_dev *dev, u32 key)
-{
-   struct mlx5_core_mr mr;
-   int err;
-
-   memset(&mr, 0, sizeof(mr));
-   mr.key = key;
-   err = mlx5_core_destroy_mkey(dev->mdev, &mr);
-   if (err)
-   mlx5_ib_warn(dev, "failed to destroy mkey 0x%x\n", key);
-}
-
 static struct ib_pd *mlx5_ib_alloc_pd(struct ib_device *ibdev,
  struct ib_ucontext *context,
  struct ib_udata *udata)
@@ -866,13 +819,6 @@ static struct ib_pd *mlx5_ib_alloc_pd(struct ib_device 
*ibdev,
kfree(pd);
return ERR_PTR(-EFAULT);
}
-   } else {
-   err = alloc_pa_mkey(to_mdev(ibdev), &pd->pa_lkey, pd->pdn);
-   if (err) {
-   mlx5_core_dealloc_pd(to_mdev(ibdev)->mdev, pd->pdn);
-   kfree(pd);
-   return ERR_PTR(err);
-   }
}
 
return &pd->ibpd;
@@ -883,9 +829,6 @@ static int mlx5_ib_dealloc_pd(struct ib_pd *pd)
struct mlx5_ib_dev *mdev = to_mdev(pd->device);
struct mlx5_ib_pd *mpd = to_mpd(pd);
 
-   if (!pd->uobject)
-   free_pa_mkey(mdev, mpd->pa_lkey);
-
mlx5_core_dealloc_pd(mdev->mdev, mpd->pdn);
kfree(mpd);
 
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index a5fa0b9c7580..b0a22fea76f1 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -103,7 +103,6 @@ static inline struct mlx5_ib_ucontext *to_mucontext(struct 
ib_ucontext *ibuconte
 struct mlx5_ib_pd {
struct ib_pdibpd;
u32 pdn;
-   u32 pa_lkey;
 };
 
 /* Use macros here so that don't have to duplicate
@@ -213,7 +212,6 @@ struct mlx5_ib_qp {
int uuarn;
 
int create_type;
-   u32 pa_lkey;
 
/* Store signature errors */
boolsignature_en;
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 203c8a45e095..45722a4fa99d 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -930,8 +930,6 @@ static int create_qp_common(struct mlx5_ib_dev *dev, struct 
ib_pd *pd,
err = create_kernel_qp(dev, init_attr, qp, &in, &inlen);
if (err)
mlx5_ib_dbg(dev, "err %d\n", err);
-   else
-   qp->pa_lkey = to_mpd(pd)->pa_lkey;
}
 
if (err)
@@ -2050,7 +2048,7 @@ static void set_frwr_pages(struct mlx5_wqe_data_seg *dseg,
mfrpl->mapped_page_list[i] = cpu_to_be64(page_list[i] | perm);
dseg->addr = cpu_to_be64(mfrpl->map);
dseg->byte_count = cpu_to_be32(ALIGN(sizeof(u64) * 
wr->wr.fast_reg.page_list_len, 64));
-   dseg->lkey = cpu_to_be32(pd->pa_lkey);
+   dseg->lkey = cpu_to_be32(pd->ibpd.local_dma_lkey);
 }
 
 static __be32 send_ieth(struct ib_send_wr *wr)
-

[PATCH rdma-rc v1 3/4] IB/mlx5: Remove support for IB_DEVICE_LOCAL_DMA_LKEY

2015-09-21 Thread Sagi Grimberg
ConnectIB has some known issues with memory registration
using the local_dma_lkey (SEND, RDMA and RECV seem to work OK).
Thus don't expose support for it (and remove the device->local_dma_lkey
setting).

Commit 96249d70dd70 ("IB/core: Guarantee that a local_dma_lkey
is available") addressed this by allocating a DMA MR with local permissions
and converting the consumers to use the MR associated with the PD rather
than device->local_dma_lkey.

The local_dma_lkey support will be restored in CX4 depending on a FW
capability query.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/mlx5/main.c| 10 +-
 drivers/net/ethernet/mellanox/mlx5/core/fw.c | 22 --
 include/linux/mlx5/device.h  | 11 ---
 include/linux/mlx5/driver.h  |  1 -
 4 files changed, 1 insertion(+), 43 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 41d6911e244e..0ab9625911a1 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -245,7 +245,6 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
props->device_cap_flags |= IB_DEVICE_BAD_QKEY_CNTR;
if (MLX5_CAP_GEN(mdev, apm))
props->device_cap_flags |= IB_DEVICE_AUTO_PATH_MIG;
-   props->device_cap_flags |= IB_DEVICE_LOCAL_DMA_LKEY;
if (MLX5_CAP_GEN(mdev, xrc))
props->device_cap_flags |= IB_DEVICE_XRC;
props->device_cap_flags |= IB_DEVICE_MEM_MGT_EXTENSIONS;
@@ -1245,18 +1244,10 @@ static int create_dev_resources(struct 
mlx5_ib_resources *devr)
struct ib_srq_init_attr attr;
struct mlx5_ib_dev *dev;
struct ib_cq_init_attr cq_attr = {.cqe = 1};
-   u32 rsvd_lkey;
int ret = 0;
 
dev = container_of(devr, struct mlx5_ib_dev, devr);
 
-   ret = mlx5_core_query_special_context(dev->mdev, &rsvd_lkey);
-   if (ret) {
-   pr_err("Failed to query special context %d\n", ret);
-   return ret;
-   }
-   dev->ib_dev.local_dma_lkey = rsvd_lkey;
-
devr->p0 = mlx5_ib_alloc_pd(&dev->ib_dev, NULL, NULL);
if (IS_ERR(devr->p0)) {
ret = PTR_ERR(devr->p0);
@@ -1418,6 +1409,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
strlcpy(dev->ib_dev.name, "mlx5_%d", IB_DEVICE_NAME_MAX);
dev->ib_dev.owner   = THIS_MODULE;
dev->ib_dev.node_type   = RDMA_NODE_IB_CA;
+   dev->ib_dev.local_dma_lkey  = 0 /* not supported for now */;
dev->num_ports  = MLX5_CAP_GEN(mdev, num_ports);
dev->ib_dev.phys_port_cnt = dev->num_ports;
dev->ib_dev.num_comp_vectors=
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fw.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
index aa0d5ffe92d8..9335e5ae18cc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fw.c
@@ -200,25 +200,3 @@ int mlx5_cmd_teardown_hca(struct mlx5_core_dev *dev)
 
return err;
 }
-
-int mlx5_core_query_special_context(struct mlx5_core_dev *dev, u32 *rsvd_lkey)
-{
-   struct mlx5_cmd_query_special_contexts_mbox_in in;
-   struct mlx5_cmd_query_special_contexts_mbox_out out;
-   int err;
-
-   memset(&in, 0, sizeof(in));
-   memset(&out, 0, sizeof(out));
-   in.hdr.opcode = cpu_to_be16(MLX5_CMD_OP_QUERY_SPECIAL_CONTEXTS);
-   err = mlx5_cmd_exec(dev, &in, sizeof(in), &out, sizeof(out));
-   if (err)
-   return err;
-
-   if (out.hdr.status)
-   err = mlx5_cmd_status_to_err(&out.hdr);
-
-   *rsvd_lkey = be32_to_cpu(out.resd_lkey);
-
-   return err;
-}
-EXPORT_SYMBOL(mlx5_core_query_special_context);
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index 6e4169c5ad78..b943cd9e2097 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -402,17 +402,6 @@ struct mlx5_cmd_teardown_hca_mbox_out {
u8  rsvd[8];
 };
 
-struct mlx5_cmd_query_special_contexts_mbox_in {
-   struct mlx5_inbox_hdr   hdr;
-   u8  rsvd[8];
-};
-
-struct mlx5_cmd_query_special_contexts_mbox_out {
-   struct mlx5_outbox_hdr  hdr;
-   __be32  dump_fill_mkey;
-   __be32  resd_lkey;
-};
-
 struct mlx5_cmd_layout {
u8  type;
u8  rsvd0[3];
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 1e2e48ccb3fd..5722d88c2429 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -828,7 +828,6 @@ void *mlx5_get_protocol_dev(struct mlx5_core_dev *mdev, int 
protocol);
 int mlx5_register_interface(struct mlx5_interface *intf);
 void mlx5_unregister_interface(struct mlx5_interface *intf);
 int mlx5_core_query_vendor_id(struct mlx5_core_dev *mdev, u32 *vendor_id);
-int mlx5_core_query_special_context(struct mlx5_core_

Re: [PATCH rdma-rc 0/4] Remove mlx5 support for IB_DEVICE_LOCAL_DMA_LKEY

2015-09-21 Thread Jason Gunthorpe
On Sun, Sep 20, 2015 at 12:52:43PM +0300, Sagi Grimberg wrote:
> The Connect-IB device has a specific issue with memory registration using
> the reserved lkey (device global_dma_lkey). This caused user-space memory
> registration which usually uses cached pre-registered memory keys to fail
> due to a device access error during registration. kernel-space memory
> registrations used an internal instance a physical memory key allocated with
> the private pd context, so this error didn't happen there.

I didn't read in super detail, but this all looks very sane to me.

Reviewed-By: Jason Gunthorpe 

Jason


Re: [PATCH rdma-rc 2/4] IB/iser: Add module parameter for always register memory

2015-09-21 Thread Jason Gunthorpe
On Sun, Sep 20, 2015 at 12:52:45PM +0300, Sagi Grimberg wrote:
> This module parameter forces memory registration even for
> a continuous memory region. It is true by default as sending
> an all-physical rkey with remote permissions might be insecure.
> 
> Signed-off-by: Sagi Grimberg 
>  drivers/infiniband/ulp/iser/iscsi_iser.c  |  5 +
>  drivers/infiniband/ulp/iser/iscsi_iser.h  |  1 +
>  drivers/infiniband/ulp/iser/iser_memory.c | 18 --
>  drivers/infiniband/ulp/iser/iser_verbs.c  | 21 +
>  4 files changed, 31 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c 
> b/drivers/infiniband/ulp/iser/iscsi_iser.c
> index 1ace5d83a4d7..bad9dd701d3c 100644
> +++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
> @@ -97,6 +97,11 @@ unsigned int iser_max_sectors = ISER_DEF_MAX_SECTORS;
>  module_param_named(max_sectors, iser_max_sectors, uint, S_IRUGO | S_IWUSR);
>  MODULE_PARM_DESC(max_sectors, "Max number of sectors in a single scsi 
> command (default:1024");
>  
> +bool iser_always_reg = true;
> +module_param_named(always_register, iser_always_reg, bool, S_IRUGO);
> +MODULE_PARM_DESC(always_register,
> +  "Always register memory, even for continuous memory regions 
> (default:false)");
 
^^^
Description doesn't match implementation?

Jason


Re: [PATCH 00/15] RDS: connection scalability and performance improvements

2015-09-21 Thread santosh shilimkar

On 9/20/2015 1:37 AM, Sagi Grimberg wrote:

On 9/20/2015 2:04 AM, Santosh Shilimkar wrote:

This series addresses RDS connection bottlenecks under massive workloads and
improves RDMA performance by almost 3X. RDS TCP also gets a small gain of
about 12%.

RDS is being used in massive systems with high scalability, where several
hundred thousand end points and tens of thousands of local processes
operate over tens of thousands of sockets. Being RC (reliable connection),
socket bind and release happen very often, and any inefficiency in bind
hash lookups hurts the overall system performance. The RDS bind hash-table
uses a global spin-lock, which is the biggest bottleneck. To make matters
worse, it uses RCU inside the global lock for hash buckets.
This is addressed by simply using a per-bucket rwlock, which makes the
locking simple and very efficient. The hash table size is also scaled up
accordingly.

For the RDS RDMA improvement, the completion handling is revamped so that
we can do batched completions. Both send and receive completion handlers
are split logically to achieve this. With RDS 8K messages being one of the
key use cases, the MR pool is adapted to hold 8K MRs along with the default
1M MRs. While doing this, a few fixes and a couple of bottlenecks seen in
rds_sendmsg() are also addressed.


Hi Santosh,

I think you can get a more effective code review if you CC the
linux-rdma mailing list.


I will do that from next time. Thanks Sagi !!

Regards,
Santosh


Re: [PATCH v1 03/18] xprtrdma: Remove completion polling budgets

2015-09-21 Thread Chuck Lever

> On Sep 21, 2015, at 1:51 AM, Devesh Sharma  
> wrote:
> 
> On Sun, Sep 20, 2015 at 4:05 PM, Sagi Grimberg  
> wrote:
 It is possible that in a given poll_cq
 call you end up getting only 1 completion; the other completion is
 delayed due to some reason.
>>> 
>>> 
>>> If a CQE is allowed to be delayed, how does polling
>>> again guarantee that the consumer can retrieve it?
>>> 
>>> What happens if a signal occurs, there is only one CQE,
>>> but it is delayed? ib_poll_cq would return 0 in that
>>> case, and the consumer would never call again, thinking
>>> the CQ is empty. There's no way the consumer can know
>>> for sure when a CQ is drained.
>>> 
>>> If the delayed CQE happens only when there is more
>>> than one CQE, how can polling multiple WCs ever work
>>> reliably?
>>> 
>>> Maybe I don't understand what is meant by delayed.
>>> 
>> 
>> If I'm not mistaken, Devesh meant that if another CQE is generated
>> between the ib_poll_cq call (where you polled the last 2 wcs) and the
>> while statement, then you lose a bit of efficiency. Correct?
> 
> Yes, That's the point.

I’m optimizing for the common case where 1 CQE is ready
to be polled. How much of an efficiency loss are you
talking about, how often would this loss occur, and is
this a problem for all providers / devices?

Is this an issue for the current arrangement where 8 WCs
are polled at a time?


—
Chuck Lever





Re: [patch] IB/hfi1: checking for NULL instead of IS_ERR

2015-09-21 Thread Doug Ledford
On 09/18/2015 11:01 PM, Greg Kroah-Hartman wrote:
> On Fri, Sep 18, 2015 at 11:51:09AM -0400, Doug Ledford wrote:
>> On 09/16/2015 02:22 AM, Dan Carpenter wrote:
>>> __get_txreq() returns an ERR_PTR() but this checks for NULL so it would
>>> oops on failure.
>>>
>>> Signed-off-by: Dan Carpenter 
>>
>> Thanks, applied.
> 
> Applied to what?  Should I just ignore these types of patches and not
> take them in my tree and you will send them on later on?  I don't
> remember what we agreed to do, sorry.

My understanding was that I would handle everything in the staging/rdma
area.  To that end, I tried to make it explicit so that people would
know that via the following things:

From MAINTAINERS:

INFINIBAND SUBSYSTEM
M:  Doug Ledford 
M:  Sean Hefty 
M:  Hal Rosenstock 
L:  linux-rdma@vger.kernel.org
W:  http://www.openfabrics.org/
Q:  http://patchwork.kernel.org/project/linux-rdma/list/
T:  git git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git
S:  Supported
F:  Documentation/infiniband/
F:  drivers/infiniband/
F:  drivers/staging/rdma/
F:  include/uapi/linux/if_infiniband.h
F:  include/uapi/rdma/
F:  include/rdma/


And from drivers/staging/rdma/Kconfig:

menuconfig STAGING_RDMA
bool "RDMA staging drivers"
depends on INFINIBAND
depends on PCI || BROKEN
depends on HAS_IOMEM
depends on NET
depends on INET
default n
---help---
  This option allows you to select a number of RDMA drivers that
  fall into one of two categories: deprecated drivers being held
  here before finally being removed or new drivers that still need
  some work before being moved to the normal RDMA driver area.

  If you wish to work on these drivers, to help improve them, or
  to report problems you have with them, please use the
  linux-rdma@vger.kernel.org mailing list.

  If in doubt, say N here.

I was hoping those two items would be sufficient to keep people from
flooding devel@ and yourself personally with fixups for these drivers, and
that they would instead send them through linux-rdma@.


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD






[PATCH] staging: rdma: add a blank line after function

2015-09-21 Thread Geliang Tang
Fixed warnings found by checkpatch.pl:

Please use a blank line after function/struct/union/enum declarations
  FILE: drivers/staging/rdma/amso1100/c2_mq.c:158:
  FILE: drivers/staging/rdma/hfi1/file_ops.c:2069:
  FILE: drivers/staging/rdma/hfi1/sdma.c:744:
  FILE: drivers/staging/rdma/hfi1/verbs.c:1202:

Signed-off-by: Geliang Tang 
---
 drivers/staging/rdma/amso1100/c2_mq.c | 1 +
 drivers/staging/rdma/hfi1/file_ops.c  | 1 +
 drivers/staging/rdma/hfi1/sdma.c  | 1 +
 drivers/staging/rdma/hfi1/verbs.c | 1 +
 4 files changed, 4 insertions(+)

diff --git a/drivers/staging/rdma/amso1100/c2_mq.c 
b/drivers/staging/rdma/amso1100/c2_mq.c
index 0cddc49..7827fb8 100644
--- a/drivers/staging/rdma/amso1100/c2_mq.c
+++ b/drivers/staging/rdma/amso1100/c2_mq.c
@@ -155,6 +155,7 @@ void c2_mq_req_init(struct c2_mq *q, u32 index, u32 q_size, 
u32 msg_size,
q->hint_count = 0;
return;
 }
+
 void c2_mq_rep_init(struct c2_mq *q, u32 index, u32 q_size, u32 msg_size,
u8 *pool_start, u16 __iomem *peer, u32 type)
 {
diff --git a/drivers/staging/rdma/hfi1/file_ops.c 
b/drivers/staging/rdma/hfi1/file_ops.c
index 72d3850..9a77221 100644
--- a/drivers/staging/rdma/hfi1/file_ops.c
+++ b/drivers/staging/rdma/hfi1/file_ops.c
@@ -2066,6 +2066,7 @@ static const struct file_operations ui_file_ops = {
.open = ui_open,
.release = ui_release,
 };
+
 #define UI_OFFSET 192  /* device minor offset for UI devices */
 static int create_ui = 1;
 
diff --git a/drivers/staging/rdma/hfi1/sdma.c b/drivers/staging/rdma/hfi1/sdma.c
index aecd1a7..78d016b 100644
--- a/drivers/staging/rdma/hfi1/sdma.c
+++ b/drivers/staging/rdma/hfi1/sdma.c
@@ -741,6 +741,7 @@ u16 sdma_get_descq_cnt(void)
return SDMA_DESCQ_CNT;
return count;
 }
+
 /**
  * sdma_select_engine_vl() - select sdma engine
  * @dd: devdata
diff --git a/drivers/staging/rdma/hfi1/verbs.c 
b/drivers/staging/rdma/hfi1/verbs.c
index 41bb59e..a43fccd 100644
--- a/drivers/staging/rdma/hfi1/verbs.c
+++ b/drivers/staging/rdma/hfi1/verbs.c
@@ -1199,6 +1199,7 @@ pio_bail:
}
return 0;
 }
+
 /*
  * egress_pkey_matches_entry - return 1 if the pkey matches ent (ent
  * being an entry from the ingress partition key table), return 0
-- 
2.5.0




[PATCH] IB/hfi1: class_name_user() should be static

2015-09-21 Thread Geliang Tang
Fixes the following sparse warning:
  drivers/staging/rdma/hfi1/device.c:127:12:
  warning: symbol 'class_name_user' was not declared. Should it be static?

Signed-off-by: Geliang Tang 
---
 drivers/staging/rdma/hfi1/device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/rdma/hfi1/device.c 
b/drivers/staging/rdma/hfi1/device.c
index bc26a53..58472e5 100644
--- a/drivers/staging/rdma/hfi1/device.c
+++ b/drivers/staging/rdma/hfi1/device.c
@@ -124,7 +124,7 @@ static char *hfi1_devnode(struct device *dev, umode_t *mode)
 }
 
 static const char *hfi1_class_name_user = "hfi1_user";
-const char *class_name_user(void)
+static const char *class_name_user(void)
 {
return hfi1_class_name_user;
 }
-- 
2.5.0




RE: [PATCH] staging/rdma/hfi1: do not use u8 to store a 32-bit integer

2015-09-21 Thread Marciniszyn, Mike
> Subject: [PATCH] staging/rdma/hfi1: do not use u8 to store a 32-bit integer
> 

Thanks for the patch!

Acked-by: Mike Marciniszyn 


RE: [PATCH] IB/hfi1: class_name_user() should be static

2015-09-21 Thread Marciniszyn, Mike
> Subject: [PATCH] IB/hfi1: class_name_user() should be static
> 
> Fixes the following sparse warning:
>   drivers/staging/rdma/hfi1/device.c:127:12:
>   warning: symbol 'class_name_user' was not declared. Should it be static?
> 
> Signed-off-by: Geliang Tang 
> ---

Thanks for the patch!

Acked-by: Mike Marciniszyn 


Re: [PATCH v1 for-next 0/7] Add support for multicast loopback prevention to mlx4

2015-09-21 Thread eran ben elisha
On Thu, Aug 20, 2015 at 5:34 PM, Eran Ben Elisha  wrote:
> Hi Doug,
>
> This patch-set adds a new implementation of multicast loopback prevention
> for the mlx4 driver.  The current implementation is very limited, especially
> if the link layer is Ethernet. The new implementation is based on a HW
> feature of dropping
> incoming multicast packets if the sender QP counter index is equal to the
> receiver counter index.
>
> Patch 0001 extends ib_uverbs_create_qp in order to allow receiving the
> multicast loopback flag at create flags.
> Patch 0002 adds an infrastructure for the counters' loopback prevention in the
> mlx4_core.
> Patch 0003 modifies mlx4_en QPs to use the new loopback prevention mode.
> Patches 0004-0006 implements this feature for mlx4_ib driver.
> Patch 0007 allows setting IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK in create_flag
> field both from uverbs and ib_create_qp.
>
> Changes from v0:
>   Move loopback assignment outside the for loop according to Yuval's comment
>   rebase over to-be-rebased/for-4.3
>
>
> Thanks,
> Eran.
>
> Eran Ben Elisha (5):
>   IB/core: Extend ib_uverbs_create_qp
>   IB/core: Allow setting create flags in QP init attribute
>   IB/mlx4: Add IB counters table
>   IB/mlx4: Add counter based implementation for QP multicast loopback
> block
>   IB/mlx4: Add support for blocking multicast loopback QP creation user
> flag
>
> Maor Gottlieb (2):
>   net/mlx4_core: Add support for filtering multicast loopback
>   net/mlx4_en: Implement mcast loopback prevention for ETH qps

Hi Doug,
This version has been sitting on the mailing list for a month with no comments.
It has been tested by Christoph Lameter.
When do you plan to take it into your tree?

Eran.

>
>  drivers/infiniband/core/uverbs.h   |   1 +
>  drivers/infiniband/core/uverbs_cmd.c   | 259 
> +++--
>  drivers/infiniband/core/uverbs_main.c  |   1 +
>  drivers/infiniband/hw/mlx4/mad.c   |  25 +-
>  drivers/infiniband/hw/mlx4/main.c  |  66 --
>  drivers/infiniband/hw/mlx4/mlx4_ib.h   |  10 +-
>  drivers/infiniband/hw/mlx4/qp.c|  88 ++-
>  drivers/net/ethernet/mellanox/mlx4/en_main.c   |  22 ++
>  drivers/net/ethernet/mellanox/mlx4/en_resources.c  |  25 ++
>  drivers/net/ethernet/mellanox/mlx4/fw.c|   6 +
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |   3 +-
>  drivers/net/ethernet/mellanox/mlx4/qp.c|  19 +-
>  .../net/ethernet/mellanox/mlx4/resource_tracker.c  |  30 ++-
>  include/linux/mlx4/device.h|   2 +
>  include/linux/mlx4/qp.h|  24 +-
>  include/uapi/rdma/ib_user_verbs.h  |  26 +++
>  16 files changed, 498 insertions(+), 109 deletions(-)
>
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/38] IB/ehca: fix handling idr_alloc result

2015-09-21 Thread Andrzej Hajda
The function can return a negative value.

The problem has been detected using proposed semantic patch
scripts/coccinelle/tests/unsigned_lesser_than_zero.cocci [1].

[1]: http://permalink.gmane.org/gmane.linux.kernel/2038576

Signed-off-by: Andrzej Hajda 
---
 drivers/staging/rdma/ehca/ehca_cq.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/rdma/ehca/ehca_cq.c 
b/drivers/staging/rdma/ehca/ehca_cq.c
index 9b68b17..ea1b5c1 100644
--- a/drivers/staging/rdma/ehca/ehca_cq.c
+++ b/drivers/staging/rdma/ehca/ehca_cq.c
@@ -130,7 +130,7 @@ struct ib_cq *ehca_create_cq(struct ib_device *device,
void *vpage;
u32 counter;
u64 rpage, cqx_fec, h_ret;
-   int ipz_rc, i;
+   int rc, i;
unsigned long flags;
 
if (attr->flags)
@@ -170,16 +170,17 @@ struct ib_cq *ehca_create_cq(struct ib_device *device,
 
idr_preload(GFP_KERNEL);
write_lock_irqsave(&ehca_cq_idr_lock, flags);
-   my_cq->token = idr_alloc(&ehca_cq_idr, my_cq, 0, 0x200, GFP_NOWAIT);
+   rc = idr_alloc(&ehca_cq_idr, my_cq, 0, 0x200, GFP_NOWAIT);
write_unlock_irqrestore(&ehca_cq_idr_lock, flags);
idr_preload_end();
 
-   if (my_cq->token < 0) {
+   if (rc < 0) {
cq = ERR_PTR(-ENOMEM);
ehca_err(device, "Can't allocate new idr entry. device=%p",
 device);
goto create_cq_exit1;
}
+   my_cq->token = rc;
 
/*
 * CQs maximum depth is 4GB-64, but we need additional 20 as buffer
@@ -195,11 +196,11 @@ struct ib_cq *ehca_create_cq(struct ib_device *device,
goto create_cq_exit2;
}
 
-   ipz_rc = ipz_queue_ctor(NULL, &my_cq->ipz_queue, param.act_pages,
+   rc = ipz_queue_ctor(NULL, &my_cq->ipz_queue, param.act_pages,
EHCA_PAGESIZE, sizeof(struct ehca_cqe), 0, 0);
-   if (!ipz_rc) {
+   if (!rc) {
ehca_err(device, "ipz_queue_ctor() failed ipz_rc=%i device=%p",
-ipz_rc, device);
+rc, device);
cq = ERR_PTR(-EINVAL);
goto create_cq_exit3;
}
-- 
1.9.1



Re: [PATCH 00/38] Fixes related to incorrect usage of unsigned types

2015-09-21 Thread David Howells
Andrzej Hajda  wrote:

> Semantic patch finds comparisons of types:
> unsigned < 0
> unsigned >= 0
> The former is always false, the latter is always true.
> Such comparisons are useless, so theoretically they could be
> safely removed, but their presence quite often indicates bugs.

Or someone has left them in because they don't matter and there's the
possibility that the type being tested might be or become signed under some
circumstances.  If the comparison is useless, I'd expect the compiler to just
discard it - for such cases your patch is pointless.

If I have, for example:

unsigned x;

if (x == 0 || x > 27)
give_a_range_error();

I will write this as:

unsigned x;

if (x <= 0 || x > 27)
give_a_range_error();

because that gives a way to handle x being changed to signed at some point
in the future for no cost.  In which case, your changing the <= to an ==
"because the < part of the case is useless" is arguably wrong.

David


RE: [PATCH] IB/hfi1: use kvfree() in sdma.c

2015-09-21 Thread Marciniszyn, Mike
> Subject: [PATCH] IB/hfi1: use kvfree() in sdma.c
> 
> Use kvfree() instead of open-coding it.
> 
> Signed-off-by: Geliang Tang 
> ---
>  drivers/staging/rdma/hfi1/sdma.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)

Thanks for the patch.

Acked-by: Mike Marciniszyn 


[PATCH] IB/cma: Potential NULL dereference in cma_id_from_event

2015-09-21 Thread Haggai Eran
If the lookup of a listening ID failed for an AF_IB request, the code
would try to call dev_put() on a NULL net_dev.

Fixes: be688195bd08 ("IB/cma: Fix net_dev reference leak with failed
requests")
Reported-by: Dan Carpenter 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/core/cma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index b1ab13f3e182..b92a3c2c060b 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1302,7 +1302,7 @@ static struct rdma_id_private *cma_id_from_event(struct 
ib_cm_id *cm_id,
bind_list = cma_ps_find(rdma_ps_from_service_id(req.service_id),
cma_port_from_service_id(req.service_id));
id_priv = cma_find_listener(bind_list, cm_id, ib_event, &req, *net_dev);
-   if (IS_ERR(id_priv)) {
+   if (IS_ERR(id_priv) && *net_dev) {
dev_put(*net_dev);
*net_dev = NULL;
}
-- 
1.7.11.2



[PATCH] IB/hfi1: use kvfree() in sdma.c

2015-09-21 Thread Geliang Tang
Use kvfree() instead of open-coding it.

Signed-off-by: Geliang Tang 
---
 drivers/staging/rdma/hfi1/sdma.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/sdma.c b/drivers/staging/rdma/hfi1/sdma.c
index aecd1a7..9c02a3d 100644
--- a/drivers/staging/rdma/hfi1/sdma.c
+++ b/drivers/staging/rdma/hfi1/sdma.c
@@ -966,10 +966,7 @@ static void sdma_clean(struct hfi1_devdata *dd, size_t 
num_engines)
sde->descq = NULL;
sde->descq_phys = 0;
}
-   if (is_vmalloc_addr(sde->tx_ring))
-   vfree(sde->tx_ring);
-   else
-   kfree(sde->tx_ring);
+   kvfree(sde->tx_ring);
sde->tx_ring = NULL;
}
spin_lock_irq(&dd->sde_map_lock);
-- 
1.9.1




Re: [PATCH v1 08/18] xprtrdma: Pre-allocate Work Requests for backchannel

2015-09-21 Thread Devesh Sharma
On Fri, Sep 18, 2015 at 2:15 AM, Chuck Lever  wrote:
> Pre-allocate extra send and receive Work Requests needed to handle
> backchannel receives and sends.
>
> The transport doesn't know how many extra WRs to pre-allocate until
> the xprt_setup_backchannel() call, but that's long after the WRs are
> allocated during forechannel setup.
>
> So, use a fixed value for now.
>
> Signed-off-by: Chuck Lever 
> ---
>  net/sunrpc/xprtrdma/backchannel.c |4 
>  net/sunrpc/xprtrdma/verbs.c   |   14 --
>  net/sunrpc/xprtrdma/xprt_rdma.h   |   10 ++
>  3 files changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/backchannel.c 
> b/net/sunrpc/xprtrdma/backchannel.c
> index c0a42ad..f5c7122 100644
> --- a/net/sunrpc/xprtrdma/backchannel.c
> +++ b/net/sunrpc/xprtrdma/backchannel.c
> @@ -123,6 +123,9 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned 
> int reqs)
>  * Twice as many rpc_rqsts are prepared to ensure there is
>  * always an rpc_rqst available as soon as a reply is sent.
>  */
> +   if (reqs > RPCRDMA_BACKWARD_WRS >> 1)
> +   goto out_err;
> +
> for (i = 0; i < (reqs << 1); i++) {
> rqst = kzalloc(sizeof(*rqst), GFP_KERNEL);
> if (!rqst) {
> @@ -159,6 +162,7 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned 
> int reqs)
>  out_free:
> xprt_rdma_bc_destroy(xprt, reqs);
>
> +out_err:
> pr_err("RPC:   %s: setup backchannel transport failed\n", 
> __func__);
> return -ENOMEM;
>  }
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 1e4a948..133c720 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -614,6 +614,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct 
> rpcrdma_ia *ia,
> struct ib_device_attr *devattr = &ia->ri_devattr;
> struct ib_cq *sendcq, *recvcq;
> struct ib_cq_init_attr cq_attr = {};
> +   unsigned int max_qp_wr;
> int rc, err;
>
> if (devattr->max_sge < RPCRDMA_MAX_IOVS) {
> @@ -622,18 +623,27 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct 
> rpcrdma_ia *ia,
> return -ENOMEM;
> }
>
> +   if (devattr->max_qp_wr <= RPCRDMA_BACKWARD_WRS) {
> +   dprintk("RPC:   %s: insufficient wqe's available\n",
> +   __func__);
> +   return -ENOMEM;
> +   }
> +   max_qp_wr = devattr->max_qp_wr - RPCRDMA_BACKWARD_WRS;
> +
> /* check provider's send/recv wr limits */
> -   if (cdata->max_requests > devattr->max_qp_wr)
> -   cdata->max_requests = devattr->max_qp_wr;
> +   if (cdata->max_requests > max_qp_wr)
> +   cdata->max_requests = max_qp_wr;

Should we do
cdata->max_requests = max_qp_wr - RPCRDMA_BACKWARD_WRS?

>
> ep->rep_attr.event_handler = rpcrdma_qp_async_error_upcall;
> ep->rep_attr.qp_context = ep;
> ep->rep_attr.srq = NULL;
> ep->rep_attr.cap.max_send_wr = cdata->max_requests;
> +   ep->rep_attr.cap.max_send_wr += RPCRDMA_BACKWARD_WRS;

Looks like this will cause a QP-create failure if a hypothetical device
supports only devattr->max_qp_wr == cdata->max_requests

> rc = ia->ri_ops->ro_open(ia, ep, cdata);
> if (rc)
> return rc;
> ep->rep_attr.cap.max_recv_wr = cdata->max_requests;
> +   ep->rep_attr.cap.max_recv_wr += RPCRDMA_BACKWARD_WRS;
> ep->rep_attr.cap.max_send_sge = RPCRDMA_MAX_IOVS;
> ep->rep_attr.cap.max_recv_sge = 1;
> ep->rep_attr.cap.max_inline_data = 0;
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index 2ca0567..37d0d7f 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -101,6 +101,16 @@ struct rpcrdma_ep {
>   */
>  #define RPCRDMA_IGNORE_COMPLETION  (0ULL)
>
> +/* Pre-allocate extra Work Requests for handling backward receives
> + * and sends. This is a fixed value because the Work Queues are
> + * allocated when the forward channel is set up.
> + */
> +#if defined(CONFIG_SUNRPC_BACKCHANNEL)
> +#define RPCRDMA_BACKWARD_WRS   (8)
> +#else
> +#define RPCRDMA_BACKWARD_WRS   (0)
> +#endif
> +
>  /* Registered buffer -- registered kmalloc'd memory for RDMA SEND/RECV
>   *
>   * The below structure appears at the front of a large region of kmalloc'd
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 07/18] xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers

2015-09-21 Thread Devesh Sharma
Looks good.

On Fri, Sep 18, 2015 at 2:15 AM, Chuck Lever  wrote:
> xprtrdma's backward direction send and receive buffers are the same
> size as the forechannel's inline threshold, and must be pre-
> registered.
>
> The consumer has no control over which receive buffer the adapter
> chooses to catch an incoming backwards-direction call. Any receive
> buffer can be used for either a forward reply or a backward call.
> Thus both types of RPC message must all be the same size.
>
> Signed-off-by: Chuck Lever 
> ---
>  net/sunrpc/xprtrdma/Makefile  |1
>  net/sunrpc/xprtrdma/backchannel.c |  204 
> +
>  net/sunrpc/xprtrdma/transport.c   |7 +
>  net/sunrpc/xprtrdma/verbs.c   |   92 ++---
>  net/sunrpc/xprtrdma/xprt_rdma.h   |   20 
>  5 files changed, 309 insertions(+), 15 deletions(-)
>  create mode 100644 net/sunrpc/xprtrdma/backchannel.c
>
> diff --git a/net/sunrpc/xprtrdma/Makefile b/net/sunrpc/xprtrdma/Makefile
> index 48913de..33f99d3 100644
> --- a/net/sunrpc/xprtrdma/Makefile
> +++ b/net/sunrpc/xprtrdma/Makefile
> @@ -5,3 +5,4 @@ rpcrdma-y := transport.o rpc_rdma.o verbs.o \
> svc_rdma.o svc_rdma_transport.o \
> svc_rdma_marshal.o svc_rdma_sendto.o svc_rdma_recvfrom.o \
> module.o
> +rpcrdma-$(CONFIG_SUNRPC_BACKCHANNEL) += backchannel.o
> diff --git a/net/sunrpc/xprtrdma/backchannel.c 
> b/net/sunrpc/xprtrdma/backchannel.c
> new file mode 100644
> index 000..c0a42ad
> --- /dev/null
> +++ b/net/sunrpc/xprtrdma/backchannel.c
> @@ -0,0 +1,204 @@
> +/*
> + * Copyright (c) 2015 Oracle.  All rights reserved.
> + *
> + * Support for backward direction RPCs on RPC/RDMA.
> + */
> +
> +#include 
> +
> +#include "xprt_rdma.h"
> +
> +#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
> +# define RPCDBG_FACILITY   RPCDBG_TRANS
> +#endif
> +
> +static void rpcrdma_bc_free_rqst(struct rpcrdma_xprt *r_xprt,
> +struct rpc_rqst *rqst)
> +{
> +   struct rpcrdma_buffer *buf = &r_xprt->rx_buf;
> +   struct rpcrdma_req *req = rpcr_to_rdmar(rqst);
> +
> +   spin_lock(&buf->rb_reqslock);
> +   list_del(&req->rl_all);
> +   spin_unlock(&buf->rb_reqslock);
> +
> +   rpcrdma_destroy_req(&r_xprt->rx_ia, req);
> +
> +   kfree(rqst);
> +}
> +
> +static int rpcrdma_bc_setup_rqst(struct rpcrdma_xprt *r_xprt,
> +struct rpc_rqst *rqst)
> +{
> +   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
> +   struct rpcrdma_regbuf *rb;
> +   struct rpcrdma_req *req;
> +   struct xdr_buf *buf;
> +   size_t size;
> +
> +   req = rpcrdma_create_req(r_xprt);
> +   if (!req)
> +   return -ENOMEM;
> +   req->rl_backchannel = true;
> +
> +   size = RPCRDMA_INLINE_WRITE_THRESHOLD(rqst);
> +   rb = rpcrdma_alloc_regbuf(ia, size, GFP_KERNEL);
> +   if (IS_ERR(rb))
> +   goto out_fail;
> +   req->rl_rdmabuf = rb;
> +
> +   size += RPCRDMA_INLINE_READ_THRESHOLD(rqst);
> +   rb = rpcrdma_alloc_regbuf(ia, size, GFP_KERNEL);
> +   if (IS_ERR(rb))
> +   goto out_fail;
> +   rb->rg_owner = req;
> +   req->rl_sendbuf = rb;
> +   /* so that rpcr_to_rdmar works when receiving a request */
> +   rqst->rq_buffer = (void *)req->rl_sendbuf->rg_base;
> +
> +   buf = &rqst->rq_snd_buf;
> +   buf->head[0].iov_base = rqst->rq_buffer;
> +   buf->head[0].iov_len = 0;
> +   buf->tail[0].iov_base = NULL;
> +   buf->tail[0].iov_len = 0;
> +   buf->page_len = 0;
> +   buf->len = 0;
> +   buf->buflen = size;
> +
> +   return 0;
> +
> +out_fail:
> +   rpcrdma_bc_free_rqst(r_xprt, rqst);
> +   return -ENOMEM;
> +}
> +
> +/* Allocate and add receive buffers to the rpcrdma_buffer's existing
> + * list of rep's. These are released when the transport is destroyed. */
> +static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
> +unsigned int count)
> +{
> +   struct rpcrdma_buffer *buffers = &r_xprt->rx_buf;
> +   struct rpcrdma_rep *rep;
> +   unsigned long flags;
> +   int rc = 0;
> +
> +   while (count--) {
> +   rep = rpcrdma_create_rep(r_xprt);
> +   if (IS_ERR(rep)) {
> +   pr_err("RPC:   %s: reply buffer alloc failed\n",
> +  __func__);
> +   rc = PTR_ERR(rep);
> +   break;
> +   }
> +
> +   spin_lock_irqsave(&buffers->rb_lock, flags);
> +   list_add(&rep->rr_list, &buffers->rb_recv_bufs);
> +   spin_unlock_irqrestore(&buffers->rb_lock, flags);
> +   }
> +
> +   return rc;
> +}
> +
> +/**
> + * xprt_rdma_bc_setup - Pre-allocate resources for handling backchannel 
> requests
> + * @xprt: transport associated with these backchannel resources
> + * @reqs: number of concurrent incoming requests to expect
> + *
> + * Returns 0 on success; oth

Re: [PATCH v1 04/18] xprtrdma: Refactor reply handler error handling

2015-09-21 Thread Devesh Sharma
Looks good.

On Fri, Sep 18, 2015 at 2:14 AM, Chuck Lever  wrote:
> Clean up: The error cases in rpcrdma_reply_handler() almost never
> execute. Ensure the compiler places them out of the hot path.
>
> No behavior change expected.
>
> Signed-off-by: Chuck Lever 
> ---
>  net/sunrpc/xprtrdma/rpc_rdma.c  |   90 
> ++-
>  net/sunrpc/xprtrdma/verbs.c |2 -
>  net/sunrpc/xprtrdma/xprt_rdma.h |2 +
>  3 files changed, 54 insertions(+), 40 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index bc8bd65..287c874 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -741,52 +741,27 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
> unsigned long cwnd;
> u32 credits;
>
> -   /* Check status. If bad, signal disconnect and return rep to pool */
> -   if (rep->rr_len == ~0U) {
> -   rpcrdma_recv_buffer_put(rep);
> -   if (r_xprt->rx_ep.rep_connected == 1) {
> -   r_xprt->rx_ep.rep_connected = -EIO;
> -   rpcrdma_conn_func(&r_xprt->rx_ep);
> -   }
> -   return;
> -   }
> -   if (rep->rr_len < RPCRDMA_HDRLEN_MIN) {
> -   dprintk("RPC:   %s: short/invalid reply\n", __func__);
> -   goto repost;
> -   }
> +   dprintk("RPC:   %s: incoming rep %p\n", __func__, rep);
> +
> +   if (rep->rr_len == RPCRDMA_BAD_LEN)
> +   goto out_badstatus;
> +   if (rep->rr_len < RPCRDMA_HDRLEN_MIN)
> +   goto out_shortreply;
> +
> headerp = rdmab_to_msg(rep->rr_rdmabuf);
> -   if (headerp->rm_vers != rpcrdma_version) {
> -   dprintk("RPC:   %s: invalid version %d\n",
> -   __func__, be32_to_cpu(headerp->rm_vers));
> -   goto repost;
> -   }
> +   if (headerp->rm_vers != rpcrdma_version)
> +   goto out_badversion;
>
> /* Get XID and try for a match. */
> spin_lock(&xprt->transport_lock);
> rqst = xprt_lookup_rqst(xprt, headerp->rm_xid);
> -   if (rqst == NULL) {
> -   spin_unlock(&xprt->transport_lock);
> -   dprintk("RPC:   %s: reply 0x%p failed "
> -   "to match any request xid 0x%08x len %d\n",
> -   __func__, rep, be32_to_cpu(headerp->rm_xid),
> -   rep->rr_len);
> -repost:
> -   r_xprt->rx_stats.bad_reply_count++;
> -   if (rpcrdma_ep_post_recv(&r_xprt->rx_ia, &r_xprt->rx_ep, rep))
> -   rpcrdma_recv_buffer_put(rep);
> -
> -   return;
> -   }
> +   if (!rqst)
> +   goto out_nomatch;
>
> /* get request object */
> req = rpcr_to_rdmar(rqst);
> -   if (req->rl_reply) {
> -   spin_unlock(&xprt->transport_lock);
> -   dprintk("RPC:   %s: duplicate reply 0x%p to RPC "
> -   "request 0x%p: xid 0x%08x\n", __func__, rep, req,
> -   be32_to_cpu(headerp->rm_xid));
> -   goto repost;
> -   }
> +   if (req->rl_reply)
> +   goto out_duplicate;
>
> dprintk("RPC:   %s: reply 0x%p completes request 0x%p\n"
> "   RPC request 0x%p xid 0x%08x\n",
> @@ -883,8 +858,45 @@ badheader:
> if (xprt->cwnd > cwnd)
> xprt_release_rqst_cong(rqst->rq_task);
>
> +   xprt_complete_rqst(rqst->rq_task, status);
> +   spin_unlock(&xprt->transport_lock);
> dprintk("RPC:   %s: xprt_complete_rqst(0x%p, 0x%p, %d)\n",
> __func__, xprt, rqst, status);
> -   xprt_complete_rqst(rqst->rq_task, status);
> +   return;
> +
> +out_badstatus:
> +   rpcrdma_recv_buffer_put(rep);
> +   if (r_xprt->rx_ep.rep_connected == 1) {
> +   r_xprt->rx_ep.rep_connected = -EIO;
> +   rpcrdma_conn_func(&r_xprt->rx_ep);
> +   }
> +   return;
> +
> +out_shortreply:
> +   dprintk("RPC:   %s: short/invalid reply\n", __func__);
> +   goto repost;
> +
> +out_badversion:
> +   dprintk("RPC:   %s: invalid version %d\n",
> +   __func__, be32_to_cpu(headerp->rm_vers));
> +   goto repost;
> +
> +out_nomatch:
> +   spin_unlock(&xprt->transport_lock);
> +   dprintk("RPC:   %s: reply 0x%p failed "
> +   "to match any request xid 0x%08x len %d\n",
> +   __func__, rep, be32_to_cpu(headerp->rm_xid),
> +   rep->rr_len);
> +   goto repost;
> +
> +out_duplicate:
> spin_unlock(&xprt->transport_lock);
> +   dprintk("RPC:   %s: duplicate reply 0x%p to RPC "
> +   "request 0x%p: xid 0x%08x\n", __func__, rep, req,
> +   be32_to_cpu(headerp->rm_xid));
> +
> +repost:
> +   r_xprt->rx_stats.bad_reply_count++;
> +   if (rpcrdma_ep_post_recv(&r_

Re: [PATCH v1 02/18] xprtrdma: Replace global lkey with lkey local to PD

2015-09-21 Thread Devesh Sharma
Looks good, will test this on ocrdma and update you.

On Fri, Sep 18, 2015 at 2:14 AM, Chuck Lever  wrote:
> The core API has changed so that devices that do not have a global
> DMA lkey automatically create an mr, per-PD, and make that lkey
> available. The global DMA lkey interface is going away in favor of
> the per-PD DMA lkey.
>
> The per-PD DMA lkey is always available. Convert xprtrdma to use the
> device's per-PD DMA lkey for regbufs, no matter which memory
> registration scheme is in use.
>
> Signed-off-by: Chuck Lever 
> ---
>  net/sunrpc/xprtrdma/fmr_ops.c  |   19 ---
>  net/sunrpc/xprtrdma/frwr_ops.c |5 -
>  net/sunrpc/xprtrdma/physical_ops.c |   10 +-
>  net/sunrpc/xprtrdma/verbs.c|2 +-
>  net/sunrpc/xprtrdma/xprt_rdma.h|1 -
>  5 files changed, 2 insertions(+), 35 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
> index cb25c89..f1e8daf 100644
> --- a/net/sunrpc/xprtrdma/fmr_ops.c
> +++ b/net/sunrpc/xprtrdma/fmr_ops.c
> @@ -39,25 +39,6 @@ static int
>  fmr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
> struct rpcrdma_create_data_internal *cdata)
>  {
> -   struct ib_device_attr *devattr = &ia->ri_devattr;
> -   struct ib_mr *mr;
> -
> -   /* Obtain an lkey to use for the regbufs, which are
> -* protected from remote access.
> -*/
> -   if (devattr->device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY) {
> -   ia->ri_dma_lkey = ia->ri_device->local_dma_lkey;
> -   } else {
> -   mr = ib_get_dma_mr(ia->ri_pd, IB_ACCESS_LOCAL_WRITE);
> -   if (IS_ERR(mr)) {
> -   pr_err("%s: ib_get_dma_mr for failed with %lX\n",
> -  __func__, PTR_ERR(mr));
> -   return -ENOMEM;
> -   }
> -   ia->ri_dma_lkey = ia->ri_dma_mr->lkey;
> -   ia->ri_dma_mr = mr;
> -   }
> -
> return 0;
>  }
>
> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
> index 21b3efb..004f1ad 100644
> --- a/net/sunrpc/xprtrdma/frwr_ops.c
> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
> @@ -189,11 +189,6 @@ frwr_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep 
> *ep,
> struct ib_device_attr *devattr = &ia->ri_devattr;
> int depth, delta;
>
> -   /* Obtain an lkey to use for the regbufs, which are
> -* protected from remote access.
> -*/
> -   ia->ri_dma_lkey = ia->ri_device->local_dma_lkey;
> -
> ia->ri_max_frmr_depth =
> min_t(unsigned int, RPCRDMA_MAX_DATA_SEGS,
>   devattr->max_fast_reg_page_list_len);
> diff --git a/net/sunrpc/xprtrdma/physical_ops.c 
> b/net/sunrpc/xprtrdma/physical_ops.c
> index 72cf8b1..617b76f 100644
> --- a/net/sunrpc/xprtrdma/physical_ops.c
> +++ b/net/sunrpc/xprtrdma/physical_ops.c
> @@ -23,7 +23,6 @@ static int
>  physical_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep *ep,
>  struct rpcrdma_create_data_internal *cdata)
>  {
> -   struct ib_device_attr *devattr = &ia->ri_devattr;
> struct ib_mr *mr;
>
> /* Obtain an rkey to use for RPC data payloads.
> @@ -37,15 +36,8 @@ physical_op_open(struct rpcrdma_ia *ia, struct rpcrdma_ep 
> *ep,
>__func__, PTR_ERR(mr));
> return -ENOMEM;
> }
> -   ia->ri_dma_mr = mr;
> -
> -   /* Obtain an lkey to use for regbufs.
> -*/
> -   if (devattr->device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
> -   ia->ri_dma_lkey = ia->ri_device->local_dma_lkey;
> -   else
> -   ia->ri_dma_lkey = ia->ri_dma_mr->lkey;
>
> +   ia->ri_dma_mr = mr;
> return 0;
>  }
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index 01a314a..8a477e2 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -1255,7 +1255,7 @@ rpcrdma_alloc_regbuf(struct rpcrdma_ia *ia, size_t 
> size, gfp_t flags)
> goto out_free;
>
> iov->length = size;
> -   iov->lkey = ia->ri_dma_lkey;
> +   iov->lkey = ia->ri_pd->local_dma_lkey;
> rb->rg_size = size;
> rb->rg_owner = NULL;
> return rb;
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index 0251222..c09414e 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -65,7 +65,6 @@ struct rpcrdma_ia {
> struct rdma_cm_id   *ri_id;
> struct ib_pd*ri_pd;
> struct ib_mr*ri_dma_mr;
> -   u32 ri_dma_lkey;
> struct completion   ri_done;
> int ri_async_rc;
> unsigned intri_max_frmr_depth;
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v1 01/18] xprtrdma: Enable swap-on-NFS/RDMA

2015-09-21 Thread Devesh Sharma
Looks Good.

On Fri, Sep 18, 2015 at 2:14 AM, Chuck Lever  wrote:
> After adding a swapfile on an NFS/RDMA mount and removing the
> normal swap partition, I was able to push the NFS client well
> into swap without any issue.
>
> I forgot to swapoff the NFS file before rebooting. This pinned
> the NFS mount and the IB core and provider, causing shutdown to
> hang. I think this is expected and safe behavior. Probably
> shutdown scripts should "swapoff -a" before unmounting any
> filesystems.
>
> Signed-off-by: Chuck Lever 
> ---
>  net/sunrpc/xprtrdma/transport.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index 41e452b..e9e5ed7 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -676,7 +676,7 @@ static void xprt_rdma_print_stats(struct rpc_xprt *xprt, 
> struct seq_file *seq)
>  static int
>  xprt_rdma_enable_swap(struct rpc_xprt *xprt)
>  {
> -   return -EINVAL;
> +   return 0;
>  }
>
>  static void
>


Re: [PATCH v1 03/18] xprtrdma: Remove completion polling budgets

2015-09-21 Thread Devesh Sharma
On Sun, Sep 20, 2015 at 4:05 PM, Sagi Grimberg  wrote:
>>> It is possible that in a given poll_cq
>>> call you end up getting only 1 completion; the other completion is
>>> delayed due to some reason.
>>
>>
>> If a CQE is allowed to be delayed, how does polling
>> again guarantee that the consumer can retrieve it?
>>
>> What happens if a signal occurs, there is only one CQE,
>> but it is delayed? ib_poll_cq would return 0 in that
>> case, and the consumer would never call again, thinking
>> the CQ is empty. There's no way the consumer can know
>> for sure when a CQ is drained.
>>
>> If the delayed CQE happens only when there is more
>> than one CQE, how can polling multiple WCs ever work
>> reliably?
>>
>> Maybe I don't understand what is meant by delayed.
>>
>
> If I'm not mistaken, Devesh meant that if, between ib_poll_cq (where you
> polled the last 2 wcs) and the while statement, another CQE was
> generated, then you lose a bit of efficiency. Correct?

Yes, That's the point.

>
>
>>
>>> Would it be better to poll for 1 in every
>>> poll call? Or
>>> otherwise have this
>>> while ( rc <= ARRAY_SIZE(wcs) && rc);
>>>
>


Re: [PATCH v1 03/18] xprtrdma: Remove completion polling budgets

2015-09-21 Thread Devesh Sharma
On Fri, Sep 18, 2015 at 7:49 PM, Chuck Lever  wrote:
> Hi Devesh-
>
>
> On Sep 18, 2015, at 2:52 AM, Devesh Sharma  
> wrote:
>
>> On Fri, Sep 18, 2015 at 2:14 AM, Chuck Lever  wrote:
>>>
>>> Commit 8301a2c047cc ("xprtrdma: Limit work done by completion
>>> handler") was supposed to prevent xprtrdma's upcall handlers from
>>> starving other softIRQ work by letting them return to the provider
>>> before all CQEs have been polled.
>>>
>>> The logic assumes the provider will call the upcall handler again
>>> immediately if the CQ is re-armed while there are still queued CQEs.
>>>
>>> This assumption is invalid. The IBTA spec says that after a CQ is
>>> armed, the hardware must interrupt only when a new CQE is inserted.
>>> xprtrdma can't rely on the provider calling again, even though some
>>> providers do.
>>>
>>> Therefore, leaving CQEs on queue makes sense only when there is
>>> another mechanism that ensures all remaining CQEs are consumed in a
>>> timely fashion. xprtrdma does not have such a mechanism. If a CQE
>>> remains queued, the transport can wait forever to send the next RPC.
>>>
>>> Finally, move the wcs array back onto the stack to ensure that the
>>> poll array is always local to the CPU where the completion upcall is
>>> running.
>>>
>>> Fixes: 8301a2c047cc ("xprtrdma: Limit work done by completion ...")
>>> Signed-off-by: Chuck Lever 
>>> ---
>>> net/sunrpc/xprtrdma/verbs.c |  100 
>>> ++-
>>> net/sunrpc/xprtrdma/xprt_rdma.h |5 --
>>> 2 files changed, 45 insertions(+), 60 deletions(-)
>>>
>>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>>> index 8a477e2..f2e3863 100644
>>> --- a/net/sunrpc/xprtrdma/verbs.c
>>> +++ b/net/sunrpc/xprtrdma/verbs.c
>>> @@ -158,34 +158,37 @@ rpcrdma_sendcq_process_wc(struct ib_wc *wc)
>>>}
>>> }
>>>
>>> -static int
>>> +/* The wc array is on stack: automatic memory is always CPU-local.
>>> + *
>>> + * The common case is a single completion is ready. By asking
>>> + * for two entries, a return code of 1 means there is exactly
>>> + * one completion and no more. We don't have to poll again to
>>> + * know that the CQ is now empty.
>>> + */
>>> +static void
>>> rpcrdma_sendcq_poll(struct ib_cq *cq, struct rpcrdma_ep *ep)
>>> {
>>> -   struct ib_wc *wcs;
>>> -   int budget, count, rc;
>>> +   struct ib_wc *pos, wcs[2];
>>> +   int count, rc;
>>>
>>> -   budget = RPCRDMA_WC_BUDGET / RPCRDMA_POLLSIZE;
>>>do {
>>> -   wcs = ep->rep_send_wcs;
>>> +   pos = wcs;
>>>
>>> -   rc = ib_poll_cq(cq, RPCRDMA_POLLSIZE, wcs);
>>> -   if (rc <= 0)
>>> -   return rc;
>>> +   rc = ib_poll_cq(cq, ARRAY_SIZE(wcs), pos);
>>> +   if (rc < 0)
>>> +   goto out_warn;
>>>
>>>count = rc;
>>>while (count-- > 0)
>>> -   rpcrdma_sendcq_process_wc(wcs++);
>>> -   } while (rc == RPCRDMA_POLLSIZE && --budget);
>>> -   return 0;
>>> +   rpcrdma_sendcq_process_wc(pos++);
>>> +   } while (rc == ARRAY_SIZE(wcs));
>>
>> I think I have missed something and not able to understand the reason
>> for polling 2 CQEs in one poll?
>
> See the block comment above.
>
> When ib_poll_cq() returns the same number of WCs as the
> consumer requested, there may still be CQEs waiting to
> be polled. Another call to ib_poll_cq() is needed to find
> out if that's the case.

True...

>
> When ib_poll_cq() returns fewer WCs than the consumer
> requested, the consumer doesn't have to call again to
> know that the CQ is empty.

Agree, the while loop will terminate here. But what if, immediately after
vendor_poll_cq() decides to report only 1 CQE and terminates the
polling loop, another CQE is added? That new CQE will be polled only
after T usec (where T is the interrupt latency).

>
> The common case, by far, is that one CQE is ready. By
> requesting two, the number returned is less than the
> number requested, and the consumer can tell immediately
> that the CQE is drained. The extra ib_poll_cq call is
> avoided.
>
> Note that the existing logic also relies on polling
> multiple WCs at once, and stopping the loop when the
> number of returned WCs is less than the size of the
> array.

There is also logic to perform extra polling if arm_cq reports missed CQEs.
If we change the while loop, the arm-cq missed-CQE reporting logic may be removed.
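[The arm-cq path referred to here maps to ib_req_notify_cq() with the
IB_CQ_REPORT_MISSED_EVENTS flag, which returns a positive value when CQEs
were queued after the last poll and thus will not generate an interrupt.
A userspace sketch of how that closes the race window — the mock_*
functions are hypothetical stand-ins for the verbs API, and late_cqe
models a completion that lands between the last poll and the re-arm:]

```c
#include <assert.h>

static int cq_depth;	/* CQEs queued in the simulated CQ */
static int late_cqe;	/* CQEs that race in after the consumer stops polling */

static int mock_poll_cq(int num_entries)
{
	int n = cq_depth < num_entries ? cq_depth : num_entries;

	cq_depth -= n;
	return n;
}

/* Stand-in for ib_req_notify_cq(cq, IB_CQ_NEXT_COMP |
 * IB_CQ_REPORT_MISSED_EVENTS): re-arms the CQ and returns a positive
 * value if completions are already queued, i.e. no interrupt will fire
 * for them.  The racing CQE is injected here to model the window
 * between the final poll and the re-arm. */
static int mock_req_notify_missed(void)
{
	cq_depth += late_cqe;
	late_cqe = 0;
	return cq_depth > 0;
}

/* Upcall handler skeleton: drain, then re-arm; if the re-arm reports
 * missed events, go around again so no CQE waits for the next interrupt. */
static int handle_cq(void)
{
	int rc, handled = 0;

	do {
		do {
			rc = mock_poll_cq(2);
			handled += rc;
		} while (rc == 2);
	} while (mock_req_notify_missed() > 0);
	return handled;
}
```

[With this shape, a CQE that slips in after the drain loop exits is still
handled in the same upcall rather than waiting T usec for the next
interrupt.]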

>
>
>> It is possible that in a given poll_cq
>> call you end up getting only 1 completion; the other completion is
>> delayed due to some reason.
>
> If a CQE is allowed to be delayed, how does polling
> again guarantee that the consumer can retrieve it?

It's possible that at the moment vendor_poll_cq() looked at the CQ memory
buffer and decided to report 1 CQE, another CQE was in flight, but
poll_cq had already decided not to report 2.

>
> What happens if a signal occurs, there is only one CQE,
> but it is delayed?