Re: [PATCH 03/37] IB/rdmavt: Add protection domain to rdmavt.

2015-12-10 Thread Dennis Dalessandro

On Tue, Dec 08, 2015 at 09:20:27AM +0200, Leon Romanovsky wrote:

On Tue, Dec 08, 2015 at 08:28:17AM +0200, Leon Romanovsky wrote:

On Mon, Dec 07, 2015 at 03:43:10PM -0500, Dennis Dalessandro wrote:
> +
> +/*
> + * Things that are driver specific, module parameters in hfi1 and qib
> + */
> +struct rvt_driver_params {
> +  int max_pds;
Can it be negative value?
> +};

Forget my question, I see, you removed this variable in the following patch.


Well removed it from the rvt structure, but the concept still exists. I'm 
now using struct ib_device_attr.max_pd.


-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 06/13] irq_poll: remove unused data and max fields

2015-12-10 Thread Bart Van Assche

On 12/07/2015 12:51 PM, Christoph Hellwig wrote:

[ ... ]


Reviewed-by: Bart Van Assche 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/13] IB/srpt: use the new CQ API

2015-12-10 Thread Bart Van Assche

On 12/07/2015 12:51 PM, Christoph Hellwig wrote:

[ ... ]


Reviewed-by: Bart Van Assche 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/37] IB/rdmavt: Add protection domain to rdmavt.

2015-12-10 Thread ira.weiny
On Thu, Dec 10, 2015 at 11:49:40AM -0500, Dalessandro, Dennis wrote:
> On Tue, Dec 08, 2015 at 02:52:36PM -0500, ira. weiny wrote:
> >On Tue, Dec 08, 2015 at 01:17:48PM -0600, Sean Hefty wrote:
> >>> > > +struct ib_pd *rvt_alloc_pd(struct ib_device *ibdev,
> >>> > > +struct ib_ucontext *context,
> >>> > > +struct ib_udata *udata)
> >>> > > +{
> >>> > > + struct rvt_dev_info *dev = ib_to_rvt(ibdev);
> >>> > > + struct rvt_pd *pd;
> >>> > > + struct ib_pd *ret;
> >>> > > +
> >>> > > + pd = kmalloc(sizeof(*pd), GFP_KERNEL);
> >>> > > + if (!pd) {
> >>> > > + ret = ERR_PTR(-ENOMEM);
> >>> > > + goto bail;
> >>> > > + }
> >>> > > + /*
> >>> > > +  * This is actually totally arbitrary.  Some correctness 
> >>tests
> >>> > > +  * assume there's a maximum number of PDs that can be 
> >>allocated.
> >>> > > +  * We don't actually have this limit, but we fail the test if
> >>> > > +  * we allow allocations of more than we report for this 
> >>value.
> >>> > > +  */
> >>> >
> >>> > Why not trap this in user space, rather than forcing the kernel to
> >>> support some test program?
> >>> >
> >>> 
> >>> What do you mean "trap this in user space"?  This code is not supporting
> >>> any
> >>> particular test program.
> >>> 
> >>> If users try and allocate more protection domains then are advertised 
> >>then
> >>> an
> >>> error should be returned.
> >>> 
> >>> Perhaps the comment should be removed if it is confusing?
> >>
> >>There is no theoretical limit on the number of PDs, or likely most other
> >>resources.  So why define and enforce an arbitrary limit?  The 
> >>justification
> >>given was that some test program would fail.  If there's a concern about 
> >>that
> >>test program passing, then enforce the limit in user space.
> >
> >I did not interpret this as being driven by a test but rather by the IBTA 
> >spec.
> >
> >This may be pedantic but, wouldn't it be confusing from a user POV to see 
> >an
> >advertised limit but then not be constrained by it?  I'm not opposed to 
> >setting
> >this limit arbitrarily high, but I still think the implementation should
> >enforce that limit.
> >
> >>
> >>I didn't find the comment confusing.  Just the choice to let a test app 
> >>drive the design.
> >
> >Perhaps "confusing" is the wrong word.  But I did not interpreted the 
> >comment
> >as saying that the implementation was driven but a test program.
> 
> How about I reword the comment to say that while we could continue 
> allocating PDs being only constrained by system resources, the spec says 
> there is a limit so we enforce that?

I'm ok with that, Sean?

Ira

> 
> -Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 22/37] IB/rdmavt: Add queue pair data structure to rdmavt

2015-12-10 Thread Hefty, Sean
> >> +struct rvt_rwqe {
> >> +  u64 wr_id;
> >> +  u8 num_sge;
> >> +  struct ib_sge sg_list[0];
> >> +};
> >> +
> >> +/*
> >> + * This structure is used to contain the head pointer, tail pointer,
> >> + * and receive work queue entries as a single memory allocation so
> >> + * it can be mmap'ed into user space.
> >> + * Note that the wq array elements are variable size so you can't
> >> + * just index into the array to get the N'th element;
> >> + * use get_rwqe_ptr() instead.
> >
> >Can you add/use an entry_size field?
> 
> I think we could work something like that, however what we have in
> qib/hfi1
> also works.  Any reason we really should change this?

I did not check to see what the drivers do.  Using entry_size is 
straightforward, may provide the best performance, and can be done in common 
code, versus adding callbacks to all users. 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 03/37] IB/rdmavt: Add protection domain to rdmavt.

2015-12-10 Thread Hefty, Sean
> > How about I reword the comment to say that while we could continue
> > allocating PDs being only constrained by system resources, the spec says
> > there is a limit so we enforce that?
> 
> I'm ok with that, Sean?

That's fine
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next 5/6] net/mlx5_core: Warn on unsupported events of QP/RQ/SQ

2015-12-10 Thread Majd Dibbiny
When an event arrives on QP/RQ/SQ, check whether it's supported,
and WARN_ON otherwise.

Signed-off-by: Majd Dibbiny 
---
 drivers/net/ethernet/mellanox/mlx5/core/qp.c | 52 
 1 file changed, 52 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c 
b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index 27fed69..0b803a1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -68,6 +68,52 @@ void mlx5_core_put_rsc(struct mlx5_core_rsc_common *common)
complete(>free);
 }
 
+static u64 qp_allowed_event_types(void)
+{
+   u64 mask;
+
+   mask = MLX5_EVENT_TYPE_PATH_MIG |
+  MLX5_EVENT_TYPE_COMM_EST |
+  MLX5_EVENT_TYPE_SQ_DRAINED |
+  MLX5_EVENT_TYPE_SRQ_LAST_WQE |
+  MLX5_EVENT_TYPE_WQ_CATAS_ERROR |
+  MLX5_EVENT_TYPE_PATH_MIG_FAILED |
+  MLX5_EVENT_TYPE_WQ_INVAL_REQ_ERROR |
+  MLX5_EVENT_TYPE_WQ_ACCESS_ERROR;
+
+   return mask;
+}
+
+static u64 rq_allowed_event_types(void)
+{
+   u64 mask;
+
+   mask = MLX5_EVENT_TYPE_SRQ_LAST_WQE |
+  MLX5_EVENT_TYPE_WQ_CATAS_ERROR;
+
+   return mask;
+}
+
+static u64 sq_allowed_event_types(void)
+{
+   return MLX5_EVENT_TYPE_WQ_CATAS_ERROR;
+}
+
+static bool is_event_type_allowed(int rsc_type, int event_type)
+{
+   switch (rsc_type) {
+   case MLX5_EVENT_QUEUE_TYPE_QP:
+   return !!(BIT(event_type) & qp_allowed_event_types());
+   case MLX5_EVENT_QUEUE_TYPE_RQ:
+   return !!(BIT(event_type) & rq_allowed_event_types());
+   case MLX5_EVENT_QUEUE_TYPE_SQ:
+   return !!(BIT(event_type) & sq_allowed_event_types());
+   default:
+   WARN_ON(1);
+   return false;
+   }
+}
+
 void mlx5_rsc_event(struct mlx5_core_dev *dev, u32 rsn, int event_type)
 {
struct mlx5_core_rsc_common *common = mlx5_get_rsc(dev, rsn);
@@ -76,6 +122,12 @@ void mlx5_rsc_event(struct mlx5_core_dev *dev, u32 rsn, int 
event_type)
if (!common)
return;
 
+   if (!is_event_type_allowed((rsn >> 24), event_type)) {
+   mlx5_core_warn(dev, "event 0x%.2x is not allowed on resource 
0x%.8x\n",
+  event_type, rsn);
+   return;
+   }
+
switch (common->res) {
case MLX5_RES_QP:
case MLX5_RES_RQ:
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 22/37] IB/rdmavt: Add queue pair data structure to rdmavt

2015-12-10 Thread Dennis Dalessandro

On Mon, Dec 07, 2015 at 03:48:17PM -0600, Hefty, Sean wrote:

+struct rvt_rwqe {
+   u64 wr_id;
+   u8 num_sge;
+   struct ib_sge sg_list[0];
+};
+
+/*
+ * This structure is used to contain the head pointer, tail pointer,
+ * and receive work queue entries as a single memory allocation so
+ * it can be mmap'ed into user space.
+ * Note that the wq array elements are variable size so you can't
+ * just index into the array to get the N'th element;
+ * use get_rwqe_ptr() instead.


Can you add/use an entry_size field?


I think we could work something like that, however what we have in qib/hfi1 
also works.  Any reason we really should change this?



+ */
+struct rvt_rwq {
+   u32 head;   /* new work requests posted to the head */
+   u32 tail;   /* receives pull requests from here. */
+   struct rvt_rwqe wq[0];
+};
+
+struct rvt_rq {
+   struct rvt_rwq *wq;
+   u32 size;   /* size of RWQE array */
+   u8 max_sge;
+   /* protect changes in this struct */
+   spinlock_t lock cacheline_aligned_in_smp;
+};


-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/usnic: Handle 0 counts in resource allocation

2015-12-10 Thread Nelson Escobar
On 12/9/2015 10:47 PM, Leon Romanovsky wrote:
> On Wed, Dec 09, 2015 at 10:42:19AM -0800, Nelson Escobar wrote:
>> -if (usnic_vnic_res_free_cnt(vnic, type) < cnt || cnt < 1 || !owner)
>> +if (usnic_vnic_res_free_cnt(vnic, type) < cnt || cnt < 0 || !owner)
> Before this change you returned EINVAL if no free_cnt were available,
> now you will continue. is this behaviour expected?
Yes.  If cnt is 0, then no resources are being requested, so it is OK if
there are no resources available.
> 
>>  return ERR_PTR(-EINVAL);
>>  
>>  ret = kzalloc(sizeof(*ret), GFP_ATOMIC);
>> @@ -247,26 +247,28 @@ usnic_vnic_get_resources(struct usnic_vnic *vnic, enum 
>> usnic_vnic_res_type type,
>>  return ERR_PTR(-ENOMEM);
>>  }
>>  
>> -ret->res = kzalloc(sizeof(*(ret->res))*cnt, GFP_ATOMIC);
>> -if (!ret->res) {
>> -usnic_err("Failed to allocate resources for %s. Out of 
>> memory\n",
>> -usnic_vnic_pci_name(vnic));
>> -kfree(ret);
>> -return ERR_PTR(-ENOMEM);
>> -}
>> +if (cnt > 0) {
>> +ret->res = kcalloc(cnt, sizeof(*(ret->res)), GFP_ATOMIC);
>> +if (!ret->res) {
>> +usnic_err("Failed to allocate resources for %s. Out of 
>> memory\n",
>> +usnic_vnic_pci_name(vnic));
> You don't need to print OOM messages, failure in memory allocation very hard 
> to miss.
OOM messages are hard to miss, but this message is already in upstream
and outside the scope of this patch.
>> +kfree(ret);
>> +return ERR_PTR(-ENOMEM);
>> +}
>>  
>> -spin_lock(>res_lock);
>> -src = >chunks[type];
>> -for (i = 0; i < src->cnt && ret->cnt < cnt; i++) {
>> -res = src->res[i];
>> -if (!res->owner) {
>> -src->free_cnt--;
>> -res->owner = owner;
>> -ret->res[ret->cnt++] = res;
>> +spin_lock(>res_lock);
>> +src = >chunks[type];
>> +for (i = 0; i < src->cnt && ret->cnt < cnt; i++) {
>> +res = src->res[i];
>> +if (!res->owner) {
>> +src->free_cnt--;
> It will be negative, because of skip usnic_vnic_res_free_cnt check
> before.
We are inside the 'if (cnt > 0)' clause here, so the previous
usnic_vnic_res_free_cnt check wasn't skipped.
>> +res->owner = owner;
>> +ret->res[ret->cnt++] = res;
>> +}
>>  }
>> -}
>>  
>> -spin_unlock(>res_lock);
>> +spin_unlock(>res_lock);
>> +}
>>  ret->type = type;
>>  ret->vnic = vnic;
>>  WARN_ON(ret->cnt != cnt);
>> @@ -281,14 +283,16 @@ void usnic_vnic_put_resources(struct 
>> usnic_vnic_res_chunk *chunk)
>>  int i;
>>  struct usnic_vnic *vnic = chunk->vnic;
>>  
>> -spin_lock(>res_lock);
>> -while ((i = --chunk->cnt) >= 0) {
>> -res = chunk->res[i];
>> -chunk->res[i] = NULL;
>> -res->owner = NULL;
>> -vnic->chunks[res->type].free_cnt++;
>> +if (chunk->cnt > 0) {
>> +spin_lock(>res_lock);
>> +while ((i = --chunk->cnt) >= 0) {
>> +res = chunk->res[i];
>> +chunk->res[i] = NULL;
>> +res->owner = NULL;
>> +vnic->chunks[res->type].free_cnt++;
>> +}
>> +spin_unlock(>res_lock);
>>  }
>> -spin_unlock(>res_lock);
>>  
>>  kfree(chunk->res);
>>  kfree(chunk);
>> -- 
>> 2.4.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/usnic: Handle 0 counts in resource allocation

2015-12-10 Thread l...@leon.nu
On Thu, Dec 10, 2015 at 11:22:24AM -0800, Nelson Escobar wrote:
> On 12/9/2015 10:47 PM, Leon Romanovsky wrote:
> > On Wed, Dec 09, 2015 at 10:42:19AM -0800, Nelson Escobar wrote:
> >> -  if (usnic_vnic_res_free_cnt(vnic, type) < cnt || cnt < 1 || !owner)
> >> +  if (usnic_vnic_res_free_cnt(vnic, type) < cnt || cnt < 0 || !owner)
> > Before this change you returned EINVAL if no free_cnt were available,
> > now you will continue. is this behaviour expected?
> Yes.  If cnt is 0, then no resources are being requested, so it is OK if
> there are no resources available.
I afraid that you missed the point.
Old code:
usnic_vnic_res_free_cnt(vnic, type) == 0 and cnt == 1 will return EINVAL
New code
snic_vnic_res_free_cnt(vnic, type) == 0 and cnt == 1 will pass and will
pass te "if (cnt > 0)" check below and will decrease free_cnt variable
to be below zero.

Is this behavior expected?
> > 
> >>return ERR_PTR(-EINVAL);
> >>  
> >>ret = kzalloc(sizeof(*ret), GFP_ATOMIC);
> >> @@ -247,26 +247,28 @@ usnic_vnic_get_resources(struct usnic_vnic *vnic, 
> >> enum usnic_vnic_res_type type,
> >>return ERR_PTR(-ENOMEM);
> >>}
> >>  
> >> -  ret->res = kzalloc(sizeof(*(ret->res))*cnt, GFP_ATOMIC);
> >> -  if (!ret->res) {
> >> -  usnic_err("Failed to allocate resources for %s. Out of 
> >> memory\n",
> >> -  usnic_vnic_pci_name(vnic));
> >> -  kfree(ret);
> >> -  return ERR_PTR(-ENOMEM);
> >> -  }
> >> +  if (cnt > 0) {
> >> +  ret->res = kcalloc(cnt, sizeof(*(ret->res)), GFP_ATOMIC);
> >> +  if (!ret->res) {
> >> +  usnic_err("Failed to allocate resources for %s. Out of 
> >> memory\n",
> >> +  usnic_vnic_pci_name(vnic));
> > You don't need to print OOM messages, failure in memory allocation very 
> > hard to miss.
> OOM messages are hard to miss, but this message is already in upstream
> and outside the scope of this patch.
It is worth to fix, especially if you are changing these exact lines.
> >> +  kfree(ret);
> >> +  return ERR_PTR(-ENOMEM);
> >> +  }
> >>  
> >> -  spin_lock(>res_lock);
> >> -  src = >chunks[type];
> >> -  for (i = 0; i < src->cnt && ret->cnt < cnt; i++) {
> >> -  res = src->res[i];
> >> -  if (!res->owner) {
> >> -  src->free_cnt--;
> >> -  res->owner = owner;
> >> -  ret->res[ret->cnt++] = res;
> >> +  spin_lock(>res_lock);
> >> +  src = >chunks[type];
> >> +  for (i = 0; i < src->cnt && ret->cnt < cnt; i++) {
> >> +  res = src->res[i];
> >> +  if (!res->owner) {
> >> +  src->free_cnt--;
> > It will be negative, because of skip usnic_vnic_res_free_cnt check
> > before.
> We are inside the 'if (cnt > 0)' clause here, so the previous
> usnic_vnic_res_free_cnt check wasn't skipped.
See above.

> >> +  res->owner = owner;
> >> +  ret->res[ret->cnt++] = res;
> >> +  }
> >>}
> >> -  }
> >>  
> >> -  spin_unlock(>res_lock);
> >> +  spin_unlock(>res_lock);
> >> +  }
> >>ret->type = type;
> >>ret->vnic = vnic;
> >>WARN_ON(ret->cnt != cnt);
> >> @@ -281,14 +283,16 @@ void usnic_vnic_put_resources(struct 
> >> usnic_vnic_res_chunk *chunk)
> >>int i;
> >>struct usnic_vnic *vnic = chunk->vnic;
> >>  
> >> -  spin_lock(>res_lock);
> >> -  while ((i = --chunk->cnt) >= 0) {
> >> -  res = chunk->res[i];
> >> -  chunk->res[i] = NULL;
> >> -  res->owner = NULL;
> >> -  vnic->chunks[res->type].free_cnt++;
> >> +  if (chunk->cnt > 0) {
> >> +  spin_lock(>res_lock);
> >> +  while ((i = --chunk->cnt) >= 0) {
> >> +  res = chunk->res[i];
> >> +  chunk->res[i] = NULL;
> >> +  res->owner = NULL;
> >> +  vnic->chunks[res->type].free_cnt++;
> >> +  }
> >> +  spin_unlock(>res_lock);
> >>}
> >> -  spin_unlock(>res_lock);
> >>  
> >>kfree(chunk->res);
> >>kfree(chunk);
> >> -- 
> >> 2.4.3
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> >> the body of a message to majord...@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/37] IB/rdmavt: Consolidate dma ops in rdmavt.

2015-12-10 Thread Dennis Dalessandro

On Thu, Dec 10, 2015 at 07:52:31PM +0200, Leon Romanovsky wrote:

On Thu, Dec 10, 2015 at 11:17:09AM -0500, Dennis Dalessandro wrote:

On Tue, Dec 08, 2015 at 08:08:21AM +0200, Leon Romanovsky wrote:
>On Mon, Dec 07, 2015 at 03:43:06PM -0500, Dennis Dalessandro wrote:
>>+
>>+#define BAD_DMA_ADDRESS ((u64)0)
>What is the advantage in using directly u64 values instead of
>pointers? You will get NULL and functions which return pointers
>without need of casting.
>
>...
>>+static u64 rvt_dma_map_single(struct ib_device *dev, void *cpu_addr,
>>+ size_t size, enum dma_data_direction direction)
>>+{
>>+   if (WARN_ON(!valid_dma_direction(direction)))
>>+   return BAD_DMA_ADDRESS;
>>+
>>+   return (u64)cpu_addr;
>>+}
>An example of such function.

Honestly I'm not really sure why it's done this way. We are just following
the signature of the function in struct ib_dma_mapping_ops.


Since that would be a core change it's beyond the scope of this patch series 
I think, but something that could certainly be done. I just wouldn't want to 
tie that to rdmavt.


-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next 6/6] IB/mlx5: Add Raw Packet Queue Pair (QP) support

2015-12-10 Thread Majd Dibbiny
This patchs adds support for Raw Packet QP for the mlx5 device.

Raw Packet QP, unlike other QP types, has no matching mlx5_core_qp
object but rather it is built of RQ/SQ/TIR/TIS/TD mlx5_core object.

The Raw Packet QP state changes are implemented by changing the
state of the sub-objects.

Since the SQ and RQ work-queue (WQ) buffers are not contiguous like
other QPs, we allocate separate buffers in the user-space and pass
the address of each one of them separately to the kernel.

Signed-off-by: Majd Dibbiny 
---
 drivers/infiniband/hw/mlx5/main.c  |  15 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h   |  52 +-
 drivers/infiniband/hw/mlx5/qp.c| 894 ++---
 drivers/infiniband/hw/mlx5/user.h  |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/qp.c   |  48 +-
 drivers/net/ethernet/mellanox/mlx5/core/transobj.c |  39 +
 include/linux/mlx5/mlx5_ifc.h  |  11 +-
 include/linux/mlx5/qp.h|   3 +-
 include/linux/mlx5/transobj.h  |   4 +
 9 files changed, 913 insertions(+), 154 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 1b503e0..de533b9 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -680,6 +680,12 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
context->ibucontext.invalidate_range = _ib_invalidate_range;
 #endif
 
+   if (MLX5_CAP_GEN(dev->mdev, log_max_transport_domain)) {
+   err = mlx5_alloc_transport_domain(dev->mdev, >tdn);
+   if (err)
+   goto out_uars;
+   }
+
INIT_LIST_HEAD(>db_page_list);
mutex_init(>db_page_mutex);
 
@@ -688,7 +694,7 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
err = ib_copy_to_udata(udata, ,
   min(sizeof(resp), udata->outlen));
if (err)
-   goto out_uars;
+   goto out_td;
 
uuari->ver = ver;
uuari->num_low_latency_uuars = req.num_low_latency_uuars;
@@ -696,6 +702,10 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
uuari->num_uars = num_uars;
return >ibucontext;
 
+out_td:
+   if (MLX5_CAP_GEN(dev->mdev, log_max_transport_domain))
+   mlx5_dealloc_transport_domain(dev->mdev, context->tdn);
+
 out_uars:
for (i--; i >= 0; i--)
mlx5_cmd_free_uar(dev->mdev, uars[i].index);
@@ -720,6 +730,9 @@ static int mlx5_ib_dealloc_ucontext(struct ib_ucontext 
*ibcontext)
struct mlx5_uuar_info *uuari = >uuari;
int i;
 
+   if (MLX5_CAP_GEN(dev->mdev, log_max_transport_domain))
+   mlx5_dealloc_transport_domain(dev->mdev, context->tdn);
+
for (i = 0; i < uuari->num_uars; i++) {
if (mlx5_cmd_free_uar(dev->mdev, uuari->uars[i].index))
mlx5_ib_warn(dev, "failed to free UAR 0x%x\n", 
uuari->uars[i].index);
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index e1c46c6..dbdf331 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define mlx5_ib_dbg(dev, format, arg...)   \
 pr_debug("%s:%s:%d:(pid %d): " format, (dev)->ib_dev.name, __func__,   \
@@ -95,6 +96,8 @@ struct mlx5_ib_ucontext {
 */
struct mutexdb_page_mutex;
struct mlx5_uuar_info   uuari;
+   /* Transport Domain number */
+   u32 tdn;
 };
 
 static inline struct mlx5_ib_ucontext *to_mucontext(struct ib_ucontext 
*ibucontext)
@@ -173,23 +176,52 @@ struct mlx5_ib_pfault {
struct mlx5_pagefault   mpfault;
 };
 
+struct mlx5_ib_ubuffer {
+   struct ib_umem *umem;
+   int buf_size;
+   u64 buf_addr;
+};
+
+struct mlx5_ib_rq {
+   struct mlx5_ib_wq   *rq;
+   struct mlx5_core_qp mrq;
+   struct mlx5_ib_ubuffer  ubuffer;
+   struct mlx5_db  *doorbell;
+   u32 tirn;
+   u8  state;
+};
+
+struct mlx5_ib_sq {
+   struct mlx5_ib_wq   *sq;
+   struct mlx5_core_qp msq;
+   struct mlx5_ib_ubuffer  ubuffer;
+   struct mlx5_db  *doorbell;
+   u32 tisn;
+   u8  state;
+};
+
+struct mlx5_ib_raw_packet_qp {
+   struct mlx5_ib_sq sq;
+   struct mlx5_ib_rq rq;
+};
+
 struct mlx5_ib_qp {
struct ib_qpibqp;
-   struct mlx5_core_qp mqp;
-   struct mlx5_buf buf;
+   union {
+   struct mlx5_core_qp mqp;
+   struct mlx5_ib_raw_packet_qpraw_packet_qp;
+   };
 
-   struct 

[PATCH for-next 1/6] IB/mlx5: Add CQE version 1 support to user space

2015-12-10 Thread Majd Dibbiny
From: Haggai Abramonvsky 

This patch prepares the infrastructure to work with CQE version 1
if the user-space works this way, otherwise work with CQE version 0.

After this patch the kernel still reports CQE version 0.

Signed-off-by: Haggai Abramovsky 
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h  |  2 ++
 drivers/infiniband/hw/mlx5/qp.c   | 15 +
 drivers/infiniband/hw/mlx5/srq.c  | 31 +--
 drivers/infiniband/hw/mlx5/user.h |  4 
 drivers/net/ethernet/mellanox/mlx5/core/qp.c  |  7 --
 drivers/net/ethernet/mellanox/mlx5/core/srq.c |  2 --
 6 files changed, 46 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 6333472..e1c46c6 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -55,6 +55,8 @@ pr_err("%s:%s:%d:(pid %d): " format, (dev)->ib_dev.name, 
__func__,\
 pr_warn("%s:%s:%d:(pid %d): " format, (dev)->ib_dev.name, __func__,\
__LINE__, current->pid, ##arg)
 
+#define MLX5_IB_DEFAULT_UIDX 0xff
+
 enum {
MLX5_IB_MMAP_CMD_SHIFT  = 8,
MLX5_IB_MMAP_CMD_MASK   = 0xff,
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 307bdbc..e0be8b3 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -32,6 +32,7 @@
 
 #include 
 #include 
+#include 
 #include "mlx5_ib.h"
 #include "user.h"
 
@@ -864,6 +865,8 @@ static int create_qp_common(struct mlx5_ib_dev *dev, struct 
ib_pd *pd,
struct mlx5_ib_create_qp ucmd;
int inlen = sizeof(*in);
int err;
+   u32 uidx;
+   void *qpc;
 
mlx5_ib_odp_create_qp(qp);
 
@@ -889,10 +892,16 @@ static int create_qp_common(struct mlx5_ib_dev *dev, 
struct ib_pd *pd,
return -EFAULT;
}
 
+   if (udata->inlen > offsetof(struct mlx5_ib_create_qp, uidx))
+   uidx = ucmd.uidx;
+   else
+   uidx = MLX5_IB_DEFAULT_UIDX;
+
qp->wq_sig = !!(ucmd.flags & MLX5_QP_FLAG_SIGNATURE);
qp->scat_cqe = !!(ucmd.flags & MLX5_QP_FLAG_SCATTER_CQE);
} else {
qp->wq_sig = !!wq_signature;
+   uidx = MLX5_IB_DEFAULT_UIDX;
}
 
qp->has_rq = qp_has_rq(init_attr);
@@ -1018,6 +1027,12 @@ static int create_qp_common(struct mlx5_ib_dev *dev, 
struct ib_pd *pd,
 
in->ctx.db_rec_addr = cpu_to_be64(qp->db.dma);
 
+   if (MLX5_CAP_GEN(mdev, cqe_version) == 1) {
+   qpc = MLX5_ADDR_OF(create_qp_in, in, qpc);
+   /* 0xff means we ask to work with cqe version 0 */
+   MLX5_SET(qpc, qpc, user_index, uidx);
+   }
+
err = mlx5_core_create_qp(dev->mdev, >mqp, in, inlen);
if (err) {
mlx5_ib_dbg(dev, "create qp failed\n");
diff --git a/drivers/infiniband/hw/mlx5/srq.c b/drivers/infiniband/hw/mlx5/srq.c
index e008505..07e1a39 100644
--- a/drivers/infiniband/hw/mlx5/srq.c
+++ b/drivers/infiniband/hw/mlx5/srq.c
@@ -80,24 +80,24 @@ static int create_srq_user(struct ib_pd *pd, struct 
mlx5_ib_srq *srq,
struct mlx5_ib_dev *dev = to_mdev(pd->device);
struct mlx5_ib_create_srq ucmd;
size_t ucmdlen;
+   void *xsrqc;
int err;
int npages;
int page_shift;
int ncont;
u32 offset;
+   int drv_data = udata->inlen - sizeof(struct ib_uverbs_cmd_hdr);
 
-   ucmdlen =
-   (udata->inlen - sizeof(struct ib_uverbs_cmd_hdr) <
-sizeof(ucmd)) ? (sizeof(ucmd) -
- sizeof(ucmd.reserved)) : sizeof(ucmd);
+   ucmdlen = (drv_data < sizeof(ucmd)) ?
+ drv_data : sizeof(ucmd);
 
+   memset(, 0, sizeof(ucmd));
if (ib_copy_from_udata(, udata, ucmdlen)) {
mlx5_ib_dbg(dev, "failed copy udata\n");
return -EFAULT;
}
 
-   if (ucmdlen == sizeof(ucmd) &&
-   ucmd.reserved != 0)
+   if (ucmd.reserved || ucmd.reserved1)
return -EINVAL;
 
srq->wq_sig = !!(ucmd.flags & MLX5_SRQ_FLAG_SIGNATURE);
@@ -138,6 +138,17 @@ static int create_srq_user(struct ib_pd *pd, struct 
mlx5_ib_srq *srq,
(*in)->ctx.log_pg_sz = page_shift - MLX5_ADAPTER_PAGE_SHIFT;
(*in)->ctx.pgoff_cqn = cpu_to_be32(offset << 26);
 
+   if (MLX5_CAP_GEN(dev->mdev, cqe_version) == 1) {
+   xsrqc = MLX5_ADDR_OF(create_xrc_srq_in, *in,
+xrc_srq_context_entry);
+   /* 0xff means we ask to work with cqe version 0 */
+   if (drv_data > offsetof(struct mlx5_ib_create_srq, uidx))
+   MLX5_SET(xrc_srqc, xsrqc, user_index, ucmd.uidx);
+   else
+   MLX5_SET(xrc_srqc, xsrqc, user_index,
+ 

[PATCH] IB/mlx5: Add Raw Packet Queue Pair (QP) support

2015-12-10 Thread Majd Dibbiny
This patchs adds support for Raw Packet QP for the mlx5 device.

Raw Packet QP, unlike other QP types, has no matching mlx5_core_qp
object but rather it is built of RQ/SQ/TIR/TIS/TD mlx5_core object.

The Raw Packet QP state changes are implemented by changing the
state of the sub-objects.

Since the SQ and RQ work-queue (WQ) buffers are not contiguous like
other QPs, we allocate separate buffers in the user-space and pass
the address of each one of them separately to the kernel.

Signed-off-by: Majd Dibbiny 
---
 drivers/infiniband/hw/mlx5/main.c  |  15 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h   |  52 +-
 drivers/infiniband/hw/mlx5/qp.c| 894 ++---
 drivers/infiniband/hw/mlx5/user.h  |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/qp.c   |  48 +-
 drivers/net/ethernet/mellanox/mlx5/core/transobj.c |  39 +
 include/linux/mlx5/mlx5_ifc.h  |  11 +-
 include/linux/mlx5/qp.h|   3 +-
 include/linux/mlx5/transobj.h  |   4 +
 9 files changed, 913 insertions(+), 154 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 1b503e0..de533b9 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -680,6 +680,12 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
context->ibucontext.invalidate_range = _ib_invalidate_range;
 #endif
 
+   if (MLX5_CAP_GEN(dev->mdev, log_max_transport_domain)) {
+   err = mlx5_alloc_transport_domain(dev->mdev, >tdn);
+   if (err)
+   goto out_uars;
+   }
+
INIT_LIST_HEAD(>db_page_list);
mutex_init(>db_page_mutex);
 
@@ -688,7 +694,7 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
err = ib_copy_to_udata(udata, ,
   min(sizeof(resp), udata->outlen));
if (err)
-   goto out_uars;
+   goto out_td;
 
uuari->ver = ver;
uuari->num_low_latency_uuars = req.num_low_latency_uuars;
@@ -696,6 +702,10 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
uuari->num_uars = num_uars;
return >ibucontext;
 
+out_td:
+   if (MLX5_CAP_GEN(dev->mdev, log_max_transport_domain))
+   mlx5_dealloc_transport_domain(dev->mdev, context->tdn);
+
 out_uars:
for (i--; i >= 0; i--)
mlx5_cmd_free_uar(dev->mdev, uars[i].index);
@@ -720,6 +730,9 @@ static int mlx5_ib_dealloc_ucontext(struct ib_ucontext 
*ibcontext)
struct mlx5_uuar_info *uuari = >uuari;
int i;
 
+   if (MLX5_CAP_GEN(dev->mdev, log_max_transport_domain))
+   mlx5_dealloc_transport_domain(dev->mdev, context->tdn);
+
for (i = 0; i < uuari->num_uars; i++) {
if (mlx5_cmd_free_uar(dev->mdev, uuari->uars[i].index))
mlx5_ib_warn(dev, "failed to free UAR 0x%x\n", 
uuari->uars[i].index);
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index e1c46c6..dbdf331 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define mlx5_ib_dbg(dev, format, arg...)   \
 pr_debug("%s:%s:%d:(pid %d): " format, (dev)->ib_dev.name, __func__,   \
@@ -95,6 +96,8 @@ struct mlx5_ib_ucontext {
 */
struct mutexdb_page_mutex;
struct mlx5_uuar_info   uuari;
+   /* Transport Domain number */
+   u32 tdn;
 };
 
 static inline struct mlx5_ib_ucontext *to_mucontext(struct ib_ucontext 
*ibucontext)
@@ -173,23 +176,52 @@ struct mlx5_ib_pfault {
struct mlx5_pagefault   mpfault;
 };
 
+struct mlx5_ib_ubuffer {
+   struct ib_umem *umem;
+   int buf_size;
+   u64 buf_addr;
+};
+
+struct mlx5_ib_rq {
+   struct mlx5_ib_wq   *rq;
+   struct mlx5_core_qp mrq;
+   struct mlx5_ib_ubuffer  ubuffer;
+   struct mlx5_db  *doorbell;
+   u32 tirn;
+   u8  state;
+};
+
+struct mlx5_ib_sq {
+   struct mlx5_ib_wq   *sq;
+   struct mlx5_core_qp msq;
+   struct mlx5_ib_ubuffer  ubuffer;
+   struct mlx5_db  *doorbell;
+   u32 tisn;
+   u8  state;
+};
+
+struct mlx5_ib_raw_packet_qp {
+   struct mlx5_ib_sq sq;
+   struct mlx5_ib_rq rq;
+};
+
 struct mlx5_ib_qp {
struct ib_qpibqp;
-   struct mlx5_core_qp mqp;
-   struct mlx5_buf buf;
+   union {
+   struct mlx5_core_qp mqp;
+   struct mlx5_ib_raw_packet_qpraw_packet_qp;
+   };
 
-   struct 

[PATCH for-next 3/6] net/mlx5_core: Export transport objects

2015-12-10 Thread Majd Dibbiny
To be used by mlx5_ib in the following patches for implementing
RAW PACKET QP.

Signed-off-by: Majd Dibbiny 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/srq.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/transobj.c | 10 ++-
 drivers/net/ethernet/mellanox/mlx5/core/transobj.h | 72 -
 include/linux/mlx5/transobj.h  | 74 ++
 5 files changed, 85 insertions(+), 75 deletions(-)
 delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/transobj.h
 create mode 100644 include/linux/mlx5/transobj.h

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 0983a20..47b2811 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -36,8 +36,8 @@
 #include 
 #include 
 #include 
+#include 
 #include "wq.h"
-#include "transobj.h"
 #include "mlx5_core.h"
 
 #define MLX5E_MAX_NUM_TC   8
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/srq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/srq.c
index 7504d11..9474343 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/srq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/srq.c
@@ -37,7 +37,7 @@
 #include 
 #include 
 #include "mlx5_core.h"
-#include "transobj.h"
+#include 
 
 void mlx5_srq_event(struct mlx5_core_dev *dev, u32 srqn, int event_type)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/transobj.c 
b/drivers/net/ethernet/mellanox/mlx5/core/transobj.c
index b4c87c7..7523cb1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/transobj.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/transobj.c
@@ -32,7 +32,7 @@
 
 #include 
 #include "mlx5_core.h"
-#include "transobj.h"
+#include 
 
 int mlx5_alloc_transport_domain(struct mlx5_core_dev *dev, u32 *tdn)
 {
@@ -53,6 +53,7 @@ int mlx5_alloc_transport_domain(struct mlx5_core_dev *dev, 
u32 *tdn)
 
return err;
 }
+EXPORT_SYMBOL(mlx5_alloc_transport_domain);
 
 void mlx5_dealloc_transport_domain(struct mlx5_core_dev *dev, u32 tdn)
 {
@@ -68,6 +69,7 @@ void mlx5_dealloc_transport_domain(struct mlx5_core_dev *dev, 
u32 tdn)
 
mlx5_cmd_exec_check_status(dev, in, sizeof(in), out, sizeof(out));
 }
+EXPORT_SYMBOL(mlx5_dealloc_transport_domain);
 
 int mlx5_core_create_rq(struct mlx5_core_dev *dev, u32 *in, int inlen, u32 
*rqn)
 {
@@ -94,6 +96,7 @@ int mlx5_core_modify_rq(struct mlx5_core_dev *dev, u32 rqn, 
u32 *in, int inlen)
memset(out, 0, sizeof(out));
return mlx5_cmd_exec_check_status(dev, in, inlen, out, sizeof(out));
 }
+EXPORT_SYMBOL(mlx5_core_modify_rq);
 
 void mlx5_core_destroy_rq(struct mlx5_core_dev *dev, u32 rqn)
 {
@@ -133,6 +136,7 @@ int mlx5_core_modify_sq(struct mlx5_core_dev *dev, u32 sqn, 
u32 *in, int inlen)
memset(out, 0, sizeof(out));
return mlx5_cmd_exec_check_status(dev, in, inlen, out, sizeof(out));
 }
+EXPORT_SYMBOL(mlx5_core_modify_sq);
 
 void mlx5_core_destroy_sq(struct mlx5_core_dev *dev, u32 sqn)
 {
@@ -162,6 +166,7 @@ int mlx5_core_create_tir(struct mlx5_core_dev *dev, u32 
*in, int inlen,
 
return err;
 }
+EXPORT_SYMBOL(mlx5_core_create_tir);
 
 int mlx5_core_modify_tir(struct mlx5_core_dev *dev, u32 tirn, u32 *in,
 int inlen)
@@ -187,6 +192,7 @@ void mlx5_core_destroy_tir(struct mlx5_core_dev *dev, u32 
tirn)
 
mlx5_cmd_exec_check_status(dev, in, sizeof(in), out, sizeof(out));
 }
+EXPORT_SYMBOL(mlx5_core_destroy_tir);
 
 int mlx5_core_create_tis(struct mlx5_core_dev *dev, u32 *in, int inlen,
 u32 *tisn)
@@ -203,6 +209,7 @@ int mlx5_core_create_tis(struct mlx5_core_dev *dev, u32 
*in, int inlen,
 
return err;
 }
+EXPORT_SYMBOL(mlx5_core_create_tis);
 
 void mlx5_core_destroy_tis(struct mlx5_core_dev *dev, u32 tisn)
 {
@@ -216,6 +223,7 @@ void mlx5_core_destroy_tis(struct mlx5_core_dev *dev, u32 
tisn)
 
mlx5_cmd_exec_check_status(dev, in, sizeof(in), out, sizeof(out));
 }
+EXPORT_SYMBOL(mlx5_core_destroy_tis);
 
 int mlx5_core_create_rmp(struct mlx5_core_dev *dev, u32 *in, int inlen,
 u32 *rmpn)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/transobj.h 
b/drivers/net/ethernet/mellanox/mlx5/core/transobj.h
deleted file mode 100644
index 74cae51..000
--- a/drivers/net/ethernet/mellanox/mlx5/core/transobj.h
+++ /dev/null
@@ -1,72 +0,0 @@
-/*
- * Copyright (c) 2013-2015, Mellanox Technologies, Ltd.  All rights reserved.
- *
- * This software is available to you under a choice of one of two
- * licenses.  You may choose to be licensed under the terms of the GNU
- * General Public License (GPL) Version 2, available from the file
- * COPYING in the main directory of this source tree, or the
- * OpenIB.org BSD license below:
- *
- * Redistribution and use in source and binary forms, with or
- * without modification, are permitted provided that the following
- * 

[PATCH for-next 4/6] net/mlx5_core: Add RQ and SQ event handling

2015-12-10 Thread Majd Dibbiny
RQ/SQ will be used to implement IB verbs QPs, so the IB QP affiliated
events are affiliated also with SQs and RQs.

Since SQ, RQ and QP resource numbers do not share the same name
space, a queue type field was added to the event data to specify
the SW object that the event is affiliated with.

Signed-off-by: Majd Dibbiny 
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/qp.c | 123 ++-
 include/linux/mlx5/device.h  |  10 ++-
 include/linux/mlx5/driver.h  |   8 +-
 include/linux/mlx5/qp.h  |   8 ++
 5 files changed, 127 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index a40b96d..a62c4bf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -227,6 +227,7 @@ static int mlx5_eq_int(struct mlx5_core_dev *dev, struct 
mlx5_eq *eq)
case MLX5_EVENT_TYPE_WQ_INVAL_REQ_ERROR:
case MLX5_EVENT_TYPE_WQ_ACCESS_ERROR:
rsn = be32_to_cpu(eqe->data.qp_srq.qp_srq_n) & 0xff;
+   rsn |= (eqe->data.qp_srq.type << 24);
mlx5_core_dbg(dev, "event %s(%d) arrived on resource 
0x%x\n",
  eqe_type_str(eqe->type), eqe->type, rsn);
mlx5_rsc_event(dev, rsn, eqe->type);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c 
b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index dc7dbf7..27fed69 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "mlx5_core.h"
 
@@ -77,6 +78,8 @@ void mlx5_rsc_event(struct mlx5_core_dev *dev, u32 rsn, int 
event_type)
 
switch (common->res) {
case MLX5_RES_QP:
+   case MLX5_RES_RQ:
+   case MLX5_RES_SQ:
qp = (struct mlx5_core_qp *)common;
qp->event(qp, event_type);
break;
@@ -177,12 +180,45 @@ void mlx5_eq_pagefault(struct mlx5_core_dev *dev, struct 
mlx5_eqe *eqe)
 }
 #endif
 
+static int create_qprqsq_common(struct mlx5_core_dev *dev,
+   struct mlx5_core_qp *qp,
+   int rsc_type)
+{
+   struct mlx5_qp_table *table = >priv.qp_table;
+   int err;
+
+   qp->common.res = rsc_type;
+   spin_lock_irq(>lock);
+   err = radix_tree_insert(>tree, qp->qpn | (rsc_type << 24), qp);
+   spin_unlock_irq(>lock);
+   if (err)
+   return err;
+
+   atomic_set(>common.refcount, 1);
+   init_completion(>common.free);
+   qp->pid = current->pid;
+
+   return 0;
+}
+
+static void destroy_qprqsq_common(struct mlx5_core_dev *dev,
+ struct mlx5_core_qp *qp)
+{
+   struct mlx5_qp_table *table = >priv.qp_table;
+   unsigned long flags;
+
+   spin_lock_irqsave(>lock, flags);
+   radix_tree_delete(>tree, qp->qpn | (qp->common.res << 24));
+   spin_unlock_irqrestore(>lock, flags);
+   mlx5_core_put_rsc((struct mlx5_core_rsc_common *)qp);
+   wait_for_completion(>common.free);
+}
+
 int mlx5_core_create_qp(struct mlx5_core_dev *dev,
struct mlx5_core_qp *qp,
struct mlx5_create_qp_mbox_in *in,
int inlen)
 {
-   struct mlx5_qp_table *table = >priv.qp_table;
struct mlx5_create_qp_mbox_out out;
struct mlx5_destroy_qp_mbox_in din;
struct mlx5_destroy_qp_mbox_out dout;
@@ -206,24 +242,16 @@ int mlx5_core_create_qp(struct mlx5_core_dev *dev,
qp->qpn = be32_to_cpu(out.qpn) & 0xff;
mlx5_core_dbg(dev, "qpn = 0x%x\n", qp->qpn);
 
-   qp->common.res = MLX5_RES_QP;
-   spin_lock_irq(>lock);
-   err = radix_tree_insert(>tree, qp->qpn, qp);
-   spin_unlock_irq(>lock);
-   if (err) {
-   mlx5_core_warn(dev, "err %d\n", err);
+   err = create_qprqsq_common(dev, qp, MLX5_RES_QP);
+   if (err)
goto err_cmd;
-   }
 
err = mlx5_debug_qp_add(dev, qp);
if (err)
mlx5_core_dbg(dev, "failed adding QP 0x%x to debug file 
system\n",
  qp->qpn);
 
-   qp->pid = current->pid;
-   atomic_set(>common.refcount, 1);
atomic_inc(>num_qps);
-   init_completion(>common.free);
 
return 0;
 
@@ -243,18 +271,11 @@ int mlx5_core_destroy_qp(struct mlx5_core_dev *dev,
 {
struct mlx5_destroy_qp_mbox_in in;
struct mlx5_destroy_qp_mbox_out out;
-   struct mlx5_qp_table *table = >priv.qp_table;
-   unsigned long flags;
int err;
 
mlx5_debug_qp_remove(dev, qp);
 
-   spin_lock_irqsave(>lock, flags);
-   radix_tree_delete(>tree, qp->qpn);
-   

[PATCH libmlx5] Add Raw Packet Queue Pair (QP) support

2015-12-10 Thread Majd Dibbiny
1. Add support for RAW Packet WQEs in the  post send.
2. Allocate different buffers for RQ and SQ to ensure
   alignment of the SQ buffer.

Signed-off-by: Majd Dibbiny 
---

Hi Eli,

This patch adds support for Raw Packet QP in libmlx5.

Raw Packet QP enables the user to send and receive raw packets. The user is
responsible of building the packet including the headers.

Since the SQ and RQ work-queue (WQ) buffers are not contiguous like
other QPs, we allocate separate buffers and pass them to the kernel
driver.

The added support in post send includes building the WQE according
to the hardware requirements.

This patch depends on "Support CQE versions" series.

Regards,
Majd

 src/mlx5-abi.h |   2 ++
 src/mlx5.h |   4 +++
 src/qp.c   | 100 +
 src/verbs.c|  73 +++--
 src/wqe.h  |  25 +++
 5 files changed, 180 insertions(+), 24 deletions(-)

diff --git a/src/mlx5-abi.h b/src/mlx5-abi.h
index 21f576e..e68a328 100644
--- a/src/mlx5-abi.h
+++ b/src/mlx5-abi.h
@@ -124,6 +124,8 @@ struct mlx5_create_qp {
__u32   flags;
__u32   uidx;
__u32   reserved;
+   /* SQ buffer address - used for Raw Packet QP */
+   __u64   sq_buf_addr;
 };
 
 struct mlx5_create_qp_resp {
diff --git a/src/mlx5.h b/src/mlx5.h
index 9561417..0b06a1f 100644
--- a/src/mlx5.h
+++ b/src/mlx5.h
@@ -428,8 +428,12 @@ struct mlx5_qp {
struct verbs_qp verbs_qp;
struct ibv_qp   ibv_qp;
struct mlx5_buf buf;
+   void*sqstart;
int max_inline_data;
int buf_size;
+   /* For Raw Packet QP, use different buffers for the SQ and RQ */
+   struct mlx5_buf sq_buf;
+   int sq_buf_size;
struct mlx5_bf *bf;
 
uint8_t sq_signal_bits;
diff --git a/src/qp.c b/src/qp.c
index eeb0c92..5ff1f00 100644
--- a/src/qp.c
+++ b/src/qp.c
@@ -145,7 +145,7 @@ int mlx5_copy_to_send_wqe(struct mlx5_qp *qp, int idx, void 
*buf, int size)
 
 void *mlx5_get_send_wqe(struct mlx5_qp *qp, int n)
 {
-   return qp->buf.buf + qp->sq.offset + (n << MLX5_SEND_WQE_SHIFT);
+   return qp->sqstart + (n << MLX5_SEND_WQE_SHIFT);
 }
 
 void mlx5_init_qp_indices(struct mlx5_qp *qp)
@@ -188,11 +188,12 @@ static void set_datagram_seg(struct mlx5_wqe_datagram_seg 
*dseg,
dseg->av.key.qkey.qkey = htonl(wr->wr.ud.remote_qkey);
 }
 
-static void set_data_ptr_seg(struct mlx5_wqe_data_seg *dseg, struct ibv_sge 
*sg)
+static void set_data_ptr_seg(struct mlx5_wqe_data_seg *dseg, struct ibv_sge 
*sg,
+int offset)
 {
-   dseg->byte_count = htonl(sg->length);
+   dseg->byte_count = htonl(sg->length - offset);
dseg->lkey   = htonl(sg->lkey);
-   dseg->addr   = htonll(sg->addr);
+   dseg->addr   = htonll(sg->addr + offset);
 }
 
 /*
@@ -214,7 +215,7 @@ static void mlx5_bf_copy(unsigned long long *dst, unsigned 
long long *src,
*dst++ = *src++;
bytecnt -= 8 * sizeof(unsigned long long);
if (unlikely(src == qp->sq.qend))
-   src = qp->buf.buf + qp->sq.offset;
+   src = qp->sqstart;
}
 }
 
@@ -230,7 +231,8 @@ static uint32_t send_ieth(struct ibv_send_wr *wr)
 }
 
 static int set_data_inl_seg(struct mlx5_qp *qp, struct ibv_send_wr *wr,
-   void *wqe, int *sz)
+   void *wqe, int *sz,
+   struct mlx5_sg_copy_ptr *sg_copy_ptr)
 {
struct mlx5_wqe_inline_seg *seg;
void *addr;
@@ -239,13 +241,15 @@ static int set_data_inl_seg(struct mlx5_qp *qp, struct 
ibv_send_wr *wr,
int inl = 0;
void *qend = qp->sq.qend;
int copy;
+   int offset = sg_copy_ptr->offset;
 
seg = wqe;
wqe += sizeof *seg;
-   for (i = 0; i < wr->num_sge; ++i) {
-   addr = (void *) (unsigned long)(wr->sg_list[i].addr);
-   len  = wr->sg_list[i].length;
+   for (i = sg_copy_ptr->index; i < wr->num_sge; ++i) {
+   addr = (void *) (unsigned long)(wr->sg_list[i].addr + offset);
+   len  = wr->sg_list[i].length - offset;
inl += len;
+   offset = 0;
 
if (unlikely(inl > qp->max_inline_data)) {
errno = ENOMEM;
@@ -317,6 +321,56 @@ void *mlx5_get_atomic_laddr(struct mlx5_qp *qp, uint16_t 
idx, int *byte_count)
return addr;
 }
 
+static inline int copy_inline_headers(struct ibv_send_wr *wr,
+ struct mlx5_wqe_eth_seg *eseg,
+   

[PATCH for-next 0/6] Raw Packet QP user-space support for mlx5

2015-12-10 Thread Majd Dibbiny
Hi Eli,

This patch set adds support for Raw Packet QP for user-space applications.

Raw Packet QP enables the user to send and receive raw packets. The user is
responsible of building the packet including the headers.

Raw Packet QP is composed of the following sub-objects:
1. RQ: Receive Queue
2. TIR: Transport Interface Receive
3. SQ: Send Queue
4. TIS: Transport Interface Send

The first two sub-objects are responsible for the RX side, the later are
responsible for the TX side.

The Raw Packet QP state changes are implemented by changing the
state of the sub-objects.

Since the SQ and RQ work-queue (WQ) buffers are not contiguous like
other QPs, we allocate separate buffers in the user-space and pass
the address of each one of them separately to the kernel.

In order to tie CQEs to SQ or RQ, we need to provide a user-index in the
creation of each one of them. The user-index is generated by the user-space
driver. The first two patches are responsible for this functionality.

Patches 0003-0005 add RQ and SQ event handling. This is necessary in order to
notify the user application regarding RQ/SQ affiliated events.

Patch 0006 adds all the required functionality to the mlx5_ib module to
create, destroy, modify and query a Raw Packet QP.

This patch series is applied on top of ib-next and depends on "Add RoCE support
to the mlx5 driver" series.
In order to receive packets, flow steering support should be added to the
mlx5_ib. 

Regards,
Majd

Haggai Abramonvsky (2):
  IB/mlx5: Add CQE version 1 support to user space
  IB/mlx5: Exposed CQE version to user-space

Majd Dibbiny (4):
  net/mlx5_core: Export transport objects
  net/mlx5_core: Add RQ and SQ event handling
  net/mlx5_core: Warn on unsupported events of QP/RQ/SQ
  IB/mlx5: Add Raw Packet Queue Pair (QP) support

 drivers/infiniband/hw/mlx5/main.c  |  19 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h   |  54 +-
 drivers/infiniband/hw/mlx5/qp.c| 908 ++---
 drivers/infiniband/hw/mlx5/srq.c   |  31 +-
 drivers/infiniband/hw/mlx5/user.h  |   8 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c   |   1 +
 drivers/net/ethernet/mellanox/mlx5/core/main.c |   4 -
 drivers/net/ethernet/mellanox/mlx5/core/qp.c   | 230 --
 drivers/net/ethernet/mellanox/mlx5/core/srq.c  |   4 +-
 drivers/net/ethernet/mellanox/mlx5/core/transobj.c |  49 +-
 drivers/net/ethernet/mellanox/mlx5/core/transobj.h |  72 --
 include/linux/mlx5/device.h|  10 +-
 include/linux/mlx5/driver.h|   8 +-
 include/linux/mlx5/mlx5_ifc.h  |  11 +-
 include/linux/mlx5/qp.h|  11 +-
 include/linux/mlx5/transobj.h  |  78 ++
 17 files changed, 1228 insertions(+), 272 deletions(-)
 delete mode 100644 drivers/net/ethernet/mellanox/mlx5/core/transobj.h
 create mode 100644 include/linux/mlx5/transobj.h

-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next 2/6] IB/mlx5: Exposed CQE version to user-space

2015-12-10 Thread Majd Dibbiny
From: Haggai Abramonvsky 

Report the CQE version to the user-space via the response of
alloc_ucontext command.

Signed-off-by: Haggai Abramovsky 
---
 drivers/infiniband/hw/mlx5/main.c  | 4 +++-
 drivers/infiniband/hw/mlx5/user.h  | 3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/main.c | 4 
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index bdd60a6..1b503e0 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -633,6 +633,8 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
resp.max_send_wqebb = 1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz);
resp.max_recv_wr = 1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz);
resp.max_srq_recv_wr = 1 << MLX5_CAP_GEN(dev->mdev, log_max_srq_sz);
+   resp.cqe_version = MLX5_CAP_GEN(dev->mdev, cqe_version);
+   resp.reserved = 0;
 
context = kzalloc(sizeof(*context), GFP_KERNEL);
if (!context)
@@ -684,7 +686,7 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
resp.tot_uuars = req.total_num_uuars;
resp.num_ports = MLX5_CAP_GEN(dev->mdev, num_ports);
err = ib_copy_to_udata(udata, ,
-  sizeof(resp) - sizeof(resp.reserved));
+  min(sizeof(resp), udata->outlen));
if (err)
goto out_uars;
 
diff --git a/drivers/infiniband/hw/mlx5/user.h 
b/drivers/infiniband/hw/mlx5/user.h
index 9a78578..4d70505 100644
--- a/drivers/infiniband/hw/mlx5/user.h
+++ b/drivers/infiniband/hw/mlx5/user.h
@@ -80,7 +80,8 @@ struct mlx5_ib_alloc_ucontext_resp {
__u32   max_recv_wr;
__u32   max_srq_recv_wr;
__u16   num_ports;
-   __u16   reserved;
+   __u8cqe_version;
+   __u8reserved;
 };
 
 struct mlx5_ib_alloc_pd_resp {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 03aabdd..f9380b8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -601,7 +601,6 @@ clean:
return err;
 }
 
-#ifdef CONFIG_MLX5_CORE_EN
 static int mlx5_core_set_issi(struct mlx5_core_dev *dev)
 {
u32 query_in[MLX5_ST_SZ_DW(query_issi_in)];
@@ -654,7 +653,6 @@ static int mlx5_core_set_issi(struct mlx5_core_dev *dev)
 
return -ENOTSUPP;
 }
-#endif
 
 static int map_bf_area(struct mlx5_core_dev *dev)
 {
@@ -738,13 +736,11 @@ static int mlx5_dev_init(struct mlx5_core_dev *dev, 
struct pci_dev *pdev)
goto err_pagealloc_cleanup;
}
 
-#ifdef CONFIG_MLX5_CORE_EN
err = mlx5_core_set_issi(dev);
if (err) {
dev_err(>dev, "failed to set issi\n");
goto err_disable_hca;
}
-#endif
 
err = mlx5_satisfy_startup_pages(dev, 1);
if (err) {
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 03/10] IB/iser: Don't register memory for all immediate data writes

2015-12-10 Thread Sagi Grimberg



On 09/12/2015 21:30, Christoph Hellwig wrote:

The iser_reg_rdma_mem calling conventions seem rather confusing,
and this patch doesn't help that.  But I think that's something
to be addressed in bigger cleanup later on.


I do plan to rework this area. It would be nice to have it simpler,
But I want to rework the signature API a little bit before I do that.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-10 Thread Moni Shoua
> No, I am defining an API that *make sense* and doesn't leak useless
> details. Of course that doesn't force code duplication or anyhting
> like that, just implement it smartly.
>
> I think mlx made a big mistake returning network_type instead of gid
> index, and I don't want to see that error enshrined in our API.
>
returning gid_index is wrong because it forces CQ pollers to be aware
of the entire table. Like I already mentioned, the GID table is a HW
resource that can be divided and handed to multiple VMs,

>> The Verbs are a low-level API, that should report exactly what was
>> received from the wire.  In the RoCEv2 case, it should be the GID/IP
>> addresses and the protocol type.  The addressing information is not
>> intended to be used directly by applications; it is the raw bits
>> that were accepted from the wire.
>
> Low level details isn't what any in kernel consumer needs. Everything
> in kernel needs the gid index to determine the namespace, routing and
> other details. It is not optional. A common API is thus needed to do
> this conversion.
>
> API-wise, once you get the gid index then it is trivial to make easy
> extractors for everything else. ie for example:
> rdma_get_ud_src_sockaddr(gid_index,,wc,grh)
> rdma_get_ud_dst_sockaddr(gid_index,,wc,grh)
>
>> ib_init_ah_from_wc() and friends is exactly the place that you want
>> to create an address handle based on completion and packet fields.
>
> CMA needs exactly the same logic as well, the fact it doesn't have it
> is a bug in this series.
>
> Jason
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 07/10] iser-target: Declare correct flags when accepting a connection

2015-12-10 Thread Sagi Grimberg




Only mrginally related, but can someone explain what zero based
virtual addresses means in this context?  Does this means it uses
the old RFC5046-style header without the read/write_va fields?
Or does it mean those fields exist but must always be zero?


That's correct, negotiating this bit means that the iser header
format must not include read/write_va and it is assumed to be 0.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-10 Thread Sagi Grimberg




Doug this is going to conflict with the rdmavt work.  So if you take this could
you respond on the list.


It will also conflict with the iser remote invalidate series.

Doug it would help if you share your plans so people can rebase
accordingly.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 02/10] IB/iser: Reuse ib_sg_to_pages

2015-12-10 Thread Sagi Grimberg



On 09/12/2015 21:28, Christoph Hellwig wrote:

On Wed, Dec 09, 2015 at 02:12:00PM +0200, Sagi Grimberg wrote:

We have in iser iser_sg_to_page_vec which has exactly
the same role as ib_sg_to_pages. Customize the page_vec
to hold a fake MR so we can reuse ib_sg_to_pages.


Looks good.  In the long run we should simply kill struct ib_fmr
and make FRMs operate on struct ib_mr so that it can use
ib_sg_to_pages directly.


We can do that. We'd need to add ib_mr a list member just for fmr
routines.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH libibcm] cmpost.c: Handle ibv_get_device_list returning no IB devices in init()

2015-12-10 Thread Doug Ledford
On 11/05/2015 12:46 PM, Hefty, Sean wrote:
> Merged - thanks.
> 
> This is the first patch against the libibcm in over 4 years.  Is there a 
> reason why this is being used instead of the librdmacm?  I ask because I 
> assumed that the libibcm was basically deprecated.  The last release was over 
> 6 years ago.

Not to mention the fact that libibcm is completely lacking all
documentation if I recall correctly.  No man pages or anything like
that.  We've considered it deprecated for quite some time.


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD




signature.asc
Description: OpenPGP digital signature


RE: [PATCH V2 1/1] [PATCH 1/1] Ibacm: default pkey for partitioned fabrics

2015-12-10 Thread Hefty, Sean
Thanks - applied
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] IB/mad: Ensure fairness in ib_mad_completion_handler

2015-12-10 Thread ira . weiny
From: Dean Luick 

It was found that when a process was rapidly sending MADs other processes could
be hung in their unregister calls.

This would happen when process A was injecting packets fast enough that the
single threaded workqueue was never exiting ib_mad_completion_handler.
Therefore when process B called flush_workqueue via the unregister call it
would hang until process A stopped sending MADs.

The fix is to periodically reschedule ib_mad_completion_handler after
processing a large number of completions.  The number of completions chosen was
decided based on the defaults for the recv queue size.  However, it was kept
fixed such that increasing those queue sizes would not adversely affect
fairness in the future.

Reviewed-by: Ira Weiny 
Signed-off-by: Dean Luick 
---
 drivers/infiniband/core/mad.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 2281de122038..d4d2a618fd66 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -61,6 +61,18 @@ MODULE_PARM_DESC(send_queue_size, "Size of send queue in 
number of work requests
 module_param_named(recv_queue_size, mad_recvq_size, int, 0444);
 MODULE_PARM_DESC(recv_queue_size, "Size of receive queue in number of work 
requests");
 
+/*
+ * Define a limit on the number of completions which will be processed by the
+ * worker thread in a single work item.  This ensures that other work items
+ * (potentially from other users) are processed fairly.
+ *
+ * The number of completions was derived from the default queue sizes above.
+ * We use a value which is double the larger of the 2 queues (receive @ 512)
+ * but keep it fixed such that an increase in that value does not introduce
+ * unfairness.
+ */
+#define MAD_COMPLETION_PROC_LIMIT 1024
+
 static struct list_head ib_mad_port_list;
 static u32 ib_mad_client_id = 0;
 
@@ -2555,6 +2567,7 @@ static void ib_mad_completion_handler(struct work_struct 
*work)
 {
struct ib_mad_port_private *port_priv;
struct ib_wc wc;
+   int count = 0;
 
port_priv = container_of(work, struct ib_mad_port_private, work);
ib_req_notify_cq(port_priv->cq, IB_CQ_NEXT_COMP);
@@ -2574,6 +2587,11 @@ static void ib_mad_completion_handler(struct work_struct 
*work)
}
} else
mad_error_handler(port_priv, );
+
+   if (++count > MAD_COMPLETION_PROC_LIMIT) {
+   queue_work(port_priv->wq, _priv->work);
+   break;
+   }
}
 }
 
-- 
1.8.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PULL REQUEST] Please pull rdma.git

2015-12-10 Thread Doug Ledford
Hi Linus,

The following changes since commit 527e9316f8ec44bd53d90fb9f611fa752bb9:

  Linux 4.4-rc4 (2015-12-06 15:43:12 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git
tags/for-linus

for you to fetch changes up to ab5cdc31630c7596d81ca8fbe7d695f10666f39b:

  IB/mlx5: Postpone remove_keys under knowledge of coming preemption
(2015-12-08 16:55:31 -0500)


Series of 4.4-rc fixes

Most are minor to important fixes.  There is one performance enhancement
that I took on the grounds that failing to check if other processes can
run before running what's intended to be a background, idle-time task
is a bug, even though the primary effect of the fix is to improve
performance (and it was a very simple patch).


Arnd Bergmann (1):
  IB/iser: use sector_div instead of do_div

Bart Van Assche (5):
  IB/srp: Fix a memory leak
  IB/srp: Fix indirect data buffer rkey endianness
  IB/srp: Fix srp_map_sg_fr()
  IB core: Fix ib_sg_to_pages()
  IB/cma: Add a missing rcu_read_unlock()

Christoph Hellwig (1):
  IB/srp: Initialize dma_length in srp_map_idb

Easwar Hariharan (1):
  IB/qib: Minor fixes to qib per SFF 8636

Hal Rosenstock (1):
  IB/mad: Require CM send method for everything except ClassPortInfo

Ira Weiny (1):
  IB/qib: Fix qib_mr structure

Kaike Wan (1):
  IB/sa: Put netlink request into the request list before sending

Leon Romanovsky (1):
  IB/mlx5: Postpone remove_keys under knowledge of coming preemption

Mike Marciniszyn (2):
  IB/core: Fix user mode post wr corruption
  IB/core: use RCU for uverbs id lookup

Sagi Grimberg (3):
  IB/srp: Fix possible send queue overflow
  mlx4: Expose correct max_sge_rd limit
  iser-target: Remove explicit mlx4 work-around

Wengang Wang (2):
  IB/mlx4: Use correct order of variables in log message
  IB/mlx4: Use vmalloc for WR buffers when needed

 drivers/infiniband/core/cma.c|  5 +---
 drivers/infiniband/core/mad.c|  5 
 drivers/infiniband/core/sa_query.c   | 32 +++--
 drivers/infiniband/core/uverbs_cmd.c | 27 +++---
 drivers/infiniband/core/verbs.c  | 43 ++--
 drivers/infiniband/hw/mlx4/main.c|  2 +-
 drivers/infiniband/hw/mlx4/qp.c  | 19 +
 drivers/infiniband/hw/mlx4/srq.c | 11 ++--
 drivers/infiniband/hw/mlx5/mr.c  | 14 +-
 drivers/infiniband/hw/qib/qib_qsfp.c |  4 +--
 drivers/infiniband/hw/qib/qib_verbs.h|  2 +-
 drivers/infiniband/ulp/iser/iser_verbs.c |  2 +-
 drivers/infiniband/ulp/isert/ib_isert.c  | 13 ++---
 drivers/infiniband/ulp/srp/ib_srp.c  | 48
+---
 drivers/infiniband/ulp/srp/ib_srp.h  |  5 +---
 drivers/net/ethernet/mellanox/mlx4/cmd.c |  2 +-
 include/linux/mlx4/device.h  | 11 
 include/rdma/ib_mad.h|  2 ++
 include/rdma/ib_verbs.h  |  1 +
 19 files changed, 146 insertions(+), 102 deletions(-)

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD




signature.asc
Description: OpenPGP digital signature


Re: [PATCH] IB/usnic: Handle 0 counts in resource allocation

2015-12-10 Thread Nelson Escobar


On 12/10/2015 11:47 AM, l...@leon.nu wrote:
> On Thu, Dec 10, 2015 at 11:22:24AM -0800, Nelson Escobar wrote:
>> On 12/9/2015 10:47 PM, Leon Romanovsky wrote:
>>> On Wed, Dec 09, 2015 at 10:42:19AM -0800, Nelson Escobar wrote:
 -  if (usnic_vnic_res_free_cnt(vnic, type) < cnt || cnt < 1 || !owner)
 +  if (usnic_vnic_res_free_cnt(vnic, type) < cnt || cnt < 0 || !owner)
>>> Before this change you returned EINVAL if no free_cnt were available,
>>> now you will continue. is this behaviour expected?
>> Yes.  If cnt is 0, then no resources are being requested, so it is OK if
>> there are no resources available.
> I afraid that you missed the point.
Thanks for looking at the code.  I am still not understanding your point.
> Old code:
> usnic_vnic_res_free_cnt(vnic, type) == 0 and cnt == 1 will return EINVAL
Yes:
if (0 < 1 || 1 < 1 || !owner)
return -EINVAL;
> New code
> snic_vnic_res_free_cnt(vnic, type) == 0 and cnt == 1 will pass and will
> pass te "if (cnt > 0)" check below and will decrease free_cnt variable
> to be below zero.
This I don't understand.  The following still fails with -EINVAL.
if (0 < 1 || 1 < 0 || !owner)
return -EINVAL;
> 
> Is this behavior expected?
>>>
return ERR_PTR(-EINVAL);
  
ret = kzalloc(sizeof(*ret), GFP_ATOMIC);
 @@ -247,26 +247,28 @@ usnic_vnic_get_resources(struct usnic_vnic *vnic, 
 enum usnic_vnic_res_type type,
return ERR_PTR(-ENOMEM);
}
  
 -  ret->res = kzalloc(sizeof(*(ret->res))*cnt, GFP_ATOMIC);
 -  if (!ret->res) {
 -  usnic_err("Failed to allocate resources for %s. Out of 
 memory\n",
 -  usnic_vnic_pci_name(vnic));
 -  kfree(ret);
 -  return ERR_PTR(-ENOMEM);
 -  }
 +  if (cnt > 0) {
 +  ret->res = kcalloc(cnt, sizeof(*(ret->res)), GFP_ATOMIC);
 +  if (!ret->res) {
 +  usnic_err("Failed to allocate resources for %s. Out of 
 memory\n",
 +  usnic_vnic_pci_name(vnic));
>>> You don't need to print OOM messages, failure in memory allocation very 
>>> hard to miss.
>> OOM messages are hard to miss, but this message is already in upstream
>> and outside the scope of this patch.
> It is worth to fix, especially if you are changing these exact lines.
 +  kfree(ret);
 +  return ERR_PTR(-ENOMEM);
 +  }
  
 -  spin_lock(>res_lock);
 -  src = >chunks[type];
 -  for (i = 0; i < src->cnt && ret->cnt < cnt; i++) {
 -  res = src->res[i];
 -  if (!res->owner) {
 -  src->free_cnt--;
 -  res->owner = owner;
 -  ret->res[ret->cnt++] = res;
 +  spin_lock(>res_lock);
 +  src = >chunks[type];
 +  for (i = 0; i < src->cnt && ret->cnt < cnt; i++) {
 +  res = src->res[i];
 +  if (!res->owner) {
 +  src->free_cnt--;
>>> It will be negative, because of skip usnic_vnic_res_free_cnt check
>>> before.
>> We are inside the 'if (cnt > 0)' clause here, so the previous
>> usnic_vnic_res_free_cnt check wasn't skipped.
> See above.
> 
 +  res->owner = owner;
 +  ret->res[ret->cnt++] = res;
 +  }
}
 -  }
  
 -  spin_unlock(>res_lock);
 +  spin_unlock(>res_lock);
 +  }
ret->type = type;
ret->vnic = vnic;
WARN_ON(ret->cnt != cnt);
 @@ -281,14 +283,16 @@ void usnic_vnic_put_resources(struct 
 usnic_vnic_res_chunk *chunk)
int i;
struct usnic_vnic *vnic = chunk->vnic;
  
 -  spin_lock(>res_lock);
 -  while ((i = --chunk->cnt) >= 0) {
 -  res = chunk->res[i];
 -  chunk->res[i] = NULL;
 -  res->owner = NULL;
 -  vnic->chunks[res->type].free_cnt++;
 +  if (chunk->cnt > 0) {
 +  spin_lock(>res_lock);
 +  while ((i = --chunk->cnt) >= 0) {
 +  res = chunk->res[i];
 +  chunk->res[i] = NULL;
 +  res->owner = NULL;
 +  vnic->chunks[res->type].free_cnt++;
 +  }
 +  spin_unlock(>res_lock);
}
 -  spin_unlock(>res_lock);
  
kfree(chunk->res);
kfree(chunk);
 -- 
 2.4.3

 --
 To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] staging/rdma/hfi1: Fix for module parameter hdrq_entsize when it's 0

2015-12-10 Thread Mike Marciniszyn
From: Sebastian Sanchez 

If driver is loaded with parameter hdrq_entsize=0, then
there's a NULL dereference when the driver gets unloaded.
This causes a kernel Oops and prevents the module  from
being unloaded. This patch fixes this issue by making sure
-EINVAL gets returned when hdrq_entsize=0.

Reviewed-by: Chegondi, Harish 
Reviewed-by: Haralanov, Mitko 
Signed-off-by: Sebastian Sanchez 
---
 drivers/staging/rdma/hfi1/init.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/staging/rdma/hfi1/init.c b/drivers/staging/rdma/hfi1/init.c
index 1c8286f..b61a854 100644
--- a/drivers/staging/rdma/hfi1/init.c
+++ b/drivers/staging/rdma/hfi1/init.c
@@ -1350,6 +1350,7 @@ static int init_one(struct pci_dev *pdev, const struct 
pci_device_id *ent)
if (!encode_rcv_header_entry_size(hfi1_hdrq_entsize)) {
hfi1_early_err(>dev, "Invalid HdrQ Entry size %u\n",
   hfi1_hdrq_entsize);
+   ret = -EINVAL;
goto bail;
}
 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/1] IB/sa: Put netlink request into the request list before sending

2015-12-10 Thread Doug Ledford
On 11/13/2015 08:57 AM, Wan, Kaike wrote:
>> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
>> ow...@vger.kernel.org] On Behalf Of kaike@intel.com
>> Sent: Friday, October 30, 2015 8:24 AM
>> To: linux-rdma@vger.kernel.org
>> Cc: Wan, Kaike
>> Subject: [PATCH v2 1/1] IB/sa: Put netlink request into the request list 
>> before
>> sending
>>
>> From: Kaike Wan 
>>
>> It was found by Saurabh Sengar that the netlink code tried to allocate
>> memory with GFP_KERNEL while holding a spinlock. While it is possible to fix
>> the issue by replacing GFP_KERNEL with GFP_ATOMIC, it is better to get rid
>> of the spinlock while sending the packet. However, in order to protect
>> against a race condition that a quick response may be received before the
>> request is put on the request list, we need to put the request on the list 
>> first.
>>
>> Signed-off-by: Kaike Wan 
>> ---
> 
> Reported-by: Saurabh Sengar 
> 

This was pulled in for 4.4-rc.  Thanks.


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD




signature.asc
Description: OpenPGP digital signature


Re: [PATCH 02/37] IB/rdmavt: Consolidate dma ops in rdmavt.

2015-12-10 Thread Jason Gunthorpe
On Thu, Dec 10, 2015 at 03:41:12PM -0800, Christoph Hellwig wrote:
> On Thu, Dec 10, 2015 at 10:44:13AM -0700, Jason Gunthorpe wrote:
> > > the signature of the function in struct ib_dma_mapping_ops.
> > 
> > It is supposed to be a dma_addr_t 'cookie' not a u64.
> > 
> > A patch to cleanup the core in this area would be appreciated.
> 
> I walked through the ib_dma_* mess in detail, and sadly speaking it
> has to be a u64.  This is due to the drivers being consolidated into
> rdmavt in fact.

> Those drivers use the addr field in struct ib_sge to point to a kernel
> virtual address, not to a DMA address.  In Linux u64 is the safe
> superset for a dma_addr_t and a pointer so we'll need to go with that.

Hrm.. sizeof(void *) > sizeof(dma_addr_t) seemed pretty obscure to me,
here is the original discussion:

https://lkml.org/lkml/2006/12/13/245

Sounds like someone was worried about sparc64. I doubt it is an actual
issue today, but granted the u64 did make some sense.

I probably would have just banned compiling qib on such a platform,
and kept the dma_addr_t annotations..

> Now these drivers will end up dma mapping these virtual addresses later,
> so we might want figure out a) why the qib & co drivers even need the
> virtual address, and b) see if we maybe should always do the dma_map
> in the callers anyway, and just have an additional virtual address field
> for those drivers if absolutely needed.

So, I *believe* the issue is that linux has (had?) no approved way to
convert from a device specific dma_addr_t to a virtual address.

I always understood that there was some kind of path in qib where it
needed to memcpy with the CPU the 'dma' data, so if it only had a
dma_addr then there was no way to make it work.

Certainly a future generic soft-rocee driver will need to cpu memcpy from
'dma' memory to a skb..

It is really too bad we can't just use get_dma_ops to handle this case
and instead require our own infrastructure.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-10 Thread Christoph Hellwig
On Thu, Dec 10, 2015 at 12:46:54PM -0500, Doug Ledford wrote:
> Organization.  Let's be fair, the totally flat namespace you are
> preferring is the equivalent of a teenager that is completely incapable
> of picking thier dirty laundry up off the floor.  It is sloppy,
> disorganized, often full of old cruft that you don't know if you can get
> rid of or not, often so disorganized you might have three similarly
> named items that you can't figure out which one should be used in which
> circumstances, etc.

The most cruft I've found in major subsystems in years has been
in the RDMA code, so I'm not sure where that argument comes from.
We're pretty good at garbage collecting cruft in Linux, and the
typical counter examples are arbitrarily split structures where it's
easy to hide things.

> >  And looking at the existing members of
> > struct ib_device what determines if it goes straight into the device
> > or the attribute?
> 
> Organization.  What goes where depends on what makes sense according to
> the organization you are doing.

So what makes num_comp_vectors or phys_port_cnt fit into ib_device,
while max_qp or max_cq are in struct ib_device_attr?

I really like clean data structures, but keeping structures that
have 1:1 relationships and sit in the same module separate never
has been a good idea.

> >  There is a reason why we don't do this weird
> > attr split in other Linux subsystems, and making IB follow this pattern
> > makes everyone feel right at home instead of wondering about the
> > weird attribute.
> 
> Being organized is not "weird".  Let's not wax poetic about sloppy,
> disorganized structures.  Let's be honest about what they are so we
> don't feel like we need to take a shower every time we talk about them
> to purge us of the sins of our lies.

I call that utter BS.  Being organized is exactly not having multiple
structures that have the same scope or life time, it's actually what
I call disorganzied.  There is a lot to be said about grouping the
fields in the structure, and that's how sensible subsystems handle it:

stuct foo_bar {
/* read/write in the hot path, keep together and tightly packed: */
...

/* read-only in the hot path */
...

/* random members: */
...

/* properties here, immutable after setup: */
...
};

but that's completely inverse to what we're having with ib_device
currently.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-10 Thread Christoph Hellwig
On Thu, Dec 10, 2015 at 11:07:03AM -0700, Jason Gunthorpe wrote:
> The ARM folks do this sort of stuff on a regular basis.. Very early on
> Doug prepares a topic branch with only the big change, NFS folks pull
> it and then pull your work. Then Doug would send the topic branch to
> Linus as soon as the merge window opens, then NFS would send theirs.
> 
> This is alot less work overall than trying to sequence multiple
> patches over multiple releases..

Agreed.  Staging has alaways been a giant pain and things tend to never
finish moving over that way if they are non-trivial enough.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-10 Thread Chuck Lever

> On Dec 10, 2015, at 6:30 PM, Christoph Hellwig  wrote:
> 
> On Thu, Dec 10, 2015 at 11:07:03AM -0700, Jason Gunthorpe wrote:
>> The ARM folks do this sort of stuff on a regular basis.. Very early on
>> Doug prepares a topic branch with only the big change, NFS folks pull
>> it and then pull your work. Then Doug would send the topic branch to
>> Linus as soon as the merge window opens, then NFS would send theirs.
>> 
>> This is alot less work overall than trying to sequence multiple
>> patches over multiple releases..
> 
> Agreed.  Staging has alaways been a giant pain and things tend to never
> finish moving over that way if they are non-trivial enough.

In that case:

You need to make sure you have all the right Acks. I've added
Anna and Bruce to Ack the NFS-related portions. Santosh should
Ack the RDS part.

http://git.infradead.org/users/hch/rdma.git/shortlog/refs/heads/ib_device_attr

Given the proximity to the holidays and the next merge window,
Doug will need to get a properly-acked topic branch published
in the next day or two so the rest of us can rebase and start
testing before the relevant parties disappear for the holiday.


--
Chuck Lever




--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/37] IB/rdmavt: Consolidate dma ops in rdmavt.

2015-12-10 Thread Christoph Hellwig
On Thu, Dec 10, 2015 at 10:44:13AM -0700, Jason Gunthorpe wrote:
> > the signature of the function in struct ib_dma_mapping_ops.
> 
> It is supposed to be a dma_addr_t 'cookie' not a u64.
> 
> A patch to cleanup the core in this area would be appreciated.

I walked through the ib_dma_* mess in detail, and sadly speaking it
has to be a u64.  This is due to the drivers being consolidated into
rdmavt in fact.

Those drivers use the addr field in struct ib_sge to point to a kernel
virtual address, not to a DMA address.  In Linux u64 is the safe
superset for a dma_addr_t and a pointer so we'll need to go with that.

Now these drivers will end up dma mapping these virtual addresses later,
so we might want figure out a) why the qib & co drivers even need the
virtual address, and b) see if we maybe should always do the dma_map
in the callers anyway, and just have an additional virtual address field
for those drivers if absolutely needed.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/37] IB/rdmavt: Consolidate dma ops in rdmavt.

2015-12-10 Thread Christoph Hellwig
On Thu, Dec 10, 2015 at 05:29:50PM -0700, Jason Gunthorpe wrote:
> Hrm.. sizeof(void *) > sizeof(dma_addr_t) seemed pretty obscure to me,
> here is the original discussion:
> 
> https://lkml.org/lkml/2006/12/13/245
> 
> Sounds like someone was worried about sparc64. I doubt it is an actual
> issue today, but granted the u64 did make some sense.

sparc64 still uses a 32-bit dma_addr_t.

> > Now these drivers will end up dma mapping these virtual addresses later,
> > so we might want figure out a) why the qib & co drivers even need the
> > virtual address, and b) see if we maybe should always do the dma_map
> > in the callers anyway, and just have an additional virtual address field
> > for those drivers if absolutely needed.
> 
> So, I *believe* the issue is that linux has (had?) no approved way to
> convert from a device specific dma_addr_t to a virtual address.

Linux doesn't have an approved way because it's impossible for the
generic case.  When you have an iommu you have potentially multiple
page tables mapping physical addresses to device virtual addresses,
and there is no easy way to do a reverse mapping.

> It is really too bad we can't just use get_dma_ops to handle this case
> and instead require our own infrastructure.

FYI, I have a patch series in linux-next to switches all remaining
architectures to use get_dma_ops, and there are plans to allow generic
per-device dma_ops based on that.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/37] IB/rdmavt: Add ib core device attributes to rvt driver params list

2015-12-10 Thread Haggai Eran
On 07/12/2015 22:43, Dennis Dalessandro wrote:
>  struct rvt_dev_info {
> + /*
> +  * Prior to calling for registration the driver will be responsible for
> +  * allocating space for this structure. The driver will also need to
> +  * allocate space for any private device or per port data structures.
> +  * Alternatively rvt could do this allocation and the registration API
> +  * would then change to accept an "extra" piece to allocate.
I don't think you need rvt to allocate the private data, but even if you
decide to do that, there's no need for a comment that describes all the
alternative designs here.

> +  *
> +  * The driver will also be
> +  * responsible for filling in certain members of dparms.props
> +  */

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/37] IB/rdmavt: Add ib core device attributes to rvt driver params list

2015-12-10 Thread Haggai Eran
On 07/12/2015 22:43, Dennis Dalessandro wrote:
> + /*
> +  * Drivers will need to support a number of notifications to rvt in
> +  * accordance with certain events. This structure should contain a mask
> +  * of the supported events. Such events that the rvt may need to know
> +  * about include:
> +  * port errors
> +  * port active
> +  * lid change
> +  * sm change
> +  * client reregister
> +  * pkey change
Where are the event values defined?

> +  *
> +  * There may also be other events that the rvt layers needs to know
> +  * about this is not an exhaustive list. Some events though rvt does not
> +  * need to rely on the driver for such as completion queue error.
> +  */
> +  int rvt_signal_supported;

What will rvt do if it learns that the device doesn't support some of
the events?

Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


re: net/mlx5: E-Switch, Add SR-IOV (FDB) support

2015-12-10 Thread Dan Carpenter
Hello Saeed Mahameed,

The patch 81848731ff40: "net/mlx5: E-Switch, Add SR-IOV (FDB)
support" from Dec 1, 2015, leads to the following static checker
warning:

drivers/net/ethernet/mellanox/mlx5/core/eswitch.c:579 
esw_fdb_set_vport_rule()
warn: passing zero to 'PTR_ERR'

drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
   568  esw_debug(esw->dev,
   569"\tFDB add rule dmac_v(%pM) dmac_c(%pM) -> 
vport(%d)\n",
   570dmac_v, dmac_c, vport);
   571  flow_rule =
   572  mlx5_add_flow_rule(esw,
   573 match_header,
   574 match_c,
   575 match_v,
   576 MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
   577 0, );
   578  if (IS_ERR_OR_NULL(flow_rule)) {

mlx5_add_flow_rule() only returns NULL on error.  It never returns
ERR_PTRs.

   579  pr_warn(
   580  "FDB: Failed to add flow rule: dmac_v(%pM) 
dmac_c(%pM) -> vport(%d), err(%ld)\n",

 
It's not a terrible bug but this always says "err(0)" which is not very
useful, and it causes this static checker warning.

   581   dmac_v, dmac_c, vport, PTR_ERR(flow_rule));
   582  flow_rule = NULL;
   583  }
   584  out:
   585  kfree(match_v);
   586  kfree(match_c);
   587  return flow_rule;
   588  }

regards,
dan carpenter
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/37] IB/rdmavt: Create module framework and handle driver registration

2015-12-10 Thread Haggai Eran
On 07/12/2015 22:43, Dennis Dalessandro wrote:
> +struct rvt_dev_info {
> + struct ib_device ibdev;
> + int (*port_callback)(struct ib_device *, u8, struct kobject *);
> +
> + /*
> +  * TODO:
> +  *  need to reflect module parameters that may vary by dev
> +  */
> +};

Could you explain what you mean by the TODO comment? Different device
drivers using rvt will have their own modules with their own parameters,
right? Why do you need them reflected here?

Thanks,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/37] IB/rdmavt: Consolidate dma ops in rdmavt.

2015-12-10 Thread Haggai Eran
A few nits:

On 07/12/2015 22:43, Dennis Dalessandro wrote:
> +static int rvt_map_sg(struct ib_device *dev, struct scatterlist *sgl,
> +   int nents, enum dma_data_direction direction)
> +{
> + struct scatterlist *sg;
> + u64 addr;
> + int i;
> + int ret = nents;
> +
> + if (WARN_ON(!valid_dma_direction(direction)))
> + return BAD_DMA_ADDRESS;
The function returns 0 on error, so its technically correct, but it
doesn't return a DMA address, so I think it would make more sense to
return 0 here and not BAD_DMA_ADDRESS.

> +
> + for_each_sg(sgl, sg, nents, i) {
> + addr = (u64)page_address(sg_page(sg));
> + if (!addr) {
> + ret = 0;
> + break;
> + }
> + sg->dma_address = addr + sg->offset;
> +#ifdef CONFIG_NEED_SG_DMA_LENGTH
> + sg->dma_length = sg->length;
> +#endif
> + }
> + return ret;
> +}

> diff --git a/drivers/infiniband/sw/rdmavt/vt.h 
> b/drivers/infiniband/sw/rdmavt/vt.h
> index ec210f3..a19a3af 100644
> --- a/drivers/infiniband/sw/rdmavt/vt.h
> +++ b/drivers/infiniband/sw/rdmavt/vt.h
> @@ -52,5 +52,6 @@
>   */
>  
>  #include 
> +#include "dma.h"
Why do you need the dma.h file included here? Won't it be enough to
include it in vt.c?

Regards,
Haggai
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/37] IB/rdmavt: Add protection domain to rdmavt.

2015-12-10 Thread Dennis Dalessandro

On Tue, Dec 08, 2015 at 02:52:36PM -0500, ira. weiny wrote:

On Tue, Dec 08, 2015 at 01:17:48PM -0600, Sean Hefty wrote:

> > > +struct ib_pd *rvt_alloc_pd(struct ib_device *ibdev,
> > > +   struct ib_ucontext *context,
> > > +   struct ib_udata *udata)
> > > +{
> > > +struct rvt_dev_info *dev = ib_to_rvt(ibdev);
> > > +struct rvt_pd *pd;
> > > +struct ib_pd *ret;
> > > +
> > > +pd = kmalloc(sizeof(*pd), GFP_KERNEL);
> > > +if (!pd) {
> > > +ret = ERR_PTR(-ENOMEM);
> > > +goto bail;
> > > +}
> > > +/*
> > > + * This is actually totally arbitrary.  Some correctness tests
> > > + * assume there's a maximum number of PDs that can be allocated.
> > > + * We don't actually have this limit, but we fail the test if
> > > + * we allow allocations of more than we report for this value.
> > > + */
> >
> > Why not trap this in user space, rather than forcing the kernel to
> support some test program?
> >
> 
> What do you mean "trap this in user space"?  This code is not supporting

> any
> particular test program.
> 
> If users try and allocate more protection domains then are advertised then

> an
> error should be returned.
> 
> Perhaps the comment should be removed if it is confusing?


There is no theoretical limit on the number of PDs, or likely most other
resources.  So why define and enforce an arbitrary limit?  The justification
given was that some test program would fail.  If there's a concern about that
test program passing, then enforce the limit in user space.


I did not interpret this as being driven by a test but rather by the IBTA spec.

This may be pedantic but, wouldn't it be confusing from a user POV to see an
advertised limit but then not be constrained by it?  I'm not opposed to setting
this limit arbitrarily high, but I still think the implementation should
enforce that limit.



I didn't find the comment confusing.  Just the choice to let a test app drive 
the design.


Perhaps "confusing" is the wrong word.  But I did not interpreted the comment
as saying that the implementation was driven but a test program.


How about I reword the comment to say that while we could continue 
allocating PDs being only constrained by system resources, the spec says 
there is a limit so we enforce that?


-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-10 Thread Jason Gunthorpe
On Thu, Dec 10, 2015 at 09:58:51AM +0200, Moni Shoua wrote:

> > creating a horrible API, or requiring all vendors to implement a
> > network type flag.
> >
> Actually you haven't suggested anything yet besides a name to the function.
> I already said that calculating gid_index from wc and grh requires
> extra CPU cycles and is prone to mistakes. But, I might be wrong and
> you have a better idea. Do you?

I told you, if a driver requires vendor specific information then
include it as private data in the grh.

I keep saying this: computing the gid index is not optional. If a
drive can't do it efficiently then too bad, it has to go the long way
around.

>> I think mlx made a big mistake returning network_type instead of gid
>> index, and I don't want to see that error enshrined in our API.
>>
> returning gid_index is wrong because it forces CQ pollers to be aware
> of the entire table. Like I already mentioned, the GID table is a HW
> resource that can be divided and handed to multiple VMs,

Please. If HW virt stuff can't sort this out it is broken, there are
many trivial solutions.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-10 Thread Jason Gunthorpe
On Thu, Dec 10, 2015 at 11:07:37AM -0500, Chuck Lever wrote:
 
> Invasive IB core changes like this clean up are especially
> burdensome for me because NFS/RDMA changes do not normally
> go through Doug's tree, so it takes extra co-ordination.

The ARM folks do this sort of stuff on a regular basis.. Very early on
Doug prepares a topic branch with only the big change, NFS folks pull
it and then pull your work. Then Doug would send the topic branch to
Linus as soon as the merge window opens, then NFS would send theirs.

This is alot less work overall than trying to sequence multiple
patches over multiple releases..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/13] irq_poll: don't disable new irq_poll instances

2015-12-10 Thread Bart Van Assche

On 12/07/2015 12:51 PM, Christoph Hellwig wrote:

There is no good reason to start out disabled - drivers can control if
the poll instance can be scheduled by simply not scheduling it yet.


Reviewed-by: Bart Van Assche 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/13] irq_poll: mark __irq_poll_complete static

2015-12-10 Thread Bart Van Assche

On 12/07/2015 12:51 PM, Christoph Hellwig wrote:

[ ... ]


Reviewed-by: Bart Van Assche 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/13] irq_poll: fold irq_poll_disable_pending into irq_poll_softirq

2015-12-10 Thread Bart Van Assche

On 12/07/2015 12:51 PM, Christoph Hellwig wrote:
> [ ... ]

Reviewed-by: Bart Van Assche 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/13] irq_poll: fold irq_poll_sched_prep into irq_poll_sched

2015-12-10 Thread Bart Van Assche

On 12/07/2015 12:51 PM, Christoph Hellwig wrote:

@@ -58,7 +62,7 @@ EXPORT_SYMBOL(__irq_poll_complete);
   * Description:
   * If a driver consumes less than the assigned budget in its run of the
   * iopoll handler, it'll end the polled mode by calling this function. The
- * iopoll handler will not be invoked again before irq_poll_sched_prep()
+ * iopoll handler will not be invoked again before irq_poll_schedp()
   * is called.
   **/
  void irq_poll_complete(struct irq_poll *iop)


Is that a typo at the end of the modified line ("schedp") ? If this gets 
addressed:


Reviewed-by: Bart Van Assche 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/13] IB/srp: use the new CQ API

2015-12-10 Thread Bart Van Assche

On 12/07/2015 12:51 PM, Christoph Hellwig wrote:

+static void srp_send_done(struct ib_cq *cq, struct ib_wc *wc)
+{
+   struct srp_iu *iu = container_of(wc->wr_cqe, struct srp_iu, cqe);
+   struct srp_rdma_ch *ch = cq->cq_context;
+
+   if (likely(wc->status != IB_WC_SUCCESS)) {
+   srp_handle_qp_err(cq, wc, "SEND");
+   return;
+   }
+
+   list_add(>list, >free_tx);
+}


Please change likely() in the above code into unlikely().

Thanks,

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-10 Thread Doug Ledford
On 12/09/2015 01:44 PM, Christoph Hellwig wrote:
> On Wed, Dec 09, 2015 at 01:13:29AM +0200, Or Gerlitz wrote:
>>
>> Christoph patch is here indeed, it does two things
>>
>> 1. remove all the ULP device attr alloc, device query, attr free hassle
>> 2. adds tons of new fields to struct ib_device
>>
>> I think it just goes too much and needlessly adds tons of these new
>> fields directly to struct ib_device where we can have them all well
>> scoped into ib_device_attr member or pointer from struct ib_device
> 
> What's the benefit of that?

Organization.  Let's be fair, the totally flat namespace you are
preferring is the equivalent of a teenager that is completely incapable
of picking thier dirty laundry up off the floor.  It is sloppy,
disorganized, often full of old cruft that you don't know if you can get
rid of or not, often so disorganized you might have three similarly
named items that you can't figure out which one should be used in which
circumstances, etc.

The downside of a more organized approach is often longer/more complex
variable names.  It also means that if you want to micro-optimize for
which variables are in the hot path and you need one and only one of the
variables from the attr struct, then you either have to make a copy
somewhere else that is in a frequently hot cache line (along with making
sure the copy and the original stay in sync) or you have to take it out
of the attr struct, or other bad things.

>  And looking at the existing members of
> struct ib_device what determines if it goes straight into the device
> or the attribute?

Organization.  What goes where depends on what makes sense according to
the organization you are doing.

>  There is a reason why we don't do this weird
> attr split in other Linux subsystems, and making IB follow this pattern
> makes everyone feel right at home instead of wondering about the
> weird attribute.

Being organized is not "weird".  Let's not wax poetic about sloppy,
disorganized structures.  Let's be honest about what they are so we
don't feel like we need to take a shower every time we talk about them
to purge us of the sins of our lies.

That said, though, the kernel frequently has hot spots that require we
have the freedom to rearrange structures to suit memory placement
constraints.  It's really for that reason more than anything else that
we tolerate these horribly disorganized structures with a totally flat
namespace.  And to be fair, any super high speed core code like the IB
code is more susceptible than most to issues of cache line delays and
hot path slow downs.

For that reason, and *only* that reason, I'm inclined to take your
patch.  Otherwise, I wouldn't touch it.  The rest of the kernel may
think a disorganized, flat namespace is fruit punch kool-aid...I happen
to prefer things differently.  But I'm willing to acquiesce to the needs
of a high performance kernel subsystem as appropriate regardless of my
personal taste on the issue.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD




signature.asc
Description: OpenPGP digital signature


Re: [PATCH 03/37] IB/rdmavt: Add protection domain to rdmavt.

2015-12-10 Thread Leon Romanovsky
On Thu, Dec 10, 2015 at 11:40:48AM -0500, Dennis Dalessandro wrote:
> On Tue, Dec 08, 2015 at 08:28:17AM +0200, Leon Romanovsky wrote:
> >On Mon, Dec 07, 2015 at 03:43:10PM -0500, Dennis Dalessandro wrote:
> >>+
> >>+/*
> >>+ * Things that are driver specific, module parameters in hfi1 and qib
> >>+ */
> >>+struct rvt_driver_params {
> >>+   int max_pds;
> >Can it be negative value?
> >>+};
> 
> If so no protection domains would ever get allocated. I don't think anything
> else would break though if a driver were to do that.
In such way, it should be "unsigned int".

> 
> -Denny
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 07/13] IB: add a proper completion queue abstraction

2015-12-10 Thread Bart Van Assche

On 12/07/2015 12:51 PM, Christoph Hellwig wrote:

This adds an abstraction that allows ULP to simply pass a completion

   ^^^

I think this should either be changed into either "an ULP" or "ULPs".


+/**
+ * ib_process_direct_cq - process a CQ in caller context
+ * @cq:CQ to process
+ * @budget:number of CQEs to poll for
+ *
+ * This function is used to process all outstanding CQ entries on a
+ * %IB_POLL_DIRECT CQ.  It does not offload CQ processing to a different
+ * context and does not ask from completion interrupts from the HCA.

   

Should this perhaps be changed into "for" ?


+ *
+ * Note: for compatibility reasons -1 can be passed in %budget for unlimited
+ * polling.  Do not use this feature in new code, it will be remove soon.

^^

Did you perhaps intend "removed" ?



+struct ib_cq *ib_alloc_cq(struct ib_device *dev, void *private,
+   int nr_cqe, int comp_vector, enum ib_poll_context poll_ctx)
+{

> [ ... ]

+   cq->wc = kmalloc_array(IB_POLL_BATCH, sizeof(*cq->wc), GFP_KERNEL);


Why is the wc array allocated separately instead of being embedded in 
struct ib_cq ? I think the faster completion queues can be created the 
better so if it is possible to eliminate the above kmalloc() call I 
would prefer that.



--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -457,10 +457,11 @@ static struct srp_fr_pool *srp_alloc_fr_pool(struct 
srp_target_port *target)
  static void srp_destroy_qp(struct srp_rdma_ch *ch)
  {
static struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
-   static struct ib_recv_wr wr = { .wr_id = SRP_LAST_WR_ID };
+   static struct ib_recv_wr wr = { 0 };
struct ib_recv_wr *bad_wr;
int ret;


Is explicit initialization to "{ 0 }" really needed for static structures ?

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/13] irq_poll: make blk-iopoll available outside the block layer

2015-12-10 Thread Bart Van Assche

On 12/07/2015 12:51 PM, Christoph Hellwig wrote:

-enum {
-   IOPOLL_F_SCHED  = 0,
-   IOPOLL_F_DISABLE= 1,
-};


[ ... ]


+enum {
+   IRQ_POLL_F_SCHED= 0,
+   IRQ_POLL_F_DISABLE  = 1,
+};


A nitpick: the values of these constants were aligned in the original 
code but no longer in the new code. Whether or not this gets addressed:


Reviewed-by: Bart Van Assche 
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 08/13] IB/srpt: chain RDMA READ/WRITE requests

2015-12-10 Thread Bart Van Assche

On 12/07/2015 12:51 PM, Christoph Hellwig wrote:

Remove struct rdma_iu and instead allocate the struct ib_rdma_wr array
early and fill out directly.  This allows us to chain the WRs, and thus
archive both less lock contention on the HCA workqueue as well as much

  ^^^

Did you perhaps intend "achieve" ?


  struct srpt_send_ioctx {
struct srpt_ioctx   ioctx;
struct srpt_rdma_ch *ch;
-   struct rdma_iu  *rdma_ius;
+   struct ib_rdma_wr   *rdma_ius;


Please rename the "rdma_ius" member into "wr" or any other name that 
shows that this member is now a work request array and nothing else.


Otherwise this patch looks fine to me.

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/37] IB/rdmavt: Consolidate dma ops in rdmavt.

2015-12-10 Thread Jason Gunthorpe
On Thu, Dec 10, 2015 at 11:17:09AM -0500, Dennis Dalessandro wrote:
> On Tue, Dec 08, 2015 at 08:08:21AM +0200, Leon Romanovsky wrote:
> >On Mon, Dec 07, 2015 at 03:43:06PM -0500, Dennis Dalessandro wrote:
> >>+
> >>+#define BAD_DMA_ADDRESS ((u64)0)

Just use DMA_ERROR_CODE.

> >>+static u64 rvt_dma_map_single(struct ib_device *dev, void *cpu_addr,
> >>+ size_t size, enum dma_data_direction direction)
> >>+{
> >>+   if (WARN_ON(!valid_dma_direction(direction)))
> >>+   return BAD_DMA_ADDRESS;
> >>+
> >>+   return (u64)cpu_addr;
> >>+}
> >An example of such function.
> 
> Honestly I'm not really sure why it's done this way. We are just following
> the signature of the function in struct ib_dma_mapping_ops.

It is supposed to be a dma_addr_t 'cookie' not a u64.

A patch to cleanup the core in this area would be appreciated.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-10 Thread Doug Ledford
On 12/10/2015 11:10 AM, Steve Wise wrote:
> 
> 
>> -Original Message-
>> From: linux-rdma-ow...@vger.kernel.org 
>> [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Chuck Lever
>> Sent: Thursday, December 10, 2015 10:08 AM
>> To: linux-rdma@vger.kernel.org
>> Cc: ira.weiny; Christoph Hellwig; Jason Gunthorpe; Or Gerlitz; Steve Wise; 
>> Or Gerlitz; Sagi Grimberg; Doug Ledford
>> Subject: Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)
>>
>>
>>> On Dec 10, 2015, at 3:27 AM, Sagi Grimberg  wrote:
>>>
>>>
>>>
 Doug this is going to conflict with the rdmavt work.  So if you take this 
 could
 you respond on the list.
>>>
>>> It will also conflict with the iser remote invalidate series.
>>>
>>> Doug it would help if you share your plans so people can rebase
>>> accordingly.
>>
>> I would be remiss not to mention that it probably also
>> conflicts with the NFS server bi-directional RPC/RDMA
>> series.
>>
>> Invasive IB core changes like this clean up are especially
>> burdensome for me because NFS/RDMA changes do not normally
>> go through Doug's tree, so it takes extra co-ordination.
>>
>> Here is a modest proposal. An obvious way to split the
>> device attr cleanup might go like this:
>>
>> a. first patch: add new fields to ib_device
>> b. then one patch for each provider to populate these fields
>> c. then one patch for each kernel ULP to use the new fields
>> d. then one patch for each provider to remove ->query_attr
>> e. last patch: remove ib_device_attr from the IB core
>>
>> That way each provider and ULP maintainer can review and
>> ack the portion of the changes that he or she is responsible
>> for, and it should help make it much easier to merge with
>> conflicting changes.
>>
>> Splitting it across more than one kernel release would be
>> helpful too, IMO. a. and b. can go into 4.5, c. into 4.6,
>> and d. and e. can go in any time after that.
>>
>> This adds more "process" but given the long chain of core
>> changes now in plan, we should acknowledge how disruptive
>> they will be, and come up with ways to make it possible to
>> get other work done while the core maintenance work
>> progresses.
>>
>>
> 
> The approach sounds reasonable to me.

This isn't a bad plan if we go that way.

Now here's where I get to speak my mind on things (since people have
been asking).


-- 
Doug Ledford 
  GPG KeyID: 0E572FDD




signature.asc
Description: OpenPGP digital signature


Re: [PATCH for-next V1 2/5] IB/core: Add ib_is_udata_cleared

2015-12-10 Thread Matan Barak
On Thu, Dec 10, 2015 at 5:20 PM, Haggai Eran  wrote:
> On 10/12/2015 16:59, Matan Barak wrote:
>> On Mon, Dec 7, 2015 at 3:18 PM, Haggai Eran  wrote:
>>> On 12/03/2015 05:44 PM, Matan Barak wrote:
 Extending core and vendor verb commands require us to check that the
 unknown part of the user's given command is all zeros.
 Adding ib_is_udata_cleared in order to do so.

>>>
>>> Why not copy the data into kernel space and run memchr_inv() on it?
>>>
>>
>> Probably less efficient, isn't it?
> Why do you think it is less efficient?
>
> I'm not sure calling copy_from_user multiple times is very efficient.
> For once, you are calling access_ok multiple times. I guess it depends
> on the amount of data you are copying.
>

Isn't access_ok pretty cheap?
It calls __chk_range_not_ok which on x86 seems like a very cheap
function and __chk_user_ptr which is a compiler check.
I guess most kernel-user implementation will be pretty much in sync,
so we'll possibly call it for a few/dozens of bytes. In that case, I
think this implementation is a bit faster.

>> I know it isn't data path, but we'll execute this code in all extended
>> functions (sometimes even more than once).
> Do you think it is important enough to maintain our own copy of
> memchr_inv()?
>

True, I'm not sure it's important enough, but do you think it's that
complicated?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 07/11] IB/core: Validate route in ib_init_ah_from_wc and ib_init_ah_from_path

2015-12-10 Thread Matan Barak
On Mon, Dec 7, 2015 at 3:42 PM, Haggai Eran  wrote:
> On 12/03/2015 03:47 PM, Matan Barak wrote:
>> +static int addr_resolve_neigh(struct dst_entry *dst,
>> +   const struct sockaddr *dst_in,
>> +   struct rdma_dev_addr *addr)
>> +{
>> + if (dst->dev->flags & IFF_LOOPBACK) {
>> + int ret;
>> +
>> + ret = rdma_translate_ip(dst_in, addr, NULL);
>> + if (!ret)
>> + memcpy(addr->dst_dev_addr, addr->src_dev_addr,
>> +MAX_ADDR_LEN);
>> +
>> + return ret;
>> + }
>> +
>> + /* If the device does ARP internally */
> You mean "doesn't do ARP internally" right?
>

Correct, nice catch :)

>> + if (!(dst->dev->flags & IFF_NOARP)) {
>> + const struct sockaddr_in *dst_in4 =
>> + (const struct sockaddr_in *)dst_in;
>> + const struct sockaddr_in6 *dst_in6 =
>> + (const struct sockaddr_in6 *)dst_in;
>> +
>> + return dst_fetch_ha(dst, addr,
>> + dst_in->sa_family == AF_INET ?
>> + (const void *)_in4->sin_addr.s_addr :
>> + (const void *)_in6->sin6_addr);
>> + }
>> +
>> + return rdma_copy_addr(addr, dst->dev, NULL);
>> +}
>
>> +int rdma_resolve_ip_route(struct sockaddr *src_addr,
>> +   const struct sockaddr *dst_addr,
>> +   struct rdma_dev_addr *addr)
>> +{
>> + struct sockaddr_storage ssrc_addr;
>> + struct sockaddr *src_in = (struct sockaddr *)_addr;
>> +
>> + if (src_addr->sa_family != dst_addr->sa_family)
>> + return -EINVAL;
>> +
>> + if (src_addr)
>> + memcpy(src_in, src_addr, rdma_addr_size(src_addr));
>> + else
>> + src_in->sa_family = dst_addr->sa_family;
> Don't you need to clear the rest of src_in? I believe you pass
> uninitialized memory to it.
>

Correct, I'll fix that.

>> +
>> + return addr_resolve(src_in, dst_addr, addr, false);
>> +}
>> +EXPORT_SYMBOL(rdma_resolve_ip_route);
>
> Haggai

Thanks for taking a look.

Matan

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/37] IB/rdmavt: Consolidate dma ops in rdmavt.

2015-12-10 Thread Leon Romanovsky
On Thu, Dec 10, 2015 at 11:17:09AM -0500, Dennis Dalessandro wrote:
> On Tue, Dec 08, 2015 at 08:08:21AM +0200, Leon Romanovsky wrote:
> >On Mon, Dec 07, 2015 at 03:43:06PM -0500, Dennis Dalessandro wrote:
> >>+
> >>+#define BAD_DMA_ADDRESS ((u64)0)
> >What is the advantage in using directly u64 values instead of
> >pointers? You will get NULL and functions which return pointers
> >without need of casting.
> >
> >...
> >>+static u64 rvt_dma_map_single(struct ib_device *dev, void *cpu_addr,
> >>+ size_t size, enum dma_data_direction direction)
> >>+{
> >>+   if (WARN_ON(!valid_dma_direction(direction)))
> >>+   return BAD_DMA_ADDRESS;
> >>+
> >>+   return (u64)cpu_addr;
> >>+}
> >An example of such function.
> 
> Honestly I'm not really sure why it's done this way. We are just following
> the signature of the function in struct ib_dma_mapping_ops.
Is it worth to consider to implement these functions with the pointers?

> 
> -Denny
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V2 1/1] [PATCH 1/1] Ibacm: default pkey for partitioned fabrics

2015-12-10 Thread kaike . wan
From: Kaike Wan 

In an insecure IB fabric, the default pkey in a port is 0x, where each
node is allowed to talk to any other node in the fabric, including the SA
node. However, in a secure fabric, to limit member access, not all nodes
can have the full-member default pkey 0x. A typical configuration is
to let SA node have pkey 0x while all other nodes have pkey 0x7fff; in
addition, each node can be assigned some other full-member pkeys, such as
0x8001 and 0x8002, so that it can be assigned to different partitions.
In this case, each node can access SA, and yet limits its other access to
only those nodes in its assigned partitions. In such a secure fabric,
however, ibacm will not work by interpreting "default" in its default
address file as 0x.

To solve the problem, we will use pkey 0 as the default pkey, in
line with ipoib convention.

Signed-off-by: Kaike Wan 
---
Change since v1:
- Use pkey 0 as the default pkey for parsing address file.

 src/acm.c |   20 +---
 1 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/src/acm.c b/src/acm.c
index ada0bfb..8b6d762 100644
--- a/src/acm.c
+++ b/src/acm.c
@@ -114,7 +114,8 @@ struct acmc_port {
union ibv_gid   *gid_tbl;
uint16_tlid;
uint16_tlid_mask;
-   int default_pkey_index;
+   int sa_pkey_index;
+   uint16_tdef_acm_pkey;
 };
 
 struct acmc_device {
@@ -2009,7 +2010,7 @@ static int acm_assign_ep_names(struct acmc_ep *ep)
continue;
}
} else {
-   pkey = 0x;
+   pkey = ep->port->def_acm_pkey;
}
 
if (!stricmp(dev_name, dev) &&
@@ -2203,6 +2204,7 @@ static void acm_port_up(struct acmc_port *port)
int i, ret;
struct acmc_prov_context *dev_ctx;
int index = -1;
+   uint16_t first_pkey = 0;
 
acm_log(1, "%s %d\n", port->dev->device.verbs->device->name, 
port->port.port_num);
@@ -2248,15 +2250,18 @@ static void acm_port_up(struct acmc_port *port)
goto err1;
}
 
-   /* Determine the default pkey first.
-  Order of preference: 0x, 0x7fff, first pkey
-   */
+   /* Determine the default pkey for SA access first.
+* Order of preference: 0x, 0x7fff
+* Use the first pkey as the default pkey for parsing address file.
+*/
for (i = 0; i < attr.pkey_tbl_len; i++) {
ret = ibv_query_pkey(port->dev->device.verbs, 
 port->port.port_num, i, );
if (ret)
continue;
pkey = ntohs(pkey);
+   if (i == 0)
+   first_pkey = pkey;
if (pkey == 0x) {
index = i;
break;
@@ -2265,7 +2270,8 @@ static void acm_port_up(struct acmc_port *port)
index = i;
}
}
-   port->default_pkey_index = index < 0 ? 0: index;
+   port->sa_pkey_index = index < 0 ? 0 : index;
+   port->def_acm_pkey = first_pkey;
 
for (i = 0; i < attr.pkey_tbl_len; i++) {
ret = ibv_query_pkey(port->dev->device.verbs, 
@@ -2775,7 +2781,7 @@ int acm_send_sa_mad(struct acm_sa_mad *mad)
mad->umad.addr.qkey = port->sa_addr.qkey;
mad->umad.addr.lid = htons(port->sa_addr.lid);
mad->umad.addr.sl = port->sa_addr.sl;
-   mad->umad.addr.pkey_index = req->ep->port->default_pkey_index;
+   mad->umad.addr.pkey_index = req->ep->port->sa_pkey_index;
 
lock_acquire(>lock);
if (port->sa_credits && DListEmpty(>sa_wait)) {
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/37] IB/rdmavt: Add ib core device attributes to rvt driver params list

2015-12-10 Thread Dennis Dalessandro

On Thu, Dec 10, 2015 at 02:29:18PM +0200, Haggai Eran wrote:

On 07/12/2015 22:43, Dennis Dalessandro wrote:

 struct rvt_dev_info {
+   /*
+* Prior to calling for registration the driver will be responsible for
+* allocating space for this structure. The driver will also need to
+* allocate space for any private device or per port data structures.
+* Alternatively rvt could do this allocation and the registration API
+* would then change to accept an "extra" piece to allocate.

I don't think you need rvt to allocate the private data, but even if you
decide to do that, there's no need for a comment that describes all the
alternative designs here.


Agree. 

This is another of those comments which was meant to provide an early look 
at the design (and alternatives) when I first posted to GitHub.  The 
underlying issue here is who allocates the rvt_dev_info.  

The approach I went with was having the driver do the allocation. This also 
means there is no need for any "private" data in the rvt_dev_info structure.  
If drivers need some private struct they can just create a data structure 
and embed both rvt_dev_info and that private data. See struct qib_ibdev.


This comment needs updated/removed.




+*
+* The driver will also be
+* responsible for filling in certain members of dparms.props
+*/


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/37] IB/rdmavt: Add ib core device attributes to rvt driver params list

2015-12-10 Thread Dennis Dalessandro

On Thu, Dec 10, 2015 at 02:26:11PM +0200, Haggai Eran wrote:

On 07/12/2015 22:43, Dennis Dalessandro wrote:

+   /*
+* Drivers will need to support a number of notifications to rvt in
+* accordance with certain events. This structure should contain a mask
+* of the supported events. Such events that the rvt may need to know
+* about include:
+* port errors
+* port active
+* lid change
+* sm change
+* client reregister
+* pkey change

Where are the event values defined?


They aren't really yet. A lot of the comments like this were put in and made 
available on GitHub for an early look into the design. This will all get 
cleaned up as the code continues to come together.



+*

+* There may also be other events that the rvt layers needs to know
+* about this is not an exhaustive list. Some events though rvt does not
+* need to rely on the driver for such as completion queue error.
+*/
+int rvt_signal_supported;


What will rvt do if it learns that the device doesn't support some of
the events?


Good question. I don't really have a great answer right now. Mostly it sort 
of depends on what events we end up. If it's something critical rdmavt will 
have to check during the registration and fail the call. This is not yet 
implemented in the two patch sets posted.


-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 01/37] IB/rdmavt: Create module framework and handle driver registration

2015-12-10 Thread Dennis Dalessandro

On Thu, Dec 10, 2015 at 01:49:31PM +0200, Haggai Eran wrote:

On 07/12/2015 22:43, Dennis Dalessandro wrote:

+struct rvt_dev_info {
+   struct ib_device ibdev;
+   int (*port_callback)(struct ib_device *, u8, struct kobject *);
+
+   /*
+* TODO:
+*  need to reflect module parameters that may vary by dev
+*/
+};


Could you explain what you mean by the TODO comment? Different device
drivers using rvt will have their own modules with their own parameters,
right? Why do you need them reflected here?


I'm referring to module parameters of the driver which may have some impact 
on rdmavt. Initially things like the size of the qpn table which comes as a 
module parameter for qib and hfi1. However, I've started to fold that stuff 
into the dparms field so this comment can probably just be removed.


-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-10 Thread santosh.shilim...@oracle.com

+Dave,

On 12/10/15 4:49 PM, Chuck Lever wrote:



On Dec 10, 2015, at 6:30 PM, Christoph Hellwig  wrote:

On Thu, Dec 10, 2015 at 11:07:03AM -0700, Jason Gunthorpe wrote:

The ARM folks do this sort of stuff on a regular basis.. Very early on
Doug prepares a topic branch with only the big change, NFS folks pull
it and then pull your work. Then Doug would send the topic branch to
Linus as soon as the merge window opens, then NFS would send theirs.

This is alot less work overall than trying to sequence multiple
patches over multiple releases..


Agreed.  Staging has alaways been a giant pain and things tend to never
finish moving over that way if they are non-trivial enough.


In that case:

You need to make sure you have all the right Acks. I've added
Anna and Bruce to Ack the NFS-related portions. Santosh should
Ack the RDS part.


To make this flow work for RDS, Doug's topic branch needs to
pulled into Dave's net-next as well.


http://git.infradead.org/users/hch/rdma.git/shortlog/refs/heads/ib_device_attr

Given the proximity to the holidays and the next merge window,
Doug will need to get a properly-acked topic branch published
in the next day or two so the rest of us can rebase and start
testing before the relevant parties disappear for the holiday.



Regards,
Santosh
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next] iw_cxgb4: Pass qid range to user space driver

2015-12-10 Thread Hariprasad Shenai
Enhances the t4_dev_status_page to pass the qid start and size
attributes from iw_cxgb4 to libcxgb4.
Bump the ABI Version to 3 -> To allow libcxgb4 to detect old drivers and
revert to the old way of computing the qid ranges.

Signed-off-by: Hariprasad Shenai 
Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/device.c | 4 
 drivers/infiniband/hw/cxgb4/t4.h | 7 +++
 drivers/infiniband/hw/cxgb4/user.h   | 2 +-
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/cxgb4/device.c 
b/drivers/infiniband/hw/cxgb4/device.c
index 1a29739..fab8ba3 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -850,6 +850,10 @@ static int c4iw_rdev_open(struct c4iw_rdev *rdev)
}
rdev->status_page = (struct t4_dev_status_page *)
__get_free_page(GFP_KERNEL);
+   rdev->status_page->qp_start = rdev->lldi.vr->qp.start;
+   rdev->status_page->qp_size = rdev->lldi.vr->qp.size;
+   rdev->status_page->cq_start = rdev->lldi.vr->cq.start;
+   rdev->status_page->cq_size = rdev->lldi.vr->cq.size;
if (!rdev->status_page) {
pr_err(MOD "error allocating status page\n");
goto err4;
diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index 274a7ab..483d99b 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -700,4 +700,11 @@ static inline void t4_set_cq_in_error(struct t4_cq *cq)
 
 struct t4_dev_status_page {
u8 db_off;
+   u8 pad1;
+   u16 pad2;
+   u32 pad3;
+   u64 qp_start;
+   u64 qp_size;
+   u64 cq_start;
+   u64 cq_size;
 };
diff --git a/drivers/infiniband/hw/cxgb4/user.h 
b/drivers/infiniband/hw/cxgb4/user.h
index cbd0ce1..295f422 100644
--- a/drivers/infiniband/hw/cxgb4/user.h
+++ b/drivers/infiniband/hw/cxgb4/user.h
@@ -32,7 +32,7 @@
 #ifndef __C4IW_USER_H__
 #define __C4IW_USER_H__
 
-#define C4IW_UVERBS_ABI_VERSION2
+#define C4IW_UVERBS_ABI_VERSION3
 
 /*
  * Make sure that all structs defined in this file remain laid out so
-- 
2.3.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-10 Thread Chuck Lever

> On Dec 10, 2015, at 3:27 AM, Sagi Grimberg  wrote:
> 
> 
> 
>> Doug this is going to conflict with the rdmavt work.  So if you take this 
>> could
>> you respond on the list.
> 
> It will also conflict with the iser remote invalidate series.
> 
> Doug it would help if you share your plans so people can rebase
> accordingly.

I would be remiss not to mention that it probably also
conflicts with the NFS server bi-directional RPC/RDMA
series.

Invasive IB core changes like this clean up are especially
burdensome for me because NFS/RDMA changes do not normally
go through Doug's tree, so it takes extra co-ordination.

Here is a modest proposal. An obvious way to split the
device attr cleanup might go like this:

a. first patch: add new fields to ib_device
b. then one patch for each provider to populate these fields
c. then one patch for each kernel ULP to use the new fields
d. then one patch for each provider to remove ->query_attr
e. last patch: remove ib_device_attr from the IB core

That way each provider and ULP maintainer can review and
ack the portion of the changes that he or she is responsible
for, and it should help make it much easier to merge with
conflicting changes.

Splitting it across more than one kernel release would be
helpful too, IMO. a. and b. can go into 4.5, c. into 4.6,
and d. and e. can go in any time after that.

This adds more "process" but given the long chain of core
changes now in plan, we should acknowledge how disruptive
they will be, and come up with ways to make it possible to
get other work done while the core maintenance work
progresses.


--
Chuck Lever




--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/37] IB/rdmavt: Consolidate dma ops in rdmavt.

2015-12-10 Thread Jason Gunthorpe
On Thu, Dec 10, 2015 at 04:46:24PM -0800, Christoph Hellwig wrote:
> On Thu, Dec 10, 2015 at 05:29:50PM -0700, Jason Gunthorpe wrote:
> > Hrm.. sizeof(void *) > sizeof(dma_addr_t) seemed pretty obscure to me,
> > here is the original discussion:
> > 
> > https://lkml.org/lkml/2006/12/13/245
> > 
> > Sounds like someone was worried about sparc64. I doubt it is an actual
> > issue today, but granted the u64 did make some sense.
> 
> sparc64 still uses a 32-bit dma_addr_t.

Hmm, too bad, that choice severly compromises the rdma userspace -
user space can't create mrs that exceed 4G in total :(

> > So, I *believe* the issue is that linux has (had?) no approved way to
> > convert from a device specific dma_addr_t to a virtual address.
> 
> Linux doesn't have an approved way because it's impossible for the
> generic case.  When you have an iommu you have potentially multiple
> page tables mapping physical addresses to device virtual addresses,
> and there is no easy way to do a reverse mapping.

Yes, I know

> > It is really too bad we can't just use get_dma_ops to handle this case
> > and instead require our own infrastructure.
> 
> FYI, I have a patch series in linux-next to switches all remaining
> architectures to use get_dma_ops, and there are plans to allow generic
> per-device dma_ops based on that.

Great, so once that is merged we can drop the ib_* versions of all
this and just have qib/etc customize get_dma_ops? Other than the
dma_addr_t size issue that sounds great..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-10 Thread Jason Gunthorpe
On Thu, Dec 10, 2015 at 11:56:50PM -0500, Doug Ledford wrote:

> Looking at struct netdevice, it has the sort of organization I would
> call reasonable.  Things like struct tx_stats is a struct, even though
> it's embedded in the parent struct and not a pointer and there is
> exactly and only one copy, so it could be flat.

struct net_device_stats exists because it is used in many more places
that just in struct net_device, so this is a code sharing thing.

> something I would call well organized.  And speaking of that, in struct
> netdevice, all of the various ops elements are handled as structs,

.. and 'struct net_device_ops' is used extensively.

Whereas once we get rid of the query call ib_device_attr has no second
user.

> including the base netdevice ops, whereas struct ib_device puts the base
> ops flat in the struct.  So if we wanted to be more like struct
> netdevice, we would move the base ops out of struct ib_device.

It would make the drivers a bit nicer if they initialized an ib_ops
structure like netdev does instead of in code, but performance wise
this is probably better, especially if we sorted the struct members
sanely..

> out like this.  An ib_device can be multi-port, and each port can be a
> different RDMA device type.

I thought we figured out that wasn't actually allowed today when
working on the caps rework?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-10 Thread Doug Ledford
On 12/10/2015 06:29 PM, Christoph Hellwig wrote:
> On Thu, Dec 10, 2015 at 12:46:54PM -0500, Doug Ledford wrote:
>> Organization.  Let's be fair, the totally flat namespace you are
>> preferring is the equivalent of a teenager that is completely incapable
>> of picking thier dirty laundry up off the floor.  It is sloppy,
>> disorganized, often full of old cruft that you don't know if you can get
>> rid of or not, often so disorganized you might have three similarly
>> named items that you can't figure out which one should be used in which
>> circumstances, etc.
> 
> The most cruft I've found in major subsystems in years has been
> in the RDMA code, so I'm not sure where that argument comes from.

From reading struct netdevice and trying to work out what all of the
items do.  The portion I didn't like before is gone now, so it doesn't
do any good for me to reference it (it was related to the address lists
for the device).

> We're pretty good at garbage collecting cruft in Linux, and the
> typical counter examples are arbitrarily split structures where it's
> easy to hide things.
> 
>>>  And looking at the existing members of
>>> struct ib_device what determines if it goes straight into the device
>>> or the attribute?
>>
>> Organization.  What goes where depends on what makes sense according to
>> the organization you are doing.
> 
> So what makes num_comp_vectors or phys_port_cnt fit into ib_device,
> while max_qp or max_cq are in struct ib_device_attr?

Historically, the latter two were part of the device query and
represented device maximums (and almost all of ib_device_attr is this
sort of thing).  The first depends on the initialization results for the
PCI dev and is not a device limit that can be queried.  The second is a
high level device property that allows you to know how many devices need
queried.  If you were to ever have per-port struct for device
capabilities, the port count would necessarily need to be outside of
that area.

So, from a logical perspective, ib_device_attr contents can be thought
of as those things you get when you query the card, and only things you
get by querying the card (although not all cards have all items in this
struct), while the others in ib_device are part of other initialization
steps.

> I really like clean data structures, but keeping structures that
> have 1:1 relationships and sit in the same module separate never
> has been a good idea.

Looking at struct netdevice, it has the sort of organization I would
call reasonable.  Things like struct tx_stats is a struct, even though
it's embedded in the parent struct and not a pointer and there is
exactly and only one copy, so it could be flat.

>>>  There is a reason why we don't do this weird
>>> attr split in other Linux subsystems, and making IB follow this pattern
>>> makes everyone feel right at home instead of wondering about the
>>> weird attribute.
>>
>> Being organized is not "weird".  Let's not wax poetic about sloppy,
>> disorganized structures.  Let's be honest about what they are so we
>> don't feel like we need to take a shower every time we talk about them
>> to purge us of the sins of our lies.
> 
> I call that utter BS.  Being organized is exactly not having multiple
> structures that have the same scope or life time, it's actually what
> I call disorganzied.

While I didn't make a huge search for anything, at least the tx stats in
struct netdevice are a direct contradiction to this statement.  They are
something I would call well organized.  And speaking of that, in struct
netdevice, all of the various ops elements are handled as structs,
including the base netdevice ops, whereas struct ib_device puts the base
ops flat in the struct.  So if we wanted to be more like struct
netdevice, we would move the base ops out of struct ib_device.

>  There is a lot to be said about grouping the
> fields in the structure, and that's how sensible subsystems handle it:
> 
> stuct foo_bar {
>   /* read/write in the hot path, keep together and tightly packed: */
>   ...
> 
>   /* read-only in the hot path */
>   ...
> 
>   /* random members: */
>   ...
> 
>   /* properties here, immutable after setup: */
>   ...
> };
> 
> but that's completely inverse to what we're having with ib_device
> currently.

Well, the current struct ib_device is not well laid out.  That I'll
grant.  But it's not because ib_device_attr is a struct.  It simply
isn't well laid out.  Among other things, for the example you give
above, struct ib_device is the wrong layer to be trying to lay things
out like this.  An ib_device can be multi-port, and each port can be a
different RDMA device type.  I would argue then that the hotpath items
need to be done at the port level, not the parent device level.  We
started this with the port_immutable array.  I would argue that
continuing with that work and moving further down that path is the right
way to go.

-- 
Doug Ledford 
  GPG KeyID: 

Re: [PATCH v2 02/10] IB/iser: Reuse ib_sg_to_pages

2015-12-10 Thread Christoph Hellwig
On Thu, Dec 10, 2015 at 10:28:59AM +0200, Sagi Grimberg wrote:
> We can do that. We'd need to add ib_mr a list member just for fmr
> routines.

We'll also need that for a FR pool abstraction that would be very
helpful.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V1 2/5] IB/core: Add ib_is_udata_cleared

2015-12-10 Thread Matan Barak
On Mon, Dec 7, 2015 at 3:18 PM, Haggai Eran  wrote:
> On 12/03/2015 05:44 PM, Matan Barak wrote:
>> Extending core and vendor verb commands require us to check that the
>> unknown part of the user's given command is all zeros.
>> Adding ib_is_udata_cleared in order to do so.
>>
>
> Why not copy the data into kernel space and run memchr_inv() on it?
>

Probably less efficient, isn't it?
I know it isn't data path, but we'll execute this code in all extended
functions (sometimes even more than once).

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] staging/rdma/hfi1: HFI now sends OPA Traps instead of IBTA

2015-12-10 Thread Mike Marciniszyn
From: Erik E. Kahn 

send_trap() was still using old ib_smp instead of opa_smp
for formatting and sending traps.

Reviewed-by: Arthur Kepner 
Reviewed-by: Ira Weiny 
Reviewed-by: Dennis Dalessandro 
Signed-off-by: Jubin John 
Signed-off-by: Erik E. Kahn 
---
 drivers/staging/rdma/hfi1/mad.c   |  121 +++--
 drivers/staging/rdma/hfi1/ruc.c   |   10 ++-
 drivers/staging/rdma/hfi1/ud.c|   21 +++---
 drivers/staging/rdma/hfi1/verbs.h |2 -
 4 files changed, 79 insertions(+), 75 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/mad.c b/drivers/staging/rdma/hfi1/mad.c
index a122565..28c3ad2 100644
--- a/drivers/staging/rdma/hfi1/mad.c
+++ b/drivers/staging/rdma/hfi1/mad.c
@@ -84,7 +84,7 @@ static void send_trap(struct hfi1_ibport *ibp, void *data, 
unsigned len)
 {
struct ib_mad_send_buf *send_buf;
struct ib_mad_agent *agent;
-   struct ib_smp *smp;
+   struct opa_smp *smp;
int ret;
unsigned long flags;
unsigned long timeout;
@@ -117,15 +117,15 @@ static void send_trap(struct hfi1_ibport *ibp, void 
*data, unsigned len)
return;
 
smp = send_buf->mad;
-   smp->base_version = IB_MGMT_BASE_VERSION;
+   smp->base_version = OPA_MGMT_BASE_VERSION;
smp->mgmt_class = IB_MGMT_CLASS_SUBN_LID_ROUTED;
-   smp->class_version = 1;
+   smp->class_version = OPA_SMI_CLASS_VERSION;
smp->method = IB_MGMT_METHOD_TRAP;
ibp->tid++;
smp->tid = cpu_to_be64(ibp->tid);
smp->attr_id = IB_SMP_ATTR_NOTICE;
/* o14-1: smp->mkey = 0; */
-   memcpy(smp->data, data, len);
+   memcpy(smp->route.lid.data, data, len);
 
spin_lock_irqsave(>lock, flags);
if (!ibp->sm_ah) {
@@ -164,11 +164,16 @@ static void send_trap(struct hfi1_ibport *ibp, void 
*data, unsigned len)
  * Send a bad [PQ]_Key trap (ch. 14.3.8).
  */
 void hfi1_bad_pqkey(struct hfi1_ibport *ibp, __be16 trap_num, u32 key, u32 sl,
-   u32 qp1, u32 qp2, __be16 lid1, __be16 lid2)
+   u32 qp1, u32 qp2, u16 lid1, u16 lid2)
 {
-   struct ib_mad_notice_attr data;
+   struct opa_mad_notice_attr data;
+   u32 lid = ppd_from_ibp(ibp)->lid;
+   u32 _lid1 = lid1;
+   u32 _lid2 = lid2;
 
-   if (trap_num == IB_NOTICE_TRAP_BAD_PKEY)
+   memset(, 0, sizeof(data));
+
+   if (trap_num == OPA_TRAP_BAD_P_KEY)
ibp->pkey_violations++;
else
ibp->qkey_violations++;
@@ -176,17 +181,15 @@ void hfi1_bad_pqkey(struct hfi1_ibport *ibp, __be16 
trap_num, u32 key, u32 sl,
 
/* Send violation trap */
data.generic_type = IB_NOTICE_TYPE_SECURITY;
-   data.prod_type_msb = 0;
data.prod_type_lsb = IB_NOTICE_PROD_CA;
data.trap_num = trap_num;
-   data.issuer_lid = cpu_to_be16(ppd_from_ibp(ibp)->lid);
-   data.toggle_count = 0;
-   memset(, 0, sizeof(data.details));
-   data.details.ntc_257_258.lid1 = lid1;
-   data.details.ntc_257_258.lid2 = lid2;
-   data.details.ntc_257_258.key = cpu_to_be32(key);
-   data.details.ntc_257_258.sl_qp1 = cpu_to_be32((sl << 28) | qp1);
-   data.details.ntc_257_258.qp2 = cpu_to_be32(qp2);
+   data.issuer_lid = cpu_to_be32(lid);
+   data.ntc_257_258.lid1 = cpu_to_be32(_lid1);
+   data.ntc_257_258.lid2 = cpu_to_be32(_lid2);
+   data.ntc_257_258.key = cpu_to_be32(key);
+   data.ntc_257_258.sl = sl << 3;
+   data.ntc_257_258.qp1 = cpu_to_be32(qp1);
+   data.ntc_257_258.qp2 = cpu_to_be32(qp2);
 
send_trap(ibp, , sizeof(data));
 }
@@ -197,32 +200,30 @@ void hfi1_bad_pqkey(struct hfi1_ibport *ibp, __be16 
trap_num, u32 key, u32 sl,
 static void bad_mkey(struct hfi1_ibport *ibp, struct ib_mad_hdr *mad,
 __be64 mkey, __be32 dr_slid, u8 return_path[], u8 hop_cnt)
 {
-   struct ib_mad_notice_attr data;
+   struct opa_mad_notice_attr data;
+   u32 lid = ppd_from_ibp(ibp)->lid;
 
+   memset(, 0, sizeof(data));
/* Send violation trap */
data.generic_type = IB_NOTICE_TYPE_SECURITY;
-   data.prod_type_msb = 0;
data.prod_type_lsb = IB_NOTICE_PROD_CA;
-   data.trap_num = IB_NOTICE_TRAP_BAD_MKEY;
-   data.issuer_lid = cpu_to_be16(ppd_from_ibp(ibp)->lid);
-   data.toggle_count = 0;
-   memset(, 0, sizeof(data.details));
-   data.details.ntc_256.lid = data.issuer_lid;
-   data.details.ntc_256.method = mad->method;
-   data.details.ntc_256.attr_id = mad->attr_id;
-   data.details.ntc_256.attr_mod = mad->attr_mod;
-   data.details.ntc_256.mkey = mkey;
+   data.trap_num = OPA_TRAP_BAD_M_KEY;
+   data.issuer_lid = cpu_to_be32(lid);
+   data.ntc_256.lid = data.issuer_lid;
+   data.ntc_256.method = mad->method;
+   data.ntc_256.attr_id = mad->attr_id;
+   

[PATCH 0/2] Convert IBTA traps to OPA traps

2015-12-10 Thread Mike Marciniszyn
This two patch series gets rid of the vestigal use of IBTA
traps in the hfi1 driver.
---

Erik E. Kahn (1):
  staging/rdma/hfi1: HFI now sends OPA Traps instead of IBTA

Jubin John (1):
  staging/rdma/hfi1: add definitions for OPA traps


 drivers/staging/rdma/hfi1/mad.c   |  121 +++--
 drivers/staging/rdma/hfi1/mad.h   |  114 +++
 drivers/staging/rdma/hfi1/ruc.c   |   10 ++-
 drivers/staging/rdma/hfi1/ud.c|   21 +++---
 drivers/staging/rdma/hfi1/verbs.h |2 -
 5 files changed, 193 insertions(+), 75 deletions(-)

-- 
Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 26/37] IB/rdmavt: Move memory registration into rdmavt

2015-12-10 Thread Sagi Grimberg




  /**
@@ -83,7 +357,31 @@ struct ib_mr *rvt_reg_phys_mr(struct ib_pd *pd,
  int num_phys_buf, int acc, u64 *iova_start)
  {


Why do we even bother with reg_phys_mr? Let's drop it entirely.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: net/mlx5: E-Switch, Add SR-IOV (FDB) support

2015-12-10 Thread Saeed Mahameed
On Thu, Dec 10, 2015 at 12:35 PM, Dan Carpenter
 wrote:
> Hello Saeed Mahameed,
>
> The patch 81848731ff40: "net/mlx5: E-Switch, Add SR-IOV (FDB)
> support" from Dec 1, 2015, leads to the following static checker
> warning:
>
> drivers/net/ethernet/mellanox/mlx5/core/eswitch.c:579 
> esw_fdb_set_vport_rule()
> warn: passing zero to 'PTR_ERR'
>
> drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
>568  esw_debug(esw->dev,
>569"\tFDB add rule dmac_v(%pM) dmac_c(%pM) -> 
> vport(%d)\n",
>570dmac_v, dmac_c, vport);
>571  flow_rule =
>572  mlx5_add_flow_rule(esw,
>573 match_header,
>574 match_c,
>575 match_v,
>576 MLX5_FLOW_CONTEXT_ACTION_FWD_DEST,
>577 0, );
>578  if (IS_ERR_OR_NULL(flow_rule)) {
>
> mlx5_add_flow_rule() only returns NULL on error.  It never returns
> ERR_PTRs.
>
>579  pr_warn(
>580  "FDB: Failed to add flow rule: dmac_v(%pM) 
> dmac_c(%pM) -> vport(%d), err(%ld)\n",
>   
>
> It's not a terrible bug but this always says "err(0)" which is not very
> useful, and it causes this static checker warning.
Hi Dan,

Thanks for pointing this out, this is already fixed in the latest flow
steering patches sent to net-next.
"net/mlx5: Use flow steering infrastructure for mlx5_en"

>
>581   dmac_v, dmac_c, vport, PTR_ERR(flow_rule));
>582  flow_rule = NULL;
>583  }
>584  out:
>585  kfree(match_v);
>586  kfree(match_c);
>587  return flow_rule;
>588  }
>
> regards,
> dan carpenter
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH libmlx5 V1 6/6] Add always_inline check

2015-12-10 Thread Matan Barak
On Mon, Dec 7, 2015 at 3:07 PM, Haggai Eran  wrote:
> On 12/03/2015 06:02 PM, Matan Barak wrote:
>> Always inline isn't supported by every compiler. Adding it to
>> configure.ac in order to support it only when possible.
>> Inline other poll_one data path functions in order to eliminate
>> "ifs".
>>
>> Signed-off-by: Matan Barak 
>> ---
>>  configure.ac | 17 +
>>  src/cq.c | 42 +-
>>  src/mlx5.h   |  6 ++
>>  3 files changed, 52 insertions(+), 13 deletions(-)
>>
>> diff --git a/configure.ac b/configure.ac
>> index fca0b46..50b4f9c 100644
>> --- a/configure.ac
>> +++ b/configure.ac
>> @@ -65,6 +65,23 @@ AC_CHECK_FUNC(ibv_read_sysfs_file, [],
>>  AC_MSG_ERROR([ibv_read_sysfs_file() not found.  libmlx5 requires 
>> libibverbs >= 1.0.3.]))
>>  AC_CHECK_FUNCS(ibv_dontfork_range ibv_dofork_range ibv_register_driver)
>>
>> +AC_MSG_CHECKING("always inline")
> Did you consider using an existing script like AX_GCC_FUNC_ATTRIBUTE [1]?
>

That's probably better, I'll change.

>> +CFLAGS_BAK="$CFLAGS"
>> +CFLAGS="$CFLAGS -Werror"
>> +AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
>> + static inline int f(void)
>> + __attribute__((always_inline));
>> + static inline int f(void)
>> + {
>> + return 1;
>> + }
>> +]],[[
>> + int a = f();
>> + a = a;
>> +]])], [AC_MSG_RESULT([yes]) AC_DEFINE([HAVE_ALWAYS_INLINE], [1], [Define if 
>> __attribute((always_inline)).])]
> The description here doesn't look right. How about "Define if
> __attribute__((always_inline) is supported"?
>

If I use AX_GCC_FUNC_ATTRIBUTE, I don't need this anymore.

> Regards,
> Haggai
>
> [1] https://www.gnu.org/software/autoconf-archive/ax_gcc_func_attribute.html

Thanks for the review.

Regards,
Matan

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] staging/rdma/hfi1: add definitions for OPA traps

2015-12-10 Thread Mike Marciniszyn
From: Jubin John 

These new definitions will be used by follow-on
patches for formating and sending OPA traps.

Reviewed-by: Ira Weiny 
Signed-off-by: Jubin John 
---
 drivers/staging/rdma/hfi1/mad.h |  114 +++
 1 file changed, 114 insertions(+)

diff --git a/drivers/staging/rdma/hfi1/mad.h b/drivers/staging/rdma/hfi1/mad.h
index 4745750..f031775 100644
--- a/drivers/staging/rdma/hfi1/mad.h
+++ b/drivers/staging/rdma/hfi1/mad.h
@@ -60,7 +60,121 @@
 #endif
 #include "opa_compat.h"
 
+/*
+ * OPA Traps
+ */
+#define OPA_TRAP_GID_NOW_IN_SERVICE cpu_to_be16(64)
+#define OPA_TRAP_GID_OUT_OF_SERVICE cpu_to_be16(65)
+#define OPA_TRAP_ADD_MULTICAST_GROUPcpu_to_be16(66)
+#define OPA_TRAL_DEL_MULTICAST_GROUPcpu_to_be16(67)
+#define OPA_TRAP_UNPATH cpu_to_be16(68)
+#define OPA_TRAP_REPATH cpu_to_be16(69)
+#define OPA_TRAP_PORT_CHANGE_STATE  cpu_to_be16(128)
+#define OPA_TRAP_LINK_INTEGRITY cpu_to_be16(129)
+#define OPA_TRAP_EXCESSIVE_BUFFER_OVERRUN   cpu_to_be16(130)
+#define OPA_TRAP_FLOW_WATCHDOG  cpu_to_be16(131)
+#define OPA_TRAP_CHANGE_CAPABILITY  cpu_to_be16(144)
+#define OPA_TRAP_CHANGE_SYSGUID cpu_to_be16(145)
+#define OPA_TRAP_BAD_M_KEY  cpu_to_be16(256)
+#define OPA_TRAP_BAD_P_KEY  cpu_to_be16(257)
+#define OPA_TRAP_BAD_Q_KEY  cpu_to_be16(258)
+#define OPA_TRAP_SWITCH_BAD_PKEYcpu_to_be16(259)
+#define OPA_SMA_TRAP_DATA_LINK_WIDTHcpu_to_be16(2048)
 
+/*
+ * Generic trap/notice other local changes flags (trap 144).
+ */
+#defineOPA_NOTICE_TRAP_LWDE_CHG0x08 /* Link Width Downgrade 
Enable
+ * changed
+ */
+#define OPA_NOTICE_TRAP_LSE_CHG 0x04 /* Link Speed Enable changed */
+#define OPA_NOTICE_TRAP_LWE_CHG 0x02 /* Link Width Enable changed */
+#define OPA_NOTICE_TRAP_NODE_DESC_CHG   0x01
+
+struct opa_mad_notice_attr {
+   u8 generic_type;
+   u8 prod_type_msb;
+   __be16 prod_type_lsb;
+   __be16 trap_num;
+   __be16 toggle_count;
+   __be32 issuer_lid;
+   __be32 reserved1;
+   union ib_gid issuer_gid;
+
+   union {
+   struct {
+   u8  details[64];
+   } raw_data;
+
+   struct {
+   union ib_gidgid;
+   } __packed ntc_64_65_66_67;
+
+   struct {
+   __be32  lid;
+   } __packed ntc_128;
+
+   struct {
+   __be32  lid;/* where violation happened */
+   u8  port_num;   /* where violation happened */
+   } __packed ntc_129_130_131;
+
+   struct {
+   __be32  lid;/* LID where change occurred */
+   __be32  new_cap_mask;   /* new capability mask */
+   __be16  reserved2;
+   __be16  cap_mask;
+   __be16  change_flags;   /* low 4 bits only */
+   } __packed ntc_144;
+
+   struct {
+   __be64  new_sys_guid;
+   __be32  lid;/* lid where sys guid changed */
+   } __packed ntc_145;
+
+   struct {
+   __be32  lid;
+   __be32  dr_slid;
+   u8  method;
+   u8  dr_trunc_hop;
+   __be16  attr_id;
+   __be32  attr_mod;
+   __be64  mkey;
+   u8  dr_rtn_path[30];
+   } __packed ntc_256;
+
+   struct {
+   __be32  lid1;
+   __be32  lid2;
+   __be32  key;
+   u8  sl; /* SL: high 5 bits */
+   u8  reserved3[3];
+   union ib_gidgid1;
+   union ib_gidgid2;
+   __be32  qp1;/* high 8 bits reserved */
+   __be32  qp2;/* high 8 bits reserved */
+   } __packed ntc_257_258;
+
+   struct {
+   __be16  flags;  /* low 8 bits reserved */
+   __be16  pkey;
+   __be32  lid1;
+   __be32  lid2;
+   u8  sl; /* SL: high 5 bits */
+   u8  reserved4[3];
+   union ib_gidgid1;
+   union ib_gidgid2;
+  

Re: [PATCH 37/37] IB/rdmavt: Add support for new memory registration API

2015-12-10 Thread Sagi Grimberg



commit 38071a461f0a ("IB/qib: Support the new memory registration API")"
added support for map_mr_sg to qib. This patch brings that
support into rdmavt, it also adds a register mr function that will be
called from the post send routine which is inline with the qib patch.


Hi Dennis and Ira,

This question is not directly related to this patch, but given that
this is a copy-paste from the qib driver I'll go ahead and take it
anyway. How does qib (and rvt now) do memory key invalidation? I didn't
see any reference to IB_WR_LOCAL_INV anywhere in the qib driver...

What am I missing?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 26/37] IB/rdmavt: Move memory registration into rdmavt

2015-12-10 Thread Christoph Hellwig
On Thu, Dec 10, 2015 at 10:14:56AM -0500, Dennis Dalessandro wrote:
> Why? Because, it exists in qib and hfi1.
> 
> However, it seems no one is actually using this in the kernel these days and
> the core support was removed in commit 1241d7bf. Yet the function pointer
> still exists in struct ib_device. Are there plans to remove this?

Yes, I've send a patch that remove it, and which only got positive
review so far.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/37] IB/rdmavt: Add ib core device attributes to rvt driver params list

2015-12-10 Thread Haggai Eran
On 10/12/2015 15:25, Dennis Dalessandro wrote:
> On Thu, Dec 10, 2015 at 02:26:11PM +0200, Haggai Eran wrote:
>> On 07/12/2015 22:43, Dennis Dalessandro wrote:
>>> +/*
>>> + * Drivers will need to support a number of notifications to rvt in
>>> + * accordance with certain events. This structure should contain
>>> a mask
>>> + * of the supported events. Such events that the rvt may need to
>>> know
>>> + * about include:
>>> + * port errors
>>> + * port active
>>> + * lid change
>>> + * sm change
>>> + * client reregister
>>> + * pkey change
>> Where are the event values defined?
> 
> They aren't really yet. A lot of the comments like this were put in and
> made available on GitHub for an early look into the design. This will
> all get cleaned up as the code continues to come together.
Okay, Great. Since the events above are already defined as
ib_event/ib_event_type maybe you can reuse them somehow.

 + *
>>> + * There may also be other events that the rvt layers needs to know
>>> + * about this is not an exhaustive list. Some events though rvt
>>> does not
>>> + * need to rely on the driver for such as completion queue error.
>>> + */
>>> + int rvt_signal_supported;
>>
>> What will rvt do if it learns that the device doesn't support some of
>> the events?
> 
> Good question. I don't really have a great answer right now. Mostly it
> sort of depends on what events we end up. If it's something critical
> rdmavt will have to check during the registration and fail the call.
> This is not yet implemented in the two patch sets posted.
I'm not sure that's needed. If the driver doesn't satisfy what rvt
expects from it they can break in many ways. I'm not sure this specific
instance requires a check during registration. But it really depends on
the specific drivers and the specific events we're going to have.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 26/37] IB/rdmavt: Move memory registration into rdmavt

2015-12-10 Thread Dennis Dalessandro

On Thu, Dec 10, 2015 at 04:24:44PM +0200, Sagi Grimberg wrote:




  /**
@@ -83,7 +357,31 @@ struct ib_mr *rvt_reg_phys_mr(struct ib_pd *pd,
  int num_phys_buf, int acc, u64 *iova_start)
  {


Why do we even bother with reg_phys_mr? Let's drop it entirely.


Why? Because, it exists in qib and hfi1.

However, it seems no one is actually using this in the kernel these days and 
the core support was removed in commit 1241d7bf. Yet the function pointer 
still exists in struct ib_device. Are there plans to remove this?


For this patch series I'll go ahead and remove rvt_reg_phys_mr.

-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/37] IB/rdmavt: Add ib core device attributes to rvt driver params list

2015-12-10 Thread Dennis Dalessandro

On Thu, Dec 10, 2015 at 05:12:52PM +0200, Haggai Eran wrote:

On 10/12/2015 15:25, Dennis Dalessandro wrote:

On Thu, Dec 10, 2015 at 02:26:11PM +0200, Haggai Eran wrote:

On 07/12/2015 22:43, Dennis Dalessandro wrote:

+/*
+ * Drivers will need to support a number of notifications to rvt in
+ * accordance with certain events. This structure should contain
a mask
+ * of the supported events. Such events that the rvt may need to
know
+ * about include:
+ * port errors
+ * port active
+ * lid change
+ * sm change
+ * client reregister
+ * pkey change

Where are the event values defined?


They aren't really yet. A lot of the comments like this were put in and
made available on GitHub for an early look into the design. This will
all get cleaned up as the code continues to come together.

Okay, Great. Since the events above are already defined as
ib_event/ib_event_type maybe you can reuse them somehow.


Yeah, I don't think there is a need to redefine this.


+ *

+ * There may also be other events that the rvt layers needs to know
+ * about this is not an exhaustive list. Some events though rvt
does not
+ * need to rely on the driver for such as completion queue error.
+ */
+ int rvt_signal_supported;


What will rvt do if it learns that the device doesn't support some of
the events?


Good question. I don't really have a great answer right now. Mostly it
sort of depends on what events we end up. If it's something critical
rdmavt will have to check during the registration and fail the call.
This is not yet implemented in the two patch sets posted.

I'm not sure that's needed. If the driver doesn't satisfy what rvt
expects from it they can break in many ways. I'm not sure this specific
instance requires a check during registration. But it really depends on
the specific drivers and the specific events we're going to have.


Basically if the driver is just not going to function because it doesn't 
provide something, that's probably not such a big deal. If it can lead to a 
kernel panic then we probably want to validate. In the end, yes, it just all 
depends on what we end up with.


-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V1 2/5] IB/core: Add ib_is_udata_cleared

2015-12-10 Thread Haggai Eran
On 10/12/2015 16:59, Matan Barak wrote:
> On Mon, Dec 7, 2015 at 3:18 PM, Haggai Eran  wrote:
>> On 12/03/2015 05:44 PM, Matan Barak wrote:
>>> Extending core and vendor verb commands require us to check that the
>>> unknown part of the user's given command is all zeros.
>>> Adding ib_is_udata_cleared in order to do so.
>>>
>>
>> Why not copy the data into kernel space and run memchr_inv() on it?
>>
> 
> Probably less efficient, isn't it?
Why do you think it is less efficient?

I'm not sure calling copy_from_user multiple times is very efficient.
For once, you are calling access_ok multiple times. I guess it depends
on the amount of data you are copying.

> I know it isn't data path, but we'll execute this code in all extended
> functions (sometimes even more than once).
Do you think it is important enough to maintain our own copy of
memchr_inv()?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-10 Thread Steve Wise


> -Original Message-
> From: linux-rdma-ow...@vger.kernel.org 
> [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Chuck Lever
> Sent: Thursday, December 10, 2015 10:08 AM
> To: linux-rdma@vger.kernel.org
> Cc: ira.weiny; Christoph Hellwig; Jason Gunthorpe; Or Gerlitz; Steve Wise; Or 
> Gerlitz; Sagi Grimberg; Doug Ledford
> Subject: Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)
> 
> 
> > On Dec 10, 2015, at 3:27 AM, Sagi Grimberg  wrote:
> >
> >
> >
> >> Doug this is going to conflict with the rdmavt work.  So if you take this 
> >> could
> >> you respond on the list.
> >
> > It will also conflict with the iser remote invalidate series.
> >
> > Doug it would help if you share your plans so people can rebase
> > accordingly.
> 
> I would be remiss not to mention that it probably also
> conflicts with the NFS server bi-directional RPC/RDMA
> series.
> 
> Invasive IB core changes like this clean up are especially
> burdensome for me because NFS/RDMA changes do not normally
> go through Doug's tree, so it takes extra co-ordination.
> 
> Here is a modest proposal. An obvious way to split the
> device attr cleanup might go like this:
> 
> a. first patch: add new fields to ib_device
> b. then one patch for each provider to populate these fields
> c. then one patch for each kernel ULP to use the new fields
> d. then one patch for each provider to remove ->query_attr
> e. last patch: remove ib_device_attr from the IB core
> 
> That way each provider and ULP maintainer can review and
> ack the portion of the changes that he or she is responsible
> for, and it should help make it much easier to merge with
> conflicting changes.
> 
> Splitting it across more than one kernel release would be
> helpful too, IMO. a. and b. can go into 4.5, c. into 4.6,
> and d. and e. can go in any time after that.
> 
> This adds more "process" but given the long chain of core
> changes now in plan, we should acknowledge how disruptive
> they will be, and come up with ways to make it possible to
> get other work done while the core maintenance work
> progresses.
> 
> 

The approach sounds reasonable to me.

Steve.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] staging/rdma/hfi1: Fix a possible null pointer dereference

2015-12-10 Thread Mike Marciniszyn
From: Easwar Hariharan 

A code inspection pointed out that kmalloc_array may return NULL and
memset doesn't check the input pointer for NULL, resulting in a possible
NULL dereference. This patch fixes this.

Reviewed-by: Mike Marciniszyn 
Signed-off-by: Easwar Hariharan 
---
 drivers/staging/rdma/hfi1/chip.c |2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/staging/rdma/hfi1/chip.c b/drivers/staging/rdma/hfi1/chip.c
index dc69159..49d49b2 100644
--- a/drivers/staging/rdma/hfi1/chip.c
+++ b/drivers/staging/rdma/hfi1/chip.c
@@ -10129,6 +10129,8 @@ static void init_qos(struct hfi1_devdata *dd, u32 
first_ctxt)
if (num_vls * qpns_per_vl > dd->chip_rcv_contexts)
goto bail;
rsmmap = kmalloc_array(NUM_MAP_REGS, sizeof(u64), GFP_KERNEL);
+   if (!rsmmap)
+   goto bail;
memset(rsmmap, rxcontext, NUM_MAP_REGS * sizeof(u64));
/* init the local copy of the table */
for (i = 0, ctxt = first_ctxt; i < num_vls; i++) {

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/37] IB/rdmavt: Consolidate dma ops in rdmavt.

2015-12-10 Thread Dennis Dalessandro

On Tue, Dec 08, 2015 at 08:08:21AM +0200, Leon Romanovsky wrote:

On Mon, Dec 07, 2015 at 03:43:06PM -0500, Dennis Dalessandro wrote:

+
+#define BAD_DMA_ADDRESS ((u64)0)

What is the advantage in using directly u64 values instead of
pointers? You will get NULL and functions which return pointers
without need of casting.

...

+static u64 rvt_dma_map_single(struct ib_device *dev, void *cpu_addr,
+ size_t size, enum dma_data_direction direction)
+{
+   if (WARN_ON(!valid_dma_direction(direction)))
+   return BAD_DMA_ADDRESS;
+
+   return (u64)cpu_addr;
+}

An example of such function.


Honestly I'm not really sure why it's done this way. We are just following 
the signature of the function in struct ib_dma_mapping_ops.


-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/37] IB/rdmavt: Add protection domain to rdmavt.

2015-12-10 Thread Dennis Dalessandro

On Tue, Dec 08, 2015 at 08:28:17AM +0200, Leon Romanovsky wrote:

On Mon, Dec 07, 2015 at 03:43:10PM -0500, Dennis Dalessandro wrote:

+
+/*
+ * Things that are driver specific, module parameters in hfi1 and qib
+ */
+struct rvt_driver_params {
+   int max_pds;

Can it be negative value?

+};


If so no protection domains would ever get allocated. I don't think anything 
else would break though if a driver were to do that.


-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html