[PATCH 2/2] IB/mlx4: Convert kmalloc to be kmalloc_array to fix checkpatch warnings

2015-12-16 Thread Leon Romanovsky
From: Leon Romanovsky 

Convert kmalloc() to kmalloc_array() to fix the checkpatch warnings below:

WARNING: Prefer kmalloc_array over kmalloc with multiply
+   qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof(u64),

WARNING: Prefer kmalloc_array over kmalloc with multiply
+   qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof(u64),

WARNING: Prefer kmalloc_array over kmalloc with multiply
+   srq->wrid = kmalloc(srq->msrq.max * sizeof(u64),

Signed-off-by: Leon Romanovsky 
Reviewed-by: Or Gerlitz 
---
 drivers/infiniband/hw/mlx4/qp.c  | 4 ++--
 drivers/infiniband/hw/mlx4/srq.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index dc86975fe1a9..70de13ed9da7 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -796,12 +796,12 @@ static int create_qp_common(struct mlx4_ib_dev *dev, 
struct ib_pd *pd,
if (err)
goto err_mtt;
 
-   qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof(u64),
+   qp->sq.wrid = kmalloc_array(qp->sq.wqe_cnt, sizeof(u64),
gfp | __GFP_NOWARN);
if (!qp->sq.wrid)
qp->sq.wrid = __vmalloc(qp->sq.wqe_cnt * sizeof(u64),
gfp, PAGE_KERNEL);
-   qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof(u64),
+   qp->rq.wrid = kmalloc_array(qp->rq.wqe_cnt, sizeof(u64),
gfp | __GFP_NOWARN);
if (!qp->rq.wrid)
qp->rq.wrid = __vmalloc(qp->rq.wqe_cnt * sizeof(u64),
diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c
index f416c7463827..68d5a5fda271 100644
--- a/drivers/infiniband/hw/mlx4/srq.c
+++ b/drivers/infiniband/hw/mlx4/srq.c
@@ -171,7 +171,7 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd,
if (err)
goto err_mtt;
 
-   srq->wrid = kmalloc(srq->msrq.max * sizeof(u64),
+   srq->wrid = kmalloc_array(srq->msrq.max, sizeof(u64),
GFP_KERNEL | __GFP_NOWARN);
if (!srq->wrid) {
srq->wrid = __vmalloc(srq->msrq.max * sizeof(u64),
-- 
1.7.12.4
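[Editorial aside: checkpatch prefers kmalloc_array() because the open-coded `n * size` multiply can overflow and silently allocate a too-small buffer. The sketch below is a userspace analogue (alloc_array() is a hypothetical stand-in, not the kernel API) of the overflow check kmalloc_array() performs before multiplying.]

```c
#include <stdlib.h>
#include <stdint.h>

/*
 * Userspace sketch of the kmalloc_array() overflow check: refuse the
 * allocation when n * size would wrap around instead of returning a
 * buffer smaller than the caller expects.
 */
static void *alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* n * size would overflow */
	return malloc(n * size);
}
```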

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] IB/mlx4: Suppress memory allocations warnings in kmalloc->__vmalloc flows

2015-12-16 Thread Leon Romanovsky
From: Leon Romanovsky 

A failed kmalloc() allocation emits a warning. Such warnings are no
longer needed, since commit 0ef2f05c7e02 ("IB/mlx4: Use vmalloc for
WR buffers when needed") added a fallback from kmalloc() to
__vmalloc().

Signed-off-by: Leon Romanovsky 
Reviewed-by: Or Gerlitz 
---
 drivers/infiniband/hw/mlx4/qp.c  | 6 --
 drivers/infiniband/hw/mlx4/srq.c | 3 ++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 13eaaf45288f..dc86975fe1a9 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -796,11 +796,13 @@ static int create_qp_common(struct mlx4_ib_dev *dev, 
struct ib_pd *pd,
if (err)
goto err_mtt;
 
-   qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof(u64), gfp);
+   qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof(u64),
+   gfp | __GFP_NOWARN);
if (!qp->sq.wrid)
qp->sq.wrid = __vmalloc(qp->sq.wqe_cnt * sizeof(u64),
gfp, PAGE_KERNEL);
-   qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof(u64), gfp);
+   qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof(u64),
+   gfp | __GFP_NOWARN);
if (!qp->rq.wrid)
qp->rq.wrid = __vmalloc(qp->rq.wqe_cnt * sizeof(u64),
gfp, PAGE_KERNEL);
diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c
index 8d133c40fa0e..f416c7463827 100644
--- a/drivers/infiniband/hw/mlx4/srq.c
+++ b/drivers/infiniband/hw/mlx4/srq.c
@@ -171,7 +171,8 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd,
if (err)
goto err_mtt;
 
-   srq->wrid = kmalloc(srq->msrq.max * sizeof (u64), GFP_KERNEL);
+   srq->wrid = kmalloc(srq->msrq.max * sizeof(u64),
+   GFP_KERNEL | __GFP_NOWARN);
if (!srq->wrid) {
srq->wrid = __vmalloc(srq->msrq.max * sizeof(u64),
  GFP_KERNEL, PAGE_KERNEL);
-- 
1.7.12.4
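[Editorial aside: the pattern this patch annotates with __GFP_NOWARN is "try the fast, physically contiguous allocator first; fall back to a second allocator on failure". Since the first failure is expected and handled, warning about it is just noise. A userspace sketch of the shape of that pattern, with primary()/fallback() as stand-ins for kmalloc()/__vmalloc():]

```c
#include <stdlib.h>

/*
 * Try the preferred allocator first; if it fails, fall back to the
 * secondary one. In the kernel the first attempt carries __GFP_NOWARN
 * because its failure is anticipated and recovered from here.
 */
typedef void *(*alloc_fn)(size_t);

static void *alloc_with_fallback(size_t bytes, alloc_fn primary,
				 alloc_fn fallback)
{
	void *buf = primary(bytes);

	if (!buf)
		buf = fallback(bytes);	/* second chance */
	return buf;
}

/* stand-in for an allocator that cannot satisfy the request */
static void *always_fail(size_t bytes)
{
	(void)bytes;
	return NULL;
}
```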



[PATCH 14/14] staging/rdma/hfi1: Enable TID caching feature

2015-12-16 Thread ira . weiny
From: Mitko Haralanov 

This commit "flips the switch" on the TID caching feature
implemented in this patch series.

As well as enabling the new feature by tying the new functions
into the PSM API, it also cleans up the old, unneeded code,
data structure members, and variables.

Due to differences in operation and information, the tracing
functions related to expected receives had to be changed. This
patch includes those changes.

The tracing function changes could not be split into a separate
commit without including both tracing variants at the same time.
This would have caused other complications and ugliness.

Reviewed-by: Ira Weiny 
Signed-off-by: Mitko Haralanov 
---
 drivers/staging/rdma/hfi1/file_ops.c | 448 +++
 drivers/staging/rdma/hfi1/hfi.h  |  14 -
 drivers/staging/rdma/hfi1/init.c |   3 -
 drivers/staging/rdma/hfi1/trace.h| 132 +
 drivers/staging/rdma/hfi1/user_exp_rcv.c |  12 +
 drivers/staging/rdma/hfi1/user_pages.c   |  14 -
 include/uapi/rdma/hfi/hfi1_user.h|   7 +-
 7 files changed, 132 insertions(+), 498 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/file_ops.c 
b/drivers/staging/rdma/hfi1/file_ops.c
index b0348263b901..d36588934f99 100644
--- a/drivers/staging/rdma/hfi1/file_ops.c
+++ b/drivers/staging/rdma/hfi1/file_ops.c
@@ -96,9 +96,6 @@ static int user_event_ack(struct hfi1_ctxtdata *, int, 
unsigned long);
 static int set_ctxt_pkey(struct hfi1_ctxtdata *, unsigned, u16);
 static int manage_rcvq(struct hfi1_ctxtdata *, unsigned, int);
 static int vma_fault(struct vm_area_struct *, struct vm_fault *);
-static int exp_tid_setup(struct file *, struct hfi1_tid_info *);
-static int exp_tid_free(struct file *, struct hfi1_tid_info *);
-static void unlock_exp_tids(struct hfi1_ctxtdata *);
 
 static const struct file_operations hfi1_file_ops = {
.owner = THIS_MODULE,
@@ -188,6 +185,7 @@ static ssize_t hfi1_file_write(struct file *fp, const char 
__user *data,
struct hfi1_cmd cmd;
struct hfi1_user_info uinfo;
struct hfi1_tid_info tinfo;
+   unsigned long addr;
ssize_t consumed = 0, copy = 0, ret = 0;
void *dest = NULL;
__u64 user_val = 0;
@@ -219,6 +217,7 @@ static ssize_t hfi1_file_write(struct file *fp, const char 
__user *data,
break;
case HFI1_CMD_TID_UPDATE:
case HFI1_CMD_TID_FREE:
+   case HFI1_CMD_TID_INVAL_READ:
copy = sizeof(tinfo);
dest = &tinfo;
break;
@@ -241,7 +240,6 @@ static ssize_t hfi1_file_write(struct file *fp, const char 
__user *data,
must_be_root = 1;   /* validate user */
copy = 0;
break;
-   case HFI1_CMD_TID_INVAL_READ:
default:
ret = -EINVAL;
goto bail;
@@ -295,9 +293,8 @@ static ssize_t hfi1_file_write(struct file *fp, const char 
__user *data,
sc_return_credits(uctxt->sc);
break;
case HFI1_CMD_TID_UPDATE:
-   ret = exp_tid_setup(fp, &tinfo);
+   ret = hfi1_user_exp_rcv_setup(fp, &tinfo);
if (!ret) {
-   unsigned long addr;
/*
 * Copy the number of tidlist entries we used
 * and the length of the buffer we registered.
@@ -312,8 +309,25 @@ static ssize_t hfi1_file_write(struct file *fp, const char 
__user *data,
ret = -EFAULT;
}
break;
+   case HFI1_CMD_TID_INVAL_READ:
+   ret = hfi1_user_exp_rcv_invalid(fp, &tinfo);
+   if (ret)
+   break;
+   addr = (unsigned long)cmd.addr +
+   offsetof(struct hfi1_tid_info, tidcnt);
+   if (copy_to_user((void __user *)addr, &tinfo.tidcnt,
+sizeof(tinfo.tidcnt)))
+   ret = -EFAULT;
+   break;
case HFI1_CMD_TID_FREE:
-   ret = exp_tid_free(fp, &tinfo);
+   ret = hfi1_user_exp_rcv_clear(fp, &tinfo);
+   if (ret)
+   break;
+   addr = (unsigned long)cmd.addr +
+   offsetof(struct hfi1_tid_info, tidcnt);
+   if (copy_to_user((void __user *)addr, &tinfo.tidcnt,
+sizeof(tinfo.tidcnt)))
+   ret = -EFAULT;
break;
case HFI1_CMD_RECV_CTRL:
ret = manage_rcvq(uctxt, fd->subctxt, (int)user_val);
@@ -779,12 +793,9 @@ static int hfi1_file_close(struct inode *inode, struct 
file *fp)
uctxt->pionowait = 0;
uctxt->event_flags = 0;
 
-   hfi1_clear_tids(uctxt);
+   hfi1_user_exp_rcv_free(fdata);
hfi1_clear_ctxt_pkey(dd, uctxt->ctxt);
 
-   if (uctxt->tid_pg_list)
-   unlock_exp_tids(uctxt);
-
hfi
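[Editorial aside: the TID_FREE/TID_INVAL_READ handlers above update only the tidcnt member inside the user's struct, computing its location with offsetof() relative to the user-supplied buffer address. A userspace sketch of that idiom, with memcpy() standing in for copy_to_user() and a trimmed-down analogue of struct hfi1_tid_info:]

```c
#include <stddef.h>
#include <string.h>

/* trimmed-down stand-in for struct hfi1_tid_info */
struct tid_info {
	unsigned long vaddr;
	unsigned int tidcnt;
};

/*
 * Write back only the tidcnt field of the caller's structure by
 * offsetting into the caller-provided buffer, leaving the other
 * members untouched (memcpy() plays the role of copy_to_user()).
 */
static void copy_tidcnt(void *user_buf, unsigned int tidcnt)
{
	unsigned char *addr = (unsigned char *)user_buf +
			      offsetof(struct tid_info, tidcnt);

	memcpy(addr, &tidcnt, sizeof(tidcnt));
}
```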

[PATCH 13/14] staging/rdma/hfi1: Add TID entry program function body

2015-12-16 Thread ira . weiny
From: Mitko Haralanov 

The previous patch in the series added the free/invalidate
function bodies. Now, it's time for the programming side.

This large function takes the user's buffer, breaks it up
into manageable chunks, allocates enough RcvArray groups
and programs the chunks into the RcvArray entries in the
hardware.

With this function, the TID caching functionality is implemented.
However, it is still unused. The switch will come in a later
patch in the series, which will remove the old functionality and
switch the driver over to TID caching.

Reviewed-by: Ira Weiny 
Signed-off-by: Mitko Haralanov 
---
 drivers/staging/rdma/hfi1/user_exp_rcv.c | 263 ++-
 1 file changed, 259 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/user_exp_rcv.c 
b/drivers/staging/rdma/hfi1/user_exp_rcv.c
index 91950d225da5..6d21c1349b77 100644
--- a/drivers/staging/rdma/hfi1/user_exp_rcv.c
+++ b/drivers/staging/rdma/hfi1/user_exp_rcv.c
@@ -97,8 +97,7 @@ struct tid_pageset {
 
 static void unlock_exp_tids(struct hfi1_ctxtdata *, struct exp_tid_set *,
struct rb_root *);
-static u32 find_phys_blocks(struct page **, unsigned,
-   struct tid_pageset *) __maybe_unused;
+static u32 find_phys_blocks(struct page **, unsigned, struct tid_pageset *);
 static int set_rcvarray_entry(struct file *, unsigned long, u32,
  struct tid_group *, struct page **, unsigned);
 static inline int mmu_addr_cmp(struct mmu_rb_node *, unsigned long,
@@ -119,7 +118,7 @@ static inline void mmu_notifier_range_start(struct 
mmu_notifier *,
unsigned long, unsigned long);
 static int program_rcvarray(struct file *, unsigned long, struct tid_group *,
struct tid_pageset *, unsigned, u16, struct page **,
-   u32 *, unsigned *, unsigned *) __maybe_unused;
+   u32 *, unsigned *, unsigned *);
 static int unprogram_rcvarray(struct file *, u32, struct tid_group **);
 static void clear_tid_node(struct hfi1_filedata *, u16, struct mmu_rb_node *);
 
@@ -339,9 +338,265 @@ static inline void rcv_array_wc_fill(struct hfi1_devdata 
*dd, u32 index)
writeq(0, dd->rcvarray_wc + (index * 8));
 }
 
+/*
+ * RcvArray entry allocation for Expected Receives is done by the
+ * following algorithm:
+ *
+ * The context keeps 3 lists of groups of RcvArray entries:
+ *   1. List of empty groups - tid_group_list
+ *  This list is created during user context creation and
+ *  contains elements which describe sets (of 8) of empty
+ *  RcvArray entries.
+ *   2. List of partially used groups - tid_used_list
+ *  This list contains sets of RcvArray entries which are
+ *  not completely used up. Another mapping request could
+ *  use some or all of the remaining entries.
+ *   3. List of full groups - tid_full_list
+ *  This is the list where sets that are completely used
+ *  up go.
+ *
+ * An attempt to optimize the usage of RcvArray entries is
+ * made by finding all sets of physically contiguous pages in a
+ * user's buffer.
+ * These physically contiguous sets are further split into
+ * sizes supported by the receive engine of the HFI. The
+ * resulting sets of pages are stored in struct tid_pageset,
+ * which describes the sets as:
+ *   * .count - number of pages in this set
+ *   * .idx - starting index into the struct page ** array
+ *     of this set
+ *
+ * From this point on, the algorithm deals with the page sets
+ * described above. The number of pagesets is divided by the
+ * RcvArray group size to produce the number of full groups
+ * needed.
+ *
+ * Groups from the 3 lists are manipulated using the following
+ * rules:
+ *   1. For each set of 8 pagesets, a complete group from
+ *  tid_group_list is taken, programmed, and moved to
+ *  the tid_full_list list.
+ *   2. For all remaining pagesets:
+ *  2.1 If the tid_used_list is empty and the tid_group_list
+ *  is empty, stop processing pagesets and return only
+ *  what has been programmed up to this point.
+ *  2.2 If the tid_used_list is empty and the tid_group_list
+ *  is not empty, move a group from tid_group_list to
+ *  tid_used_list.
+ *  2.3 For each group in tid_used_list, program as much as
+ *  can fit into the group. If the group becomes fully
+ *  used, move it to tid_full_list.
+ */
 int hfi1_user_exp_rcv_setup(struct file *fp, struct hfi1_tid_info *tinfo)
 {
-   return -EINVAL;
+   int ret = 0, need_group = 0, pinned;
+   struct hfi1_filedata *fd = fp->private_data;
+   struct hfi1_ctxtdata *uctxt = fd->uctxt;
+   struct hfi1_devdata *dd = uctxt->dd;
+   unsigned npages, ngroups, pageidx = 0, pageset_count, npagesets,
+   tididx = 0, mapped, mapped_pages = 0;
+   unsigned long vaddr = tinfo->vaddr;
+

[PATCH 11/14] staging/rdma/hfi1: Add MMU notifier callback function

2015-12-16 Thread ira . weiny
From: Mitko Haralanov 

TID caching will rely on the MMU notifier to be told
when memory is being invalidated. When the callback
is called, the driver will find all RcvArray entries
that span the invalidated buffer and "schedule" them
to be freed by the PSM library.

This function is currently unused and is being added
in preparation for the TID caching feature.

Reviewed-by: Ira Weiny 
Signed-off-by: Mitko Haralanov 
---
 drivers/staging/rdma/hfi1/user_exp_rcv.c | 67 +++-
 1 file changed, 65 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/user_exp_rcv.c 
b/drivers/staging/rdma/hfi1/user_exp_rcv.c
index b303182be08a..7996ce763adf 100644
--- a/drivers/staging/rdma/hfi1/user_exp_rcv.c
+++ b/drivers/staging/rdma/hfi1/user_exp_rcv.c
@@ -104,7 +104,7 @@ static int set_rcvarray_entry(struct file *, unsigned long, 
u32,
 static inline int mmu_addr_cmp(struct mmu_rb_node *, unsigned long,
   unsigned long);
 static struct mmu_rb_node *mmu_rb_search_by_addr(struct rb_root *,
-unsigned long) __maybe_unused;
+unsigned long);
 static inline struct mmu_rb_node *mmu_rb_search_by_entry(struct rb_root *,
 u32);
 static int mmu_rb_insert_by_addr(struct rb_root *, struct mmu_rb_node *);
@@ -656,7 +656,70 @@ static void mmu_notifier_mem_invalidate(struct 
mmu_notifier *mn,
unsigned long start, unsigned long end,
enum mmu_call_types type)
 {
-   /* Stub for now */
+   struct hfi1_filedata *fd = container_of(mn, struct hfi1_filedata, mn);
+   struct hfi1_ctxtdata *uctxt = fd->uctxt;
+   struct rb_root *root = &fd->tid_rb_root;
+   struct mmu_rb_node *node;
+   unsigned long addr = start;
+
+   spin_lock(&fd->rb_lock);
+   while (addr < end) {
+   node = mmu_rb_search_by_addr(root, addr);
+
+   if (!node) {
+   /*
+* Didn't find a node at this address. However, the
+* range could be bigger than what we have registered
+* so we have to keep looking.
+*/
+   addr += PAGE_SIZE;
+   continue;
+   }
+
+   /*
+* The next address to be looked up is computed based
+* on the node's starting address. This is due to the
+* fact that the range where we start might be in the
+* middle of the node's buffer so simply incrementing
+* the address by the node's size would result in a
+* bad address.
+*/
+   addr = node->virt + (node->npages * PAGE_SIZE);
+   if (node->freed)
+   continue;
+
+   node->freed = true;
+
+   spin_lock(&fd->invalid_lock);
+   if (fd->invalid_tid_idx < uctxt->expected_count) {
+   fd->invalid_tids[fd->invalid_tid_idx] =
+   rcventry2tidinfo(node->rcventry -
+uctxt->expected_base);
+   fd->invalid_tids[fd->invalid_tid_idx] |=
+   EXP_TID_SET(LEN, node->npages);
+   if (!fd->invalid_tid_idx) {
+   unsigned long *ev;
+
+   /*
+* hfi1_set_uevent_bits() sets a user event flag
+* for all processes. Because calling into the
+* driver to process TID cache invalidations is
+* expensive and TID cache invalidations are
+* handled on a per-process basis, we can
+* optimize this to set the flag only for the
+* process in question.
+*/
+   ev = uctxt->dd->events +
+   (((uctxt->ctxt -
+  uctxt->dd->first_user_ctxt) *
+ HFI1_MAX_SHARED_CTXTS) + fd->subctxt);
+   set_bit(_HFI1_EVENT_TID_MMU_NOTIFY_BIT, ev);
+   }
+   fd->invalid_tid_idx++;
+   }
+   spin_unlock(&fd->invalid_lock);
+   }
+   spin_unlock(&fd->rb_lock);
 }
 
 static inline int mmu_addr_cmp(struct mmu_rb_node *node, unsigned long addr,
-- 
1.8.2
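[Editorial aside: the heart of the callback above is its walk of the invalidated range — advance one page when no registered buffer covers the current address, otherwise jump past the buffer's end, computed from the buffer's own start because the range may begin in the middle of it. A userspace sketch of that walk, with region_lookup() as a linear stand-in for the driver's mmu_rb_search_by_addr() RB-tree search:]

```c
#include <stddef.h>

#define PAGE_SIZE 4096UL

struct region {
	unsigned long virt;	/* start of registered buffer */
	unsigned long npages;
	int freed;
};

/* linear stand-in for the RB-tree address lookup */
static struct region *region_lookup(struct region *regs, size_t n,
				    unsigned long addr)
{
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long end = regs[i].virt +
				    regs[i].npages * PAGE_SIZE;

		if (addr >= regs[i].virt && addr < end)
			return &regs[i];
	}
	return NULL;
}

/* mark every region overlapping [start, end) freed; return the count */
static int invalidate_range(struct region *regs, size_t n,
			    unsigned long start, unsigned long end)
{
	unsigned long addr = start;
	int freed = 0;

	while (addr < end) {
		struct region *r = region_lookup(regs, n, addr);

		if (!r) {
			/* no buffer here; keep scanning page by page */
			addr += PAGE_SIZE;
			continue;
		}
		/* jump past the buffer, based on its own start address */
		addr = r->virt + r->npages * PAGE_SIZE;
		if (r->freed)
			continue;
		r->freed = 1;
		freed++;
	}
	return freed;
}
```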



[PATCH 10/14] staging/rdma/hfi1: Add Expected receive init and free functions

2015-12-16 Thread ira . weiny
From: Mitko Haralanov 

The upcoming TID caching feature requires different data
structures and, by extension, different initialization for each
of the MPI processes.

The two new functions (currently unused) perform the required
initialization and the freeing of the associated resources and
structures.

Reviewed-by: Ira Weiny 
Signed-off-by: Mitko Haralanov 
---
 drivers/staging/rdma/hfi1/user_exp_rcv.c | 154 +--
 1 file changed, 144 insertions(+), 10 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/user_exp_rcv.c 
b/drivers/staging/rdma/hfi1/user_exp_rcv.c
index 0eb888fcaf70..b303182be08a 100644
--- a/drivers/staging/rdma/hfi1/user_exp_rcv.c
+++ b/drivers/staging/rdma/hfi1/user_exp_rcv.c
@@ -90,23 +90,25 @@ struct tid_pageset {
 
 #define EXP_TID_SET_EMPTY(set) (set.count == 0 && list_empty(&set.list))
 
+#define num_user_pages(vaddr, len)    \
+   (1 + (((((unsigned long)(vaddr) +  \
+(unsigned long)(len) - 1) & PAGE_MASK) -  \
+  ((unsigned long)vaddr & PAGE_MASK)) >> PAGE_SHIFT))
+
 static void unlock_exp_tids(struct hfi1_ctxtdata *, struct exp_tid_set *,
-   struct rb_root *) __maybe_unused;
+   struct rb_root *);
 static u32 find_phys_blocks(struct page **, unsigned,
struct tid_pageset *) __maybe_unused;
 static int set_rcvarray_entry(struct file *, unsigned long, u32,
- struct tid_group *, struct page **,
- unsigned) __maybe_unused;
+ struct tid_group *, struct page **, unsigned);
 static inline int mmu_addr_cmp(struct mmu_rb_node *, unsigned long,
   unsigned long);
 static struct mmu_rb_node *mmu_rb_search_by_addr(struct rb_root *,
 unsigned long) __maybe_unused;
 static inline struct mmu_rb_node *mmu_rb_search_by_entry(struct rb_root *,
 u32);
-static int mmu_rb_insert_by_addr(struct rb_root *,
-struct mmu_rb_node *) __maybe_unused;
-static int mmu_rb_insert_by_entry(struct rb_root *,
- struct mmu_rb_node *) __maybe_unused;
+static int mmu_rb_insert_by_addr(struct rb_root *, struct mmu_rb_node *);
+static int mmu_rb_insert_by_entry(struct rb_root *, struct mmu_rb_node *);
 static void mmu_notifier_mem_invalidate(struct mmu_notifier *,
unsigned long, unsigned long,
enum mmu_call_types);
@@ -168,7 +170,7 @@ static inline void tid_group_move(struct tid_group *group,
tid_group_add_tail(group, s2);
 }
 
-static struct mmu_notifier_ops __maybe_unused mn_opts = {
+static struct mmu_notifier_ops mn_opts = {
.invalidate_page = mmu_notifier_page,
.invalidate_range_start = mmu_notifier_range_start,
 };
@@ -180,12 +182,144 @@ static struct mmu_notifier_ops __maybe_unused mn_opts = {
  */
 int hfi1_user_exp_rcv_init(struct file *fp)
 {
-   return -EINVAL;
+   struct hfi1_filedata *fd = fp->private_data;
+   struct hfi1_ctxtdata *uctxt = fd->uctxt;
+   struct hfi1_devdata *dd = uctxt->dd;
+   unsigned tidbase;
+   int i, ret = 0;
+
+   INIT_HLIST_NODE(&fd->mn.hlist);
+   spin_lock_init(&fd->rb_lock);
+   spin_lock_init(&fd->tid_lock);
+   spin_lock_init(&fd->invalid_lock);
+   fd->mn.ops = &mn_opts;
+   fd->tid_rb_root = RB_ROOT;
+
+   if (!uctxt->subctxt_cnt || !fd->subctxt) {
+   exp_tid_group_init(&uctxt->tid_group_list);
+   exp_tid_group_init(&uctxt->tid_used_list);
+   exp_tid_group_init(&uctxt->tid_full_list);
+
+   tidbase = uctxt->expected_base;
+   for (i = 0; i < uctxt->expected_count /
+dd->rcv_entries.group_size; i++) {
+   struct tid_group *grp;
+
+   grp = kzalloc(sizeof(*grp), GFP_KERNEL);
+   if (!grp) {
+   /*
+* If we fail here, the groups already
+* allocated will be freed by the close
+* call.
+*/
+   ret = -ENOMEM;
+   goto done;
+   }
+   grp->size = dd->rcv_entries.group_size;
+   grp->base = tidbase;
+   tid_group_add_tail(grp, &uctxt->tid_group_list);
+   tidbase += dd->rcv_entries.group_size;
+   }
+   }
+
+   if (!HFI1_CAP_IS_USET(TID_UNMAP)) {
+   fd->invalid_tid_idx = 0;
+   fd->invalid_tids = kzalloc(uctxt->expected_count *
+  

[PATCH 12/14] staging/rdma/hfi1: Add TID free/clear function bodies

2015-12-16 Thread ira . weiny
From: Mitko Haralanov 

Up to now, the functions which cleared the programmed
TID entries and gave PSM the list of invalidated TID entries
were just stubs. With this commit, the bodies of these
functions are added.

This commit is a bit asymmetric as it only contains the
free code path. This is done on purpose to help with patch
reviews as the programming code path is much longer.

Reviewed-by: Ira Weiny 
Signed-off-by: Mitko Haralanov 
---
 drivers/staging/rdma/hfi1/user_exp_rcv.c | 91 +---
 1 file changed, 85 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/user_exp_rcv.c 
b/drivers/staging/rdma/hfi1/user_exp_rcv.c
index 7996ce763adf..91950d225da5 100644
--- a/drivers/staging/rdma/hfi1/user_exp_rcv.c
+++ b/drivers/staging/rdma/hfi1/user_exp_rcv.c
@@ -120,10 +120,8 @@ static inline void mmu_notifier_range_start(struct 
mmu_notifier *,
 static int program_rcvarray(struct file *, unsigned long, struct tid_group *,
struct tid_pageset *, unsigned, u16, struct page **,
u32 *, unsigned *, unsigned *) __maybe_unused;
-static int unprogram_rcvarray(struct file *, u32,
- struct tid_group **) __maybe_unused;
-static void clear_tid_node(struct hfi1_filedata *, u16,
-  struct mmu_rb_node *) __maybe_unused;
+static int unprogram_rcvarray(struct file *, u32, struct tid_group **);
+static void clear_tid_node(struct hfi1_filedata *, u16, struct mmu_rb_node *);
 
 static inline u32 rcventry2tidinfo(u32 rcventry)
 {
@@ -264,6 +262,7 @@ int hfi1_user_exp_rcv_init(struct file *fp)
 * Make sure that we set the tid counts only after successful
 * init.
 */
+   spin_lock(&fd->tid_lock);
if (uctxt->subctxt_cnt && !HFI1_CAP_IS_USET(TID_UNMAP)) {
u16 remainder;
 
@@ -274,6 +273,7 @@ int hfi1_user_exp_rcv_init(struct file *fp)
} else {
fd->tid_limit = uctxt->expected_count;
}
+   spin_unlock(&fd->tid_lock);
 done:
return ret;
 }
@@ -346,12 +346,91 @@ int hfi1_user_exp_rcv_setup(struct file *fp, struct 
hfi1_tid_info *tinfo)
 
 int hfi1_user_exp_rcv_clear(struct file *fp, struct hfi1_tid_info *tinfo)
 {
-   return -EINVAL;
+   int ret = 0;
+   struct hfi1_filedata *fd = fp->private_data;
+   struct hfi1_ctxtdata *uctxt = fd->uctxt;
+   u32 *tidinfo;
+   unsigned tididx;
+
+   tidinfo = kcalloc(tinfo->tidcnt, sizeof(*tidinfo), GFP_KERNEL);
+   if (!tidinfo)
+   return -ENOMEM;
+
+   if (copy_from_user(tidinfo, (void __user *)(unsigned long)
+  tinfo->tidlist, sizeof(tidinfo[0]) *
+  tinfo->tidcnt)) {
+   ret = -EFAULT;
+   goto done;
+   }
+
+   mutex_lock(&uctxt->exp_lock);
+   for (tididx = 0; tididx < tinfo->tidcnt; tididx++) {
+   ret = unprogram_rcvarray(fp, tidinfo[tididx], NULL);
+   if (ret) {
+   hfi1_cdbg(TID, "Failed to unprogram rcv array %d",
+ ret);
+   break;
+   }
+   }
+   spin_lock(&fd->tid_lock);
+   fd->tid_used -= tididx;
+   spin_unlock(&fd->tid_lock);
+   tinfo->tidcnt = tididx;
+   mutex_unlock(&uctxt->exp_lock);
+done:
+   kfree(tidinfo);
+   return ret;
 }
 
 int hfi1_user_exp_rcv_invalid(struct file *fp, struct hfi1_tid_info *tinfo)
 {
-   return -EINVAL;
+   struct hfi1_filedata *fd = fp->private_data;
+   struct hfi1_ctxtdata *uctxt = fd->uctxt;
+   unsigned long *ev = uctxt->dd->events +
+   (((uctxt->ctxt - uctxt->dd->first_user_ctxt) *
+ HFI1_MAX_SHARED_CTXTS) + fd->subctxt);
+   u32 *array;
+   int ret = 0;
+
+   if (!fd->invalid_tids)
+   return -EINVAL;
+
+   /*
+* copy_to_user() can sleep, which will leave the invalid_lock
+* locked and cause the MMU notifier to be blocked on the lock
+* for a long time.
+* Copy the data to a local buffer so we can release the lock.
+*/
+   array = kcalloc(uctxt->expected_count, sizeof(*array), GFP_KERNEL);
+   if (!array)
+   return -EFAULT;
+
+   spin_lock(&fd->invalid_lock);
+   if (fd->invalid_tid_idx) {
+   memcpy(array, fd->invalid_tids, sizeof(*array) *
+  fd->invalid_tid_idx);
+   memset(fd->invalid_tids, 0, sizeof(*fd->invalid_tids) *
+  fd->invalid_tid_idx);
+   tinfo->tidcnt = fd->invalid_tid_idx;
+   fd->invalid_tid_idx = 0;
+   /*
+* Reset the user flag while still holding the lock.
+* Otherwise, PSM can miss events.
+*/
+   clear_bit(_HFI1_EVENT_TID_MMU_NOTIFY_BIT, ev);
+   } else {
+   tinfo->tidcnt = 0;
+   }
+   spi
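[Editorial aside: hfi1_user_exp_rcv_invalid() above demonstrates a useful locking pattern — copy_to_user() can sleep, and the invalid_lock is a spinlock also taken from the MMU notifier, so the driver snapshots the shared array into a local buffer under the lock, resets it, drops the lock, and only then copies to userspace. A userspace sketch, with pthread_mutex_t standing in for the kernel spinlock and memcpy() for the eventual copy_to_user():]

```c
#include <pthread.h>
#include <string.h>
#include <stdint.h>

#define MAX_TIDS 64

struct shared_state {
	pthread_mutex_t lock;		/* stands in for the spinlock */
	uint32_t invalid_tids[MAX_TIDS];
	unsigned int invalid_idx;	/* entries currently queued */
};

/*
 * Snapshot and reset the queued entries under the lock; the caller
 * may then perform a potentially sleeping copy (copy_to_user() in the
 * kernel) from `out` without holding the lock.
 */
static unsigned int drain_invalid_tids(struct shared_state *s,
				       uint32_t *out)
{
	unsigned int n;

	pthread_mutex_lock(&s->lock);
	n = s->invalid_idx;
	memcpy(out, s->invalid_tids, n * sizeof(*out));	/* snapshot */
	s->invalid_idx = 0;				/* reset queue */
	pthread_mutex_unlock(&s->lock);

	return n;
}
```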

[PATCH 02/14] uapi/rdma/hfi/hfi1_user.h: Correct comment for capability bit

2015-12-16 Thread ira . weiny
From: Mitko Haralanov 

The HFI1_CAP_TID_UNMAP comment incorrectly implied the
opposite of what the capability actually does. Correct this error.

Reviewed-by: Ira Weiny 
Signed-off-by: Mitko Haralanov 
---
 include/uapi/rdma/hfi/hfi1_user.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/rdma/hfi/hfi1_user.h 
b/include/uapi/rdma/hfi/hfi1_user.h
index 288694e422fb..cf172718e3d5 100644
--- a/include/uapi/rdma/hfi/hfi1_user.h
+++ b/include/uapi/rdma/hfi/hfi1_user.h
@@ -93,7 +93,7 @@
 #define HFI1_CAP_MULTI_PKT_EGR(1UL <<  7) /* Enable multi-packet Egr 
buffs*/
 #define HFI1_CAP_NODROP_RHQ_FULL  (1UL <<  8) /* Don't drop on Hdr Q full */
 #define HFI1_CAP_NODROP_EGR_FULL  (1UL <<  9) /* Don't drop on EGR buffs full 
*/
-#define HFI1_CAP_TID_UNMAP(1UL << 10) /* Enable Expected TID caching */
+#define HFI1_CAP_TID_UNMAP(1UL << 10) /* Disable Expected TID caching 
*/
 #define HFI1_CAP_PRINT_UNIMPL (1UL << 11) /* Show for unimplemented feats 
*/
 #define HFI1_CAP_ALLOW_PERM_JKEY  (1UL << 12) /* Allow use of permissive JKEY 
*/
 #define HFI1_CAP_NO_INTEGRITY (1UL << 13) /* Enable ctxt integrity checks 
*/
-- 
1.8.2



[PATCH 06/14] staging/rdma/hfi1: Remove un-needed variable

2015-12-16 Thread ira . weiny
From: Mitko Haralanov 

There is no need to use a separate variable for a
return value and a label when returning right away
would do just as well.

Reviewed-by: Ira Weiny 
Signed-off-by: Mitko Haralanov 
---
 drivers/staging/rdma/hfi1/file_ops.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/file_ops.c 
b/drivers/staging/rdma/hfi1/file_ops.c
index c66693532be0..76fe60315bb4 100644
--- a/drivers/staging/rdma/hfi1/file_ops.c
+++ b/drivers/staging/rdma/hfi1/file_ops.c
@@ -1037,22 +1037,19 @@ static int allocate_ctxt(struct file *fp, struct 
hfi1_devdata *dd,
 static int init_subctxts(struct hfi1_ctxtdata *uctxt,
 const struct hfi1_user_info *uinfo)
 {
-   int ret = 0;
unsigned num_subctxts;
 
num_subctxts = uinfo->subctxt_cnt;
-   if (num_subctxts > HFI1_MAX_SHARED_CTXTS) {
-   ret = -EINVAL;
-   goto bail;
-   }
+   if (num_subctxts > HFI1_MAX_SHARED_CTXTS)
+   return -EINVAL;
 
uctxt->subctxt_cnt = uinfo->subctxt_cnt;
uctxt->subctxt_id = uinfo->subctxt_id;
uctxt->active_slaves = 1;
uctxt->redirect_seq_cnt = 1;
set_bit(HFI1_CTXT_MASTER_UNINIT, &uctxt->event_flags);
-bail:
-   return ret;
+
+   return 0;
 }
 
 static int setup_subctxt(struct hfi1_ctxtdata *uctxt)
-- 
1.8.2



[PATCH 05/14] staging/rdma/hfi1: Add definitions needed for TID caching support

2015-12-16 Thread ira . weiny
From: Mitko Haralanov 

In preparation for adding the TID caching support, there is a set
of headers, structures, and variables which will be needed. This
commit adds them to the hfi.h header file.

Reviewed-by: Ira Weiny 
Signed-off-by: Mitko Haralanov 
---
 drivers/staging/rdma/hfi1/hfi.h | 20 
 1 file changed, 20 insertions(+)

diff --git a/drivers/staging/rdma/hfi1/hfi.h b/drivers/staging/rdma/hfi1/hfi.h
index 057a41cee734..996dd520cf41 100644
--- a/drivers/staging/rdma/hfi1/hfi.h
+++ b/drivers/staging/rdma/hfi1/hfi.h
@@ -179,6 +179,11 @@ struct ctxt_eager_bufs {
} *rcvtids;
 };
 
+struct exp_tid_set {
+   struct list_head list;
+   u32 count;
+};
+
 struct hfi1_ctxtdata {
/* shadow the ctxt's RcvCtrl register */
u64 rcvctrl;
@@ -247,6 +252,11 @@ struct hfi1_ctxtdata {
struct page **tid_pg_list;
/* dma handles for exp tid pages */
dma_addr_t *physshadow;
+
+   struct exp_tid_set tid_group_list;
+   struct exp_tid_set tid_used_list;
+   struct exp_tid_set tid_full_list;
+
/* lock protecting all Expected TID data */
spinlock_t exp_lock;
/* number of pio bufs for this ctxt (all procs, if shared) */
@@ -1138,6 +1148,16 @@ struct hfi1_filedata {
struct hfi1_user_sdma_pkt_q *pq;
/* for cpu affinity; -1 if none */
int rec_cpu_num;
+   struct mmu_notifier mn;
+   struct rb_root tid_rb_root;
+   spinlock_t tid_lock; /* protect tid_[limit,used] counters */
+   u32 tid_limit;
+   u32 tid_used;
+   spinlock_t rb_lock; /* protect tid_rb_root RB tree */
+   u32 *invalid_tids;
+   u32 invalid_tid_idx;
+   spinlock_t invalid_lock; /* protect the invalid_tids array */
+   int (*mmu_rb_insert)(struct rb_root *, struct mmu_rb_node *);
 };
 
 extern struct list_head hfi1_dev_list;
-- 
1.8.2



[PATCH 07/14] staging/rdma/hfi1: Add definitions and support functions for TID groups

2015-12-16 Thread ira . weiny
From: Mitko Haralanov 

Definitions and functions used to manage sets of TID/RcvArray groups.
These will be used by the TID caching functionality coming in
later patches.

TID groups (or RcvArray groups) are groups of TID/RcvArray entries
organized in sets of 8 and aligned on cacheline boundaries. The
TID/RcvArray entries are managed in this way to make taking
advantage of write-combining easier - each group is an entire
cacheline.

rcv_array_wc_fill() is provided to allow generating writes to
TIDs which are not currently being used in order to force a
flush of the write-combining buffer.

Reviewed-by: Ira Weiny 
Signed-off-by: Mitko Haralanov 
---
 drivers/staging/rdma/hfi1/user_exp_rcv.c | 64 
 1 file changed, 64 insertions(+)

diff --git a/drivers/staging/rdma/hfi1/user_exp_rcv.c 
b/drivers/staging/rdma/hfi1/user_exp_rcv.c
index bafeddf67c8f..7f15024daab9 100644
--- a/drivers/staging/rdma/hfi1/user_exp_rcv.c
+++ b/drivers/staging/rdma/hfi1/user_exp_rcv.c
@@ -52,6 +52,14 @@
 #include "user_exp_rcv.h"
 #include "trace.h"
 
+struct tid_group {
+   struct list_head list;
+   unsigned base;
+   u8 size;
+   u8 used;
+   u8 map;
+};
+
 struct mmu_rb_node {
struct rb_node rbnode;
unsigned long virt;
@@ -75,6 +83,8 @@ static const char * const mmu_types[] = {
"RANGE"
 };
 
+#define EXP_TID_SET_EMPTY(set) (set.count == 0 && list_empty(&set.list))
+
 static inline int mmu_addr_cmp(struct mmu_rb_node *, unsigned long,
   unsigned long);
 static struct mmu_rb_node *mmu_rb_search_by_addr(struct rb_root *,
@@ -94,6 +104,43 @@ static inline void mmu_notifier_range_start(struct 
mmu_notifier *,
struct mm_struct *,
unsigned long, unsigned long);
 
+static inline void exp_tid_group_init(struct exp_tid_set *set)
+{
+   INIT_LIST_HEAD(&set->list);
+   set->count = 0;
+}
+
+static inline void tid_group_remove(struct tid_group *grp,
+   struct exp_tid_set *set)
+{
+   list_del_init(&grp->list);
+   set->count--;
+}
+
+static inline void tid_group_add_tail(struct tid_group *grp,
+ struct exp_tid_set *set)
+{
+   list_add_tail(&grp->list, &set->list);
+   set->count++;
+}
+
+static inline struct tid_group *tid_group_pop(struct exp_tid_set *set)
+{
+   struct tid_group *grp =
+   list_first_entry(&set->list, struct tid_group, list);
+   list_del_init(&grp->list);
+   set->count--;
+   return grp;
+}
+
+static inline void tid_group_move(struct tid_group *group,
+ struct exp_tid_set *s1,
+ struct exp_tid_set *s2)
+{
+   tid_group_remove(group, s1);
+   tid_group_add_tail(group, s2);
+}
+
 static struct mmu_notifier_ops __maybe_unused mn_opts = {
.invalidate_page = mmu_notifier_page,
.invalidate_range_start = mmu_notifier_range_start,
@@ -114,6 +161,23 @@ int hfi1_user_exp_rcv_free(struct hfi1_filedata *fd)
return -EINVAL;
 }
 
+/*
+ * Write an "empty" RcvArray entry.
+ * This function exists so the TID registration code can use it
+ * to write to unused/unneeded entries and still take advantage
+ * of the WC performance improvements. The HFI will ignore this
+ * write to the RcvArray entry.
+ */
+static inline void rcv_array_wc_fill(struct hfi1_devdata *dd, u32 index)
+{
+   /*
+* Doing the WC fill writes only makes sense if the device is
+* present and the RcvArray has been mapped as WC memory.
+*/
+   if ((dd->flags & HFI1_PRESENT) && dd->rcvarray_wc)
+   writeq(0, dd->rcvarray_wc + (index * 8));
+}
+
 int hfi1_user_exp_rcv_setup(struct file *fp, struct hfi1_tid_info *tinfo)
 {
return -EINVAL;
-- 
1.8.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 09/14] staging/rdma/hfi1: Convert lock to mutex

2015-12-16 Thread ira . weiny
From: Mitko Haralanov 

The exp_lock lock does not need to be a spinlock since all of
its uses are in process context, and allowing the process to
sleep when the mutex is contended might be beneficial.

Reviewed-by: Ira Weiny 
Signed-off-by: Mitko Haralanov 
---
 drivers/staging/rdma/hfi1/file_ops.c | 12 ++--
 drivers/staging/rdma/hfi1/hfi.h  |  2 +-
 drivers/staging/rdma/hfi1/init.c |  2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/file_ops.c 
b/drivers/staging/rdma/hfi1/file_ops.c
index 76fe60315bb4..b0348263b901 100644
--- a/drivers/staging/rdma/hfi1/file_ops.c
+++ b/drivers/staging/rdma/hfi1/file_ops.c
@@ -1611,14 +1611,14 @@ static int exp_tid_setup(struct file *fp, struct 
hfi1_tid_info *tinfo)
 * reserved, we don't need the lock anymore since we
 * are guaranteed the groups.
 */
-   spin_lock(&uctxt->exp_lock);
+   mutex_lock(&uctxt->exp_lock);
if (uctxt->tidusemap[useidx] == -1ULL ||
bitidx >= BITS_PER_LONG) {
/* no free groups in the set, use the next */
useidx = (useidx + 1) % uctxt->tidmapcnt;
idx++;
bitidx = 0;
-   spin_unlock(&uctxt->exp_lock);
+   mutex_unlock(&uctxt->exp_lock);
continue;
}
ngroups = ((npages - mapped) / dd->rcv_entries.group_size) +
@@ -1635,13 +1635,13 @@ static int exp_tid_setup(struct file *fp, struct 
hfi1_tid_info *tinfo)
 * as 0 because we don't check the entire bitmap but
 * we start from bitidx.
 */
-   spin_unlock(&uctxt->exp_lock);
+   mutex_unlock(&uctxt->exp_lock);
continue;
}
bits_used = min(free, ngroups);
tidmap[useidx] |= ((1ULL << bits_used) - 1) << bitidx;
uctxt->tidusemap[useidx] |= tidmap[useidx];
-   spin_unlock(&uctxt->exp_lock);
+   mutex_unlock(&uctxt->exp_lock);
 
/*
 * At this point, we know where in the map we have free bits.
@@ -1677,10 +1677,10 @@ static int exp_tid_setup(struct file *fp, struct 
hfi1_tid_info *tinfo)
 * Let go of the bits that we reserved since we are not
 * going to use them.
 */
-   spin_lock(&uctxt->exp_lock);
+   mutex_lock(&uctxt->exp_lock);
uctxt->tidusemap[useidx] &=
~(((1ULL << bits_used) - 1) << bitidx);
-   spin_unlock(&uctxt->exp_lock);
+   mutex_unlock(&uctxt->exp_lock);
goto done;
}
/*
diff --git a/drivers/staging/rdma/hfi1/hfi.h b/drivers/staging/rdma/hfi1/hfi.h
index 996dd520cf41..8ae914aab9bf 100644
--- a/drivers/staging/rdma/hfi1/hfi.h
+++ b/drivers/staging/rdma/hfi1/hfi.h
@@ -258,7 +258,7 @@ struct hfi1_ctxtdata {
struct exp_tid_set tid_full_list;
 
/* lock protecting all Expected TID data */
-   spinlock_t exp_lock;
+   struct mutex exp_lock;
/* number of pio bufs for this ctxt (all procs, if shared) */
u32 piocnt;
/* first pio buffer for this ctxt */
diff --git a/drivers/staging/rdma/hfi1/init.c b/drivers/staging/rdma/hfi1/init.c
index 98aaa0ebff51..503dc7a397a5 100644
--- a/drivers/staging/rdma/hfi1/init.c
+++ b/drivers/staging/rdma/hfi1/init.c
@@ -227,7 +227,7 @@ struct hfi1_ctxtdata *hfi1_create_ctxtdata(struct 
hfi1_pportdata *ppd, u32 ctxt)
rcd->numa_id = numa_node_id();
rcd->rcv_array_groups = dd->rcv_entries.ngroups;
 
-   spin_lock_init(&rcd->exp_lock);
+   mutex_init(&rcd->exp_lock);
 
/*
 * Calculate the context's RcvArray entry starting point.
-- 
1.8.2



[PATCH 08/14] staging/rdma/hfi1: Start adding building blocks for TID caching

2015-12-16 Thread ira . weiny
From: Mitko Haralanov 

Functions added by this patch are building blocks for the upcoming
TID caching functionality. The functions added are currently unused
(and marked as such).

The functions' purposes are to find physically contiguous pages in
the user's virtual buffer, program the RcvArray group entries with
these physical chunks, and unprogram the RcvArray groups.
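
The chunking logic this patch introduces in find_phys_blocks() - runs of
physically contiguous pages are split into power-of-two chunks no larger
than the maximum the hardware supports - can be sketched in Python. This
is an illustrative model only, not driver code; the MAX_EXPECTED_BUFFER
value below is an assumption for the example.

```python
# Illustrative sketch (not driver code): split a run of physically
# contiguous pages into the chunk sizes the hardware can accept.
PAGE_SIZE = 4096
MAX_EXPECTED_BUFFER = 2 * 1024 * 1024  # assumed value for illustration

def split_run(pagecount):
    """Break a run of contiguous pages into power-of-two chunks,
    each no larger than MAX_EXPECTED_BUFFER."""
    chunks = []
    while pagecount:
        maxpages = pagecount
        bufsize = pagecount * PAGE_SIZE
        if bufsize > MAX_EXPECTED_BUFFER:
            # constraint 1: cap each chunk at the max buffer size
            maxpages = MAX_EXPECTED_BUFFER // PAGE_SIZE
        elif bufsize & (bufsize - 1):
            # constraint 2: not a power of two - round down to one
            maxpages = 1 << (pagecount.bit_length() - 1)
        chunks.append(maxpages)
        pagecount -= maxpages
    return chunks
```

For example, a 7-page contiguous run would be programmed as 4-, 2- and
1-page entries, mirroring the loop in the (truncated) hunk below.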

Reviewed-by: Ira Weiny 
Signed-off-by: Mitko Haralanov 
---
 drivers/staging/rdma/hfi1/user_exp_rcv.c | 310 +++
 1 file changed, 310 insertions(+)

diff --git a/drivers/staging/rdma/hfi1/user_exp_rcv.c 
b/drivers/staging/rdma/hfi1/user_exp_rcv.c
index 7f15024daab9..0eb888fcaf70 100644
--- a/drivers/staging/rdma/hfi1/user_exp_rcv.c
+++ b/drivers/staging/rdma/hfi1/user_exp_rcv.c
@@ -83,8 +83,20 @@ static const char * const mmu_types[] = {
"RANGE"
 };
 
+struct tid_pageset {
+   u16 idx;
+   u16 count;
+};
+
 #define EXP_TID_SET_EMPTY(set) (set.count == 0 && list_empty(&set.list))
 
+static void unlock_exp_tids(struct hfi1_ctxtdata *, struct exp_tid_set *,
+   struct rb_root *) __maybe_unused;
+static u32 find_phys_blocks(struct page **, unsigned,
+   struct tid_pageset *) __maybe_unused;
+static int set_rcvarray_entry(struct file *, unsigned long, u32,
+ struct tid_group *, struct page **,
+ unsigned) __maybe_unused;
 static inline int mmu_addr_cmp(struct mmu_rb_node *, unsigned long,
   unsigned long);
 static struct mmu_rb_node *mmu_rb_search_by_addr(struct rb_root *,
@@ -103,6 +115,21 @@ static inline void mmu_notifier_page(struct mmu_notifier 
*, struct mm_struct *,
 static inline void mmu_notifier_range_start(struct mmu_notifier *,
struct mm_struct *,
unsigned long, unsigned long);
+static int program_rcvarray(struct file *, unsigned long, struct tid_group *,
+   struct tid_pageset *, unsigned, u16, struct page **,
+   u32 *, unsigned *, unsigned *) __maybe_unused;
+static int unprogram_rcvarray(struct file *, u32,
+ struct tid_group **) __maybe_unused;
+static void clear_tid_node(struct hfi1_filedata *, u16,
+  struct mmu_rb_node *) __maybe_unused;
+
+static inline u32 rcventry2tidinfo(u32 rcventry)
+{
+   u32 pair = rcventry & ~0x1;
+
+   return EXP_TID_SET(IDX, pair >> 1) |
+   EXP_TID_SET(CTRL, 1 << (rcventry - pair));
+}
 
 static inline void exp_tid_group_init(struct exp_tid_set *set)
 {
@@ -193,6 +220,289 @@ int hfi1_user_exp_rcv_invalid(struct file *fp, struct 
hfi1_tid_info *tinfo)
return -EINVAL;
 }
 
+static u32 find_phys_blocks(struct page **pages, unsigned npages,
+   struct tid_pageset *list)
+{
+   unsigned pagecount, pageidx, setcount = 0, i;
+   unsigned long pfn, this_pfn;
+
+   if (!npages)
+   return 0;
+
+   /*
+* Look for sets of physically contiguous pages in the user buffer.
+* This will allow us to optimize Expected RcvArray entry usage by
+* using the bigger supported sizes.
+*/
+   pfn = page_to_pfn(pages[0]);
+   for (pageidx = 0, pagecount = 1, i = 1; i <= npages; i++) {
+   this_pfn = i < npages ? page_to_pfn(pages[i]) : 0;
+
+   /*
+* If the pfn's are not sequential, pages are not physically
+* contiguous.
+*/
+   if (this_pfn != ++pfn) {
+   /*
+* At this point we have to loop over the set of
+* physically contiguous pages and break them down into
+* sizes supported by the HW.
+* There are two main constraints:
+* 1. The max buffer size is MAX_EXPECTED_BUFFER.
+*If the total set size is bigger than that
+*program only a MAX_EXPECTED_BUFFER chunk.
+* 2. The buffer size has to be a power of two. If
+*it is not, round down to the closest power of
+*2 and program that size.
+*/
+   while (pagecount) {
+   int maxpages = pagecount;
+   u32 bufsize = pagecount * PAGE_SIZE;
+
+   if (bufsize > MAX_EXPECTED_BUFFER)
+   maxpages =
+   MAX_EXPECTED_BUFFER >>
+   PAGE_SHIFT;
+   else if (!is_power_of_2(bufsize))
+   maxpages =
+   

[PATCH 01/14] staging/rdma/hfi1: Add function stubs for TID caching

2015-12-16 Thread ira . weiny
From: Mitko Haralanov 

Add mmu notify helper functions and TID caching function
stubs in preparation for the TID caching implementation.

TID caching makes use of the MMU notifier to allow the driver
to respond to the user freeing memory which is allocated to
the HFI.

This patch implements the basic MMU notifier functions to insert,
find and remove buffer pages from memory based on the mmu_notifier
being invoked.

In addition, it puts stubs in place for the main entry points that
will be used by follow-on code.

Follow-up patches will complete the implementation of the interaction
with user space and make use of these functions.

Reviewed-by: Ira Weiny 
Signed-off-by: Mitko Haralanov 
---
 drivers/staging/rdma/hfi1/Kconfig|   1 +
 drivers/staging/rdma/hfi1/Makefile   |   2 +-
 drivers/staging/rdma/hfi1/hfi.h  |   4 +
 drivers/staging/rdma/hfi1/user_exp_rcv.c | 264 +++
 drivers/staging/rdma/hfi1/user_exp_rcv.h |   8 +
 5 files changed, 278 insertions(+), 1 deletion(-)
 create mode 100644 drivers/staging/rdma/hfi1/user_exp_rcv.c

diff --git a/drivers/staging/rdma/hfi1/Kconfig 
b/drivers/staging/rdma/hfi1/Kconfig
index fd25078ee923..bd0249bcf199 100644
--- a/drivers/staging/rdma/hfi1/Kconfig
+++ b/drivers/staging/rdma/hfi1/Kconfig
@@ -1,6 +1,7 @@
 config INFINIBAND_HFI1
tristate "Intel OPA Gen1 support"
depends on X86_64
+   select MMU_NOTIFIER
default m
---help---
This is a low-level driver for Intel OPA Gen1 adapter.
diff --git a/drivers/staging/rdma/hfi1/Makefile 
b/drivers/staging/rdma/hfi1/Makefile
index 68c5a315e557..e63251b9c56b 100644
--- a/drivers/staging/rdma/hfi1/Makefile
+++ b/drivers/staging/rdma/hfi1/Makefile
@@ -10,7 +10,7 @@ obj-$(CONFIG_INFINIBAND_HFI1) += hfi1.o
 hfi1-y := chip.o cq.o device.o diag.o dma.o driver.o efivar.o eprom.o 
file_ops.o firmware.o \
init.o intr.o keys.o mad.o mmap.o mr.o pcie.o pio.o pio_copy.o \
qp.o qsfp.o rc.o ruc.o sdma.o srq.o sysfs.o trace.o twsi.o \
-   uc.o ud.o user_pages.o user_sdma.o verbs_mcast.o verbs.o
+   uc.o ud.o user_exp_rcv.o user_pages.o user_sdma.o verbs_mcast.o verbs.o
 hfi1-$(CONFIG_DEBUG_FS) += debugfs.o
 
 CFLAGS_trace.o = -I$(src)
diff --git a/drivers/staging/rdma/hfi1/hfi.h b/drivers/staging/rdma/hfi1/hfi.h
index a4a294558c03..057a41cee734 100644
--- a/drivers/staging/rdma/hfi1/hfi.h
+++ b/drivers/staging/rdma/hfi1/hfi.h
@@ -65,6 +65,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include "chip_registers.h"
 #include "common.h"
@@ -1126,6 +1128,8 @@ struct hfi1_devdata {
 #define PT_EAGER1
 #define PT_INVALID  2
 
+struct mmu_rb_node;
+
 /* Private data for file operations */
 struct hfi1_filedata {
struct hfi1_ctxtdata *uctxt;
diff --git a/drivers/staging/rdma/hfi1/user_exp_rcv.c 
b/drivers/staging/rdma/hfi1/user_exp_rcv.c
new file mode 100644
index ..bafeddf67c8f
--- /dev/null
+++ b/drivers/staging/rdma/hfi1/user_exp_rcv.c
@@ -0,0 +1,264 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in
+ *the documentation and/or other materials provided with the
+ *distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *contributors may be used to endorse or promote products derived
+ *from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR 

[PATCH 03/14] uapi/rdma/hfi/hfi1_user.h: Convert definitions to use BIT() macro

2015-12-16 Thread ira . weiny
From: Mitko Haralanov 

Convert bit definitions to use BIT() macro as per checkpatch.pl
requirements.

Reviewed-by: Ira Weiny 
Signed-off-by: Mitko Haralanov 
---
 include/uapi/rdma/hfi/hfi1_user.h | 56 +++
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/include/uapi/rdma/hfi/hfi1_user.h 
b/include/uapi/rdma/hfi/hfi1_user.h
index cf172718e3d5..a65f2fe17660 100644
--- a/include/uapi/rdma/hfi/hfi1_user.h
+++ b/include/uapi/rdma/hfi/hfi1_user.h
@@ -83,29 +83,29 @@
  * driver features. The same set of bits are communicated to user
  * space.
  */
-#define HFI1_CAP_DMA_RTAIL(1UL <<  0) /* Use DMA'ed RTail value */
-#define HFI1_CAP_SDMA (1UL <<  1) /* Enable SDMA support */
-#define HFI1_CAP_SDMA_AHG (1UL <<  2) /* Enable SDMA AHG support */
-#define HFI1_CAP_EXTENDED_PSN (1UL <<  3) /* Enable Extended PSN support */
-#define HFI1_CAP_HDRSUPP  (1UL <<  4) /* Enable Header Suppression */
-/* 1UL << 5 unused */
-#define HFI1_CAP_USE_SDMA_HEAD(1UL <<  6) /* DMA Hdr Q tail vs. use CSR */
-#define HFI1_CAP_MULTI_PKT_EGR(1UL <<  7) /* Enable multi-packet Egr 
buffs*/
-#define HFI1_CAP_NODROP_RHQ_FULL  (1UL <<  8) /* Don't drop on Hdr Q full */
-#define HFI1_CAP_NODROP_EGR_FULL  (1UL <<  9) /* Don't drop on EGR buffs full 
*/
-#define HFI1_CAP_TID_UNMAP(1UL << 10) /* Disable Expected TID caching 
*/
-#define HFI1_CAP_PRINT_UNIMPL (1UL << 11) /* Show for unimplemented feats 
*/
-#define HFI1_CAP_ALLOW_PERM_JKEY  (1UL << 12) /* Allow use of permissive JKEY 
*/
-#define HFI1_CAP_NO_INTEGRITY (1UL << 13) /* Enable ctxt integrity checks 
*/
-#define HFI1_CAP_PKEY_CHECK   (1UL << 14) /* Enable ctxt PKey checking */
-#define HFI1_CAP_STATIC_RATE_CTRL (1UL << 15) /* Allow PBC.StaticRateControl */
-/* 1UL << 16 unused */
-#define HFI1_CAP_SDMA_HEAD_CHECK  (1UL << 17) /* SDMA head checking */
-#define HFI1_CAP_EARLY_CREDIT_RETURN (1UL << 18) /* early credit return */
-
-#define HFI1_RCVHDR_ENTSIZE_2(1UL << 0)
-#define HFI1_RCVHDR_ENTSIZE_16   (1UL << 1)
-#define HFI1_RCVDHR_ENTSIZE_32   (1UL << 2)
+#define HFI1_CAP_DMA_RTAILBIT(0) /* Use DMA'ed RTail value */
+#define HFI1_CAP_SDMA BIT(1) /* Enable SDMA support */
+#define HFI1_CAP_SDMA_AHG BIT(2) /* Enable SDMA AHG support */
+#define HFI1_CAP_EXTENDED_PSN BIT(3) /* Enable Extended PSN support */
+#define HFI1_CAP_HDRSUPP  BIT(4) /* Enable Header Suppression */
+/* BIT(5) unused */
+#define HFI1_CAP_USE_SDMA_HEADBIT(6) /* DMA Hdr Q tail vs. use CSR */
+#define HFI1_CAP_MULTI_PKT_EGRBIT(7) /* Enable multi-packet Egr buffs*/
+#define HFI1_CAP_NODROP_RHQ_FULL  BIT(8) /* Don't drop on Hdr Q full */
+#define HFI1_CAP_NODROP_EGR_FULL  BIT(9) /* Don't drop on EGR buffs full */
+#define HFI1_CAP_TID_UNMAPBIT(10) /* Disable Expected TID caching */
+#define HFI1_CAP_PRINT_UNIMPL BIT(11) /* Show for unimplemented feats */
+#define HFI1_CAP_ALLOW_PERM_JKEY  BIT(12) /* Allow use of permissive JKEY */
+#define HFI1_CAP_NO_INTEGRITY BIT(13) /* Enable ctxt integrity checks */
+#define HFI1_CAP_PKEY_CHECK   BIT(14) /* Enable ctxt PKey checking */
+#define HFI1_CAP_STATIC_RATE_CTRL BIT(15) /* Allow PBC.StaticRateControl */
+/* BIT(16) unused */
+#define HFI1_CAP_SDMA_HEAD_CHECK  BIT(17) /* SDMA head checking */
+#define HFI1_CAP_EARLY_CREDIT_RETURN BIT(18) /* early credit return */
+
+#define HFI1_RCVHDR_ENTSIZE_2BIT(0)
+#define HFI1_RCVHDR_ENTSIZE_16   BIT(1)
+#define HFI1_RCVDHR_ENTSIZE_32   BIT(2)
 
 /*
  * If the unit is specified via open, HFI choice is fixed.  If port is
@@ -149,11 +149,11 @@
 #define _HFI1_EVENT_SL2VL_CHANGE_BIT   4
 #define _HFI1_MAX_EVENT_BIT _HFI1_EVENT_SL2VL_CHANGE_BIT
 
-#define HFI1_EVENT_FROZEN(1UL << _HFI1_EVENT_FROZEN_BIT)
-#define HFI1_EVENT_LINKDOWN  (1UL << _HFI1_EVENT_LINKDOWN_BIT)
-#define HFI1_EVENT_LID_CHANGE(1UL << _HFI1_EVENT_LID_CHANGE_BIT)
-#define HFI1_EVENT_LMC_CHANGE(1UL << _HFI1_EVENT_LMC_CHANGE_BIT)
-#define HFI1_EVENT_SL2VL_CHANGE  (1UL << _HFI1_EVENT_SL2VL_CHANGE_BIT)
+#define HFI1_EVENT_FROZENBIT(_HFI1_EVENT_FROZEN_BIT)
+#define HFI1_EVENT_LINKDOWN  BIT(_HFI1_EVENT_LINKDOWN_BIT)
+#define HFI1_EVENT_LID_CHANGEBIT(_HFI1_EVENT_LID_CHANGE_BIT)
+#define HFI1_EVENT_LMC_CHANGEBIT(_HFI1_EVENT_LMC_CHANGE_BIT)
+#define HFI1_EVENT_SL2VL_CHANGE  BIT(_HFI1_EVENT_SL2VL_CHANGE_BIT)
 
 /*
  * These are the status bits readable (in ASCII form, 64bit value)
-- 
1.8.2



[PATCH 00/14] Implement Expected Receive TID Caching

2015-12-16 Thread ira . weiny
From: Ira Weiny 

Expected receives work by user-space libraries (PSM) calling into the driver
with information about the user's receive buffer, having the driver DMA-map
that buffer, and programming the HFI to receive data directly into it.

This is an expensive operation as it requires the driver to pin the pages which
the user's buffer maps to, DMA-map them, and then program the HFI.

When the receive is complete, user-space libraries have to call into the driver
again so the buffer is removed from the HFI, un-mapped, and the pages unpinned.

All of these operations are expensive, considering that a lot of applications
(especially micro-benchmarks) use the same buffer over and over.

In order to get better performance for user-space applications, it is highly
beneficial that they don't continuously call into the driver to register and
unregister the same buffer. Rather, they can register the buffer and cache it
for future work. The buffer can be unregistered when it is freed by the user.

This change implements such buffer caching by making use of the kernel's MMU
notifier API. User-space libraries call into the driver only when they need to
register a new buffer.

Once a buffer is registered, it stays programmed into the HFI until the kernel
notifies the driver that the buffer has been freed by the user. At that time,
the user-space library is notified and it can do the necessary work to remove
the buffer from its cache.
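
The life cycle described above - register once, mark invalid on an MMU
notification, tear down only when the user frees - can be sketched as a
minimal model. This is not driver code; the class and method names are
invented for illustration.

```python
# Minimal model (not driver code) of the buffer-cache life cycle the
# cover letter describes: entries stay programmed after a kernel
# invalidation and are only torn down when the user frees them.
class TidCache:
    def __init__(self):
        self.entries = {}  # buffer id -> {"invalid": bool}

    def register(self, buf):
        # program the HFI once; later transfers reuse the cached entry
        self.entries[buf] = {"invalid": False}

    def mmu_invalidate(self, buf):
        # the kernel notified us the user freed/remapped the memory;
        # mark the entry so user space can learn about it, but do NOT
        # unprogram yet - transfers may still be in flight
        if buf in self.entries:
            self.entries[buf]["invalid"] = True

    def user_free(self, buf):
        # user space requested teardown: unpin/unprogram happens here
        return self.entries.pop(buf, None)
```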

Buffers which have been invalidated by the kernel are not automatically removed
from the HFI and do not have their pages unpinned. Buffers are only completely
removed when the user-space libraries call into the driver to free them.  This
is done to ensure that any ongoing transfers into that buffer are complete.
This is important when a buffer is not completely freed but rather it is
shrunk. The user-space library could still have uncompleted transfers into the
remaining buffer.

With this feature, it is important that systems are set up with reasonable
limits for the amount of lockable memory.  Keeping the limit at "unlimited" (as
we've done up to this point) may result in jobs being killed by the kernel's
OOM killer due to their taking up excessive amounts of memory.
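
One way to bound the limit is via the PAM limits configuration; the value
below is an example only - a suitable limit is site-specific.

```shell
# Check the current per-process locked-memory limit (in kbytes):
ulimit -l

# A bounded limit could be set in /etc/security/limits.conf, e.g.:
#   *  soft  memlock  4194304   # 4 GB - example value, site-specific
#   *  hard  memlock  4194304
```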


TID caching started as a single patch which we have broken up.

Original patch here.

http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2015-November/080855.html


This directly depends on the initial break up work which was submitted before:

http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2015-December/082339.html



Mitko Haralanov (14):
  staging/rdma/hfi1: Add function stubs for TID caching
  uapi/rdma/hfi/hfi1_user.h: Correct comment for capability bit
  uapi/rdma/hfi/hfi1_user.h: Convert definitions to use BIT() macro
  uapi/rdma/hfi/hfi1_user.h: Add command and event for TID caching
  staging/rdma/hfi1: Add definitions needed for TID caching support
  staging/rdma/hfi1: Remove un-needed variable
  staging/rdma/hfi1: Add definitions and support functions for TID groups
  staging/rdma/hfi1: Start adding building blocks for TID caching
  staging/rdma/hfi1: Convert lock to mutex
  staging/rdma/hfi1: Add Expected receive init and free functions
  staging/rdma/hfi1: Add MMU notifier callback function
  staging/rdma/hfi1: Add TID free/clear function bodies
  staging/rdma/hfi1: Add TID entry program function body
  staging/rdma/hfi1: Enable TID caching feature

 drivers/staging/rdma/hfi1/Kconfig|1 +
 drivers/staging/rdma/hfi1/Makefile   |2 +-
 drivers/staging/rdma/hfi1/file_ops.c |  458 +---
 drivers/staging/rdma/hfi1/hfi.h  |   40 +-
 drivers/staging/rdma/hfi1/init.c |5 +-
 drivers/staging/rdma/hfi1/trace.h|  132 ++--
 drivers/staging/rdma/hfi1/user_exp_rcv.c | 1181 ++
 drivers/staging/rdma/hfi1/user_exp_rcv.h |8 +
 drivers/staging/rdma/hfi1/user_pages.c   |   14 -
 include/uapi/rdma/hfi/hfi1_user.h|   68 +-
 10 files changed, 1373 insertions(+), 536 deletions(-)
 create mode 100644 drivers/staging/rdma/hfi1/user_exp_rcv.c

-- 
1.8.2



[PATCH 04/14] uapi/rdma/hfi/hfi1_user.h: Add command and event for TID caching

2015-12-16 Thread ira . weiny
From: Mitko Haralanov 

TID caching will use a new event to signal userland that cache
invalidation has occurred and needs a matching command code that
will be used to read the invalidated TIDs.

Add the event bit and the new command to the exported header file.

The command is also added to the switch() statement in file_ops.c
for completeness and in preparation for its usage later.

Reviewed-by: Ira Weiny 
Signed-off-by: Mitko Haralanov 
---
 drivers/staging/rdma/hfi1/file_ops.c | 1 +
 include/uapi/rdma/hfi/hfi1_user.h| 5 -
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/staging/rdma/hfi1/file_ops.c 
b/drivers/staging/rdma/hfi1/file_ops.c
index d57d549052c8..c66693532be0 100644
--- a/drivers/staging/rdma/hfi1/file_ops.c
+++ b/drivers/staging/rdma/hfi1/file_ops.c
@@ -241,6 +241,7 @@ static ssize_t hfi1_file_write(struct file *fp, const char 
__user *data,
must_be_root = 1;   /* validate user */
copy = 0;
break;
+   case HFI1_CMD_TID_INVAL_READ:
default:
ret = -EINVAL;
goto bail;
diff --git a/include/uapi/rdma/hfi/hfi1_user.h 
b/include/uapi/rdma/hfi/hfi1_user.h
index a65f2fe17660..959204df5318 100644
--- a/include/uapi/rdma/hfi/hfi1_user.h
+++ b/include/uapi/rdma/hfi/hfi1_user.h
@@ -134,6 +134,7 @@
 #define HFI1_CMD_ACK_EVENT   10/* ack & clear user status bits */
 #define HFI1_CMD_SET_PKEY11 /* set context's pkey */
 #define HFI1_CMD_CTXT_RESET  12 /* reset context's HW send context */
+#define HFI1_CMD_TID_INVAL_READ  13 /* read TID cache invalidations */
 /* separate EPROM commands from normal PSM commands */
 #define HFI1_CMD_EP_INFO 64  /* read EPROM device ID */
 #define HFI1_CMD_EP_ERASE_CHIP   65  /* erase whole EPROM */
@@ -147,13 +148,15 @@
 #define _HFI1_EVENT_LID_CHANGE_BIT 2
 #define _HFI1_EVENT_LMC_CHANGE_BIT 3
 #define _HFI1_EVENT_SL2VL_CHANGE_BIT   4
-#define _HFI1_MAX_EVENT_BIT _HFI1_EVENT_SL2VL_CHANGE_BIT
+#define _HFI1_EVENT_TID_MMU_NOTIFY_BIT 5
+#define _HFI1_MAX_EVENT_BIT _HFI1_EVENT_TID_MMU_NOTIFY_BIT
 
 #define HFI1_EVENT_FROZENBIT(_HFI1_EVENT_FROZEN_BIT)
 #define HFI1_EVENT_LINKDOWN  BIT(_HFI1_EVENT_LINKDOWN_BIT)
 #define HFI1_EVENT_LID_CHANGEBIT(_HFI1_EVENT_LID_CHANGE_BIT)
 #define HFI1_EVENT_LMC_CHANGEBIT(_HFI1_EVENT_LMC_CHANGE_BIT)
 #define HFI1_EVENT_SL2VL_CHANGE  BIT(_HFI1_EVENT_SL2VL_CHANGE_BIT)
+#define HFI1_EVENT_TID_MMU_NOTIFYBIT(_HFI1_EVENT_TID_MMU_NOTIFY_BIT)
 
 /*
  * These are the status bits readable (in ASCII form, 64bit value)
-- 
1.8.2



Re: [PATCH 15/15] i40iw: changes for build of i40iw module

2015-12-16 Thread kbuild test robot
Hi Faisal,

[auto build test WARNING on net/master]
[also build test WARNING on v4.4-rc5 next-20151216]
[cannot apply to net-next/master]

url:
https://github.com/0day-ci/linux/commits/Faisal-Latif/add-Intel-R-X722-iWARP-driver/20151217-040340
config: sparc-allyesconfig (attached as .config)
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=sparc 

All warnings (new ones prefixed by >>):

   drivers/infiniband/hw/i40iw/i40iw_verbs.c: In function 
'i40iw_setup_kmode_qp':
>> drivers/infiniband/hw/i40iw/i40iw_verbs.c:571:28: warning: cast to pointer 
>> from integer of different size [-Wint-to-pointer-cast]
 info->rq_pa = (uintptr_t)((u8 *)mem->pa + (sqdepth * 
I40IW_QP_WQE_MIN_SIZE));
   ^

vim +571 drivers/infiniband/hw/i40iw/i40iw_verbs.c

e4d636f5 Faisal Latif 2015-12-16  555   ukinfo->rq_wrid_array = (u64 
*)&ukinfo->sq_wrtrk_array[sqdepth];
e4d636f5 Faisal Latif 2015-12-16  556  
e4d636f5 Faisal Latif 2015-12-16  557   size = (sqdepth + rqdepth) * 
I40IW_QP_WQE_MIN_SIZE;
e4d636f5 Faisal Latif 2015-12-16  558   size += (I40IW_SHADOW_AREA_SIZE << 3);
e4d636f5 Faisal Latif 2015-12-16  559  
e4d636f5 Faisal Latif 2015-12-16  560   status = 
i40iw_allocate_dma_mem(iwdev->sc_dev.hw, mem, size, 256);
e4d636f5 Faisal Latif 2015-12-16  561   if (status) {
e4d636f5 Faisal Latif 2015-12-16  562   kfree(ukinfo->sq_wrtrk_array);
e4d636f5 Faisal Latif 2015-12-16  563   ukinfo->sq_wrtrk_array = NULL;
e4d636f5 Faisal Latif 2015-12-16  564   return -ENOMEM;
e4d636f5 Faisal Latif 2015-12-16  565   }
e4d636f5 Faisal Latif 2015-12-16  566  
e4d636f5 Faisal Latif 2015-12-16  567   ukinfo->sq = mem->va;
e4d636f5 Faisal Latif 2015-12-16  568   info->sq_pa = mem->pa;
e4d636f5 Faisal Latif 2015-12-16  569  
e4d636f5 Faisal Latif 2015-12-16  570   ukinfo->rq = (u64 *)((u8 *)mem->va + 
(sqdepth * I40IW_QP_WQE_MIN_SIZE));
e4d636f5 Faisal Latif 2015-12-16 @571   info->rq_pa = (uintptr_t)((u8 *)mem->pa 
+ (sqdepth * I40IW_QP_WQE_MIN_SIZE));
e4d636f5 Faisal Latif 2015-12-16  572  
e4d636f5 Faisal Latif 2015-12-16  573   ukinfo->shadow_area = (u64 *)((u8 
*)ukinfo->rq +
e4d636f5 Faisal Latif 2015-12-16  574 (rqdepth 
* I40IW_QP_WQE_MIN_SIZE));
e4d636f5 Faisal Latif 2015-12-16  575   info->shadow_area_pa = info->rq_pa + 
(rqdepth * I40IW_QP_WQE_MIN_SIZE);
e4d636f5 Faisal Latif 2015-12-16  576  
e4d636f5 Faisal Latif 2015-12-16  577   ukinfo->sq_size = sq_size;
e4d636f5 Faisal Latif 2015-12-16  578   ukinfo->rq_size = rq_size;
e4d636f5 Faisal Latif 2015-12-16  579   ukinfo->qp_id = iwqp->ibqp.qp_num;

:: The code at line 571 was first introduced by commit
:: e4d636f5c9dea5d2dd1f5c74e3a2235218a537a8 i40iw: add files for iwarp 
interface

:: TO: Faisal Latif 
:: CC: 0day robot 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


[PATCH] IB/mlx4: Replace kfree with kvfree in mlx4_ib_destroy_srq

2015-12-16 Thread Wengang Wang
Commit 0ef2f05c7e02ff99c0b5b583d7dee2cd12b053f2 uses vmalloc for WR buffers
when needed and uses kvfree to free the buffers. It missed changing kfree
to kvfree in mlx4_ib_destroy_srq().

Reported-by: Matthew Finaly 
Signed-off-by: Wengang Wang 
---
 drivers/infiniband/hw/mlx4/srq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c
index 8d133c4..c394376 100644
--- a/drivers/infiniband/hw/mlx4/srq.c
+++ b/drivers/infiniband/hw/mlx4/srq.c
@@ -286,7 +286,7 @@ int mlx4_ib_destroy_srq(struct ib_srq *srq)
mlx4_ib_db_unmap_user(to_mucontext(srq->uobject->context), 
&msrq->db);
ib_umem_release(msrq->umem);
} else {
-   kfree(msrq->wrid);
+   kvfree(msrq->wrid);
mlx4_buf_free(dev->dev, msrq->msrq.max << msrq->msrq.wqe_shift,
  &msrq->buf);
mlx4_db_free(dev->dev, &msrq->db);
-- 
2.1.0



Re: [PATCH V3] IB/mlx4: Use vmalloc for WR buffers when needed

2015-12-16 Thread Wengang Wang

Hi Matt,

Yes, you are right.
Since the patch is already merged in, I am going to make a separated 
patch for that.


thanks,
wengang

在 2015年12月12日 04:28, Matthew Finlay 写道:

Hi Wengang,

I was going through your patch set here, and it seems that you missed changing 
kfree to kvfree in mlx4_ib_destroy_srq().  In the current code if the srq wrid 
is allocated using vmalloc, then on cleanup we will use kfree, which is a bug.

Thanks,
-matt




On 10/7/15, 10:27 PM, "linux-rdma-ow...@vger.kernel.org on behalf of Wengang Wang" 
 wrote:


There are several reports of WR buffer allocation (kmalloc) failing.
It failed at order 3 and/or 4 contiguous-page allocations, while there
was actually 100MB+ of free memory that was badly fragmented.
So try vmalloc when kmalloc fails.

Signed-off-by: Wengang Wang 
Acked-by: Or Gerlitz 
---
drivers/infiniband/hw/mlx4/qp.c  | 19 +--
drivers/infiniband/hw/mlx4/srq.c | 11 ---
2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 4ad9be3..3ccbd3a 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -34,6 +34,7 @@
#include 
#include 
#include 
+#include 

#include 
#include 
@@ -786,8 +787,14 @@ static int create_qp_common(struct mlx4_ib_dev *dev, 
struct ib_pd *pd,
if (err)
goto err_mtt;

-   qp->sq.wrid  = kmalloc(qp->sq.wqe_cnt * sizeof (u64), gfp);
-   qp->rq.wrid  = kmalloc(qp->rq.wqe_cnt * sizeof (u64), gfp);
+   qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof(u64), gfp);
+   if (!qp->sq.wrid)
+   qp->sq.wrid = __vmalloc(qp->sq.wqe_cnt * sizeof(u64),
+   gfp, PAGE_KERNEL);
+   qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof(u64), gfp);
+   if (!qp->rq.wrid)
+   qp->rq.wrid = __vmalloc(qp->rq.wqe_cnt * sizeof(u64),
+   gfp, PAGE_KERNEL);
if (!qp->sq.wrid || !qp->rq.wrid) {
err = -ENOMEM;
goto err_wrid;
@@ -874,8 +881,8 @@ err_wrid:
if (qp_has_rq(init_attr))
mlx4_ib_db_unmap_user(to_mucontext(pd->uobject->context), 
&qp->db);
} else {
-   kfree(qp->sq.wrid);
-   kfree(qp->rq.wrid);
+   kvfree(qp->sq.wrid);
+   kvfree(qp->rq.wrid);
}

err_mtt:
@@ -1050,8 +1057,8 @@ static void destroy_qp_common(struct mlx4_ib_dev *dev, 
struct mlx4_ib_qp *qp,
  &qp->db);
ib_umem_release(qp->umem);
} else {
-   kfree(qp->sq.wrid);
-   kfree(qp->rq.wrid);
+   kvfree(qp->sq.wrid);
+   kvfree(qp->rq.wrid);
if (qp->mlx4_ib_qp_type & (MLX4_IB_QPT_PROXY_SMI_OWNER |
MLX4_IB_QPT_PROXY_SMI | MLX4_IB_QPT_PROXY_GSI))
free_proxy_bufs(&dev->ib_dev, qp);
diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c
index dce5dfe..8d133c4 100644
--- a/drivers/infiniband/hw/mlx4/srq.c
+++ b/drivers/infiniband/hw/mlx4/srq.c
@@ -34,6 +34,7 @@
#include 
#include 
#include 
+#include 

#include "mlx4_ib.h"
#include "user.h"
@@ -172,8 +173,12 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd,

srq->wrid = kmalloc(srq->msrq.max * sizeof (u64), GFP_KERNEL);
if (!srq->wrid) {
-   err = -ENOMEM;
-   goto err_mtt;
+   srq->wrid = __vmalloc(srq->msrq.max * sizeof(u64),
+ GFP_KERNEL, PAGE_KERNEL);
+   if (!srq->wrid) {
+   err = -ENOMEM;
+   goto err_mtt;
+   }
}
}

@@ -204,7 +209,7 @@ err_wrid:
if (pd->uobject)
mlx4_ib_db_unmap_user(to_mucontext(pd->uobject->context), 
&srq->db);
else
-   kfree(srq->wrid);
+   kvfree(srq->wrid);

err_mtt:
mlx4_mtt_cleanup(dev->dev, &srq->mtt);
--
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[PATCH v4 08/10] xprtrdma: Add ro_unmap_sync method for all-physical registration

2015-12-16 Thread Chuck Lever
physical's ro_unmap is synchronous already. The new ro_unmap_sync
method just has to DMA unmap all MRs associated with the RPC
request.

Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/physical_ops.c |   13 +
 1 file changed, 13 insertions(+)

diff --git a/net/sunrpc/xprtrdma/physical_ops.c 
b/net/sunrpc/xprtrdma/physical_ops.c
index 617b76f..dbb302e 100644
--- a/net/sunrpc/xprtrdma/physical_ops.c
+++ b/net/sunrpc/xprtrdma/physical_ops.c
@@ -83,6 +83,18 @@ physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg)
return 1;
 }
 
+/* DMA unmap all memory regions that were mapped for "req".
+ */
+static void
+physical_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+   struct ib_device *device = r_xprt->rx_ia.ri_device;
+   unsigned int i;
+
+   for (i = 0; req->rl_nchunks; --req->rl_nchunks)
+   rpcrdma_unmap_one(device, &req->rl_segments[i++]);
+}
+
 static void
 physical_op_destroy(struct rpcrdma_buffer *buf)
 {
@@ -90,6 +102,7 @@ physical_op_destroy(struct rpcrdma_buffer *buf)
 
 const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = {
.ro_map = physical_op_map,
+   .ro_unmap_sync  = physical_op_unmap_sync,
.ro_unmap   = physical_op_unmap,
.ro_open= physical_op_open,
.ro_maxpages= physical_op_maxpages,



[PATCH v4 09/10] xprtrdma: Invalidate in the RPC reply handler

2015-12-16 Thread Chuck Lever
There is a window between the time the RPC reply handler wakes the
waiting RPC task and when xprt_release() invokes ops->buf_free.
During this time, memory regions containing the data payload may
still be accessed by a broken or malicious server, but the RPC
application has already been allowed access to the memory containing
the RPC request's data payloads.

The server should be fenced from client memory containing RPC data
payloads _before_ the RPC application is allowed to continue.

This change also more strongly enforces send queue accounting. There
is a maximum number of RPC calls allowed to be outstanding. When an
RPC/RDMA transport is set up, just enough send queue resources are
allocated to handle registration, Send, and invalidation WRs for
each of those RPCs at the same time.

Before, additional RPC calls could be dispatched while invalidation
WRs were still consuming send WQEs. When invalidation WRs backed
up, dispatching additional RPCs resulted in a send queue overrun.

Now, the reply handler prevents RPC dispatch until invalidation is
complete. This prevents RPC call dispatch until there are enough
send queue resources to proceed.

Still to do: If an RPC exits early (say, ^C), the reply handler has
no opportunity to perform invalidation. Currently, xprt_rdma_free()
still frees remaining RDMA resources, which could deadlock.
Additional changes are needed to handle invalidation properly in this
case.

Reported-by: Jason Gunthorpe 
Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/rpc_rdma.c |   16 
 1 file changed, 16 insertions(+)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c10d969..0f28f2d 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -804,6 +804,11 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
if (req->rl_reply)
goto out_duplicate;
 
+   /* Sanity checking has passed. We are now committed
+* to complete this transaction.
+*/
+   list_del_init(&rqst->rq_list);
+   spin_unlock_bh(&xprt->transport_lock);
dprintk("RPC:   %s: reply 0x%p completes request 0x%p\n"
"   RPC request 0x%p xid 0x%08x\n",
__func__, rep, req, rqst,
@@ -888,12 +893,23 @@ badheader:
break;
}
 
+   /* Invalidate and flush the data payloads before waking the
+* waiting application. This guarantees the memory region is
+* properly fenced from the server before the application
+* accesses the data. It also ensures proper send flow
+* control: waking the next RPC waits until this RPC has
+* relinquished all its Send Queue entries.
+*/
+   if (req->rl_nchunks)
+   r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);
+
credits = be32_to_cpu(headerp->rm_credit);
if (credits == 0)
credits = 1;/* don't deadlock */
else if (credits > r_xprt->rx_buf.rb_max_requests)
credits = r_xprt->rx_buf.rb_max_requests;
 
+   spin_lock_bh(&xprt->transport_lock);
cwnd = xprt->cwnd;
xprt->cwnd = credits << RPC_CWNDSHIFT;
if (xprt->cwnd > cwnd)



[PATCH v4 10/10] xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').

2015-12-16 Thread Chuck Lever
The root of the problem was that sends (especially unsignalled
FASTREG and LOCAL_INV Work Requests) were not properly flow-
controlled, which allowed a send queue overrun.

Now that the RPC/RDMA reply handler waits for invalidation to
complete, the send queue is properly flow-controlled. Thus this
limit is no longer necessary.

Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/verbs.c |6 ++
 net/sunrpc/xprtrdma/xprt_rdma.h |6 --
 2 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index f23f3d6..1867e3a 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -608,10 +608,8 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia 
*ia,
 
/* set trigger for requesting send completion */
ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1;
-   if (ep->rep_cqinit > RPCRDMA_MAX_UNSIGNALED_SENDS)
-   ep->rep_cqinit = RPCRDMA_MAX_UNSIGNALED_SENDS;
-   else if (ep->rep_cqinit <= 2)
-   ep->rep_cqinit = 0;
+   if (ep->rep_cqinit <= 2)
+   ep->rep_cqinit = 0; /* always signal? */
INIT_CQCOUNT(ep);
init_waitqueue_head(&ep->rep_connect_wait);
INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker);
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index b8bac41..a563ffc 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -87,12 +87,6 @@ struct rpcrdma_ep {
struct delayed_work rep_connect_worker;
 };
 
-/*
- * Force a signaled SEND Work Request every so often,
- * in case the provider needs to do some housekeeping.
- */
-#define RPCRDMA_MAX_UNSIGNALED_SENDS   (32)
-
 #define INIT_CQCOUNT(ep) atomic_set(&(ep)->rep_cqcount, (ep)->rep_cqinit)
 #define DECR_CQCOUNT(ep) atomic_sub_return(1, &(ep)->rep_cqcount)
 



[PATCH v4 06/10] xprtrdma: Add ro_unmap_sync method for FRWR

2015-12-16 Thread Chuck Lever
FRWR's ro_unmap is asynchronous. The new ro_unmap_sync posts
LOCAL_INV Work Requests and waits for them to complete before
returning.

Note also, DMA unmapping is now done _after_ invalidation.
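The patch below chains the LOCAL_INV WRs and signals only the last one, so a single completion event proves the entire chain has finished. A minimal userspace sketch of that "signal only the tail" idea, with purely illustrative names:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of the completion scheme: WRs are linked into a chain, only
 * the tail is marked signaled, and because a send queue completes WRs
 * in order, one completion for the tail implies the whole chain is done.
 */

struct fake_wr {
	struct fake_wr *next;
	int signaled;
};

/* Chain wrs[0..n-1] and signal only the tail. */
static struct fake_wr *chain_wrs(struct fake_wr *wrs, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		wrs[i].next = (i + 1 < n) ? &wrs[i + 1] : NULL;
		wrs[i].signaled = (i + 1 == n);
	}
	return &wrs[0];
}

/*
 * "Post" the chain: an in-order queue generates exactly one completion
 * event, for the tail WR (where the real code does complete(&fr_linv_done)).
 */
static int post_and_count_completions(struct fake_wr *head)
{
	int completions = 0;

	for (; head; head = head->next)
		if (head->signaled)
			completions++;
	return completions;
}
```

This is why frwr_op_unmap_sync can sleep on a single fr_linv_done completion instead of one per MR.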

Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/frwr_ops.c  |  136 ++-
 net/sunrpc/xprtrdma/xprt_rdma.h |2 +
 2 files changed, 134 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 660d0b6..aa078a0 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -244,12 +244,14 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt)
 rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth);
 }
 
-/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs to be reset. */
+/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs
+ * to be reset.
+ *
+ * WARNING: Only wr_id and status are reliable at this point
+ */
 static void
-frwr_sendcompletion(struct ib_wc *wc)
+__frwr_sendcompletion_flush(struct ib_wc *wc, struct rpcrdma_mw *r)
 {
-   struct rpcrdma_mw *r;
-
if (likely(wc->status == IB_WC_SUCCESS))
return;
 
@@ -260,9 +262,23 @@ frwr_sendcompletion(struct ib_wc *wc)
else
pr_warn("RPC:   %s: frmr %p error, status %s (%d)\n",
__func__, r, ib_wc_status_msg(wc->status), wc->status);
+
r->r.frmr.fr_state = FRMR_IS_STALE;
 }
 
+static void
+frwr_sendcompletion(struct ib_wc *wc)
+{
+   struct rpcrdma_mw *r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
+   struct rpcrdma_frmr *f = &r->r.frmr;
+
+   if (unlikely(wc->status != IB_WC_SUCCESS))
+   __frwr_sendcompletion_flush(wc, r);
+
+   if (f->fr_waiter)
+   complete(&f->fr_linv_done);
+}
+
 static int
 frwr_op_init(struct rpcrdma_xprt *r_xprt)
 {
@@ -334,6 +350,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
} while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
frmr = &mw->r.frmr;
frmr->fr_state = FRMR_IS_VALID;
+   frmr->fr_waiter = false;
mr = frmr->fr_mr;
reg_wr = &frmr->fr_regwr;
 
@@ -413,6 +430,116 @@ out_senderr:
return rc;
 }
 
+static struct ib_send_wr *
+__frwr_prepare_linv_wr(struct rpcrdma_mr_seg *seg)
+{
+   struct rpcrdma_mw *mw = seg->rl_mw;
+   struct rpcrdma_frmr *f = &mw->r.frmr;
+   struct ib_send_wr *invalidate_wr;
+
+   f->fr_waiter = false;
+   f->fr_state = FRMR_IS_INVALID;
+   invalidate_wr = &f->fr_invwr;
+
+   memset(invalidate_wr, 0, sizeof(*invalidate_wr));
+   invalidate_wr->wr_id = (unsigned long)(void *)mw;
+   invalidate_wr->opcode = IB_WR_LOCAL_INV;
+   invalidate_wr->ex.invalidate_rkey = f->fr_mr->rkey;
+
+   return invalidate_wr;
+}
+
+static void
+__frwr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
+int rc)
+{
+   struct ib_device *device = r_xprt->rx_ia.ri_device;
+   struct rpcrdma_mw *mw = seg->rl_mw;
+   struct rpcrdma_frmr *f = &mw->r.frmr;
+
+   seg->rl_mw = NULL;
+
+   ib_dma_unmap_sg(device, f->sg, f->sg_nents, seg->mr_dir);
+
+   if (!rc)
+   rpcrdma_put_mw(r_xprt, mw);
+   else
+   __frwr_queue_recovery(mw);
+}
+
+/* Invalidate all memory regions that were registered for "req".
+ *
+ * Sleeps until it is safe for the host CPU to access the
+ * previously mapped memory regions.
+ */
+static void
+frwr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+   struct ib_send_wr *invalidate_wrs, *pos, *prev, *bad_wr;
+   struct rpcrdma_ia *ia = &r_xprt->rx_ia;
+   struct rpcrdma_mr_seg *seg;
+   unsigned int i, nchunks;
+   struct rpcrdma_frmr *f;
+   int rc;
+
+   dprintk("RPC:   %s: req %p\n", __func__, req);
+
+   /* ORDER: Invalidate all of the req's MRs first
+*
+* Chain the LOCAL_INV Work Requests and post them with
+* a single ib_post_send() call.
+*/
+   invalidate_wrs = pos = prev = NULL;
+   seg = NULL;
+   for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+   seg = &req->rl_segments[i];
+
+   pos = __frwr_prepare_linv_wr(seg);
+
+   if (!invalidate_wrs)
+   invalidate_wrs = pos;
+   else
+   prev->next = pos;
+   prev = pos;
+
+   i += seg->mr_nsegs;
+   }
+   f = &seg->rl_mw->r.frmr;
+
+   /* Strong send queue ordering guarantees that when the
+* last WR in the chain completes, all WRs in the chain
+* are complete.
+*/
+   f->fr_invwr.send_flags = IB_SEND_SIGNALED;
+   f->fr_waiter = true;
+   init_completion(&f->fr_linv_done);
+   INIT_CQCOUNT(&r_xprt->rx_ep);
+
+   /* Transport disconnect drains the receive CQ before it
+* replaces the Q

[PATCH v4 07/10] xprtrdma: Add ro_unmap_sync method for FMR

2015-12-16 Thread Chuck Lever
FMR's ro_unmap method is already synchronous because ib_unmap_fmr()
is a synchronous verb. However, some improvements can be made here.

1. Gather all the MRs for the RPC request onto a list, and invoke
   ib_unmap_fmr() once with that list. This reduces the number of
   doorbells when there is more than one MR to invalidate

2. Perform the DMA unmap _after_ the MRs are unmapped, not before.
   This is critical after invalidating a Write chunk.
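The batching idea in point 1 can be sketched in userspace: gather the MRs on a list and issue one "unmap" call for the whole batch instead of one per MR. All names below are illustrative stand-ins for the kernel objects:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace analog of gathering FMRs onto a list and unmapping them
 * with a single call, so the expensive operation (ib_unmap_fmr, which
 * rings a doorbell) runs once per batch rather than once per MR.
 */

struct fake_mr {
	struct fake_mr *next;
	int mapped;
};

static int unmap_calls;	/* stand-in for the doorbell count */

/* Analog of a single ib_unmap_fmr() over the whole list. */
static void batch_unmap(struct fake_mr *list)
{
	unmap_calls++;
	for (; list; list = list->next)
		list->mapped = 0;
}
```

With N MRs per RPC this turns N expensive calls into one, which is the whole benefit claimed in point 1.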

Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/fmr_ops.c |   64 +
 1 file changed, 64 insertions(+)

diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c
index f1e8daf..c14f3a4 100644
--- a/net/sunrpc/xprtrdma/fmr_ops.c
+++ b/net/sunrpc/xprtrdma/fmr_ops.c
@@ -179,6 +179,69 @@ out_maperr:
return rc;
 }
 
+static void
+__fmr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg)
+{
+   struct ib_device *device = r_xprt->rx_ia.ri_device;
+   struct rpcrdma_mw *mw = seg->rl_mw;
+   int nsegs = seg->mr_nsegs;
+
+   seg->rl_mw = NULL;
+
+   while (nsegs--)
+   rpcrdma_unmap_one(device, seg++);
+
+   rpcrdma_put_mw(r_xprt, mw);
+}
+
+/* Invalidate all memory regions that were registered for "req".
+ *
+ * Sleeps until it is safe for the host CPU to access the
+ * previously mapped memory regions.
+ */
+static void
+fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req)
+{
+   struct rpcrdma_mr_seg *seg;
+   unsigned int i, nchunks;
+   struct rpcrdma_mw *mw;
+   LIST_HEAD(unmap_list);
+   int rc;
+
+   dprintk("RPC:   %s: req %p\n", __func__, req);
+
+   /* ORDER: Invalidate all of the req's MRs first
+*
+* ib_unmap_fmr() is slow, so use a single call instead
+* of one call per mapped MR.
+*/
+   for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+   seg = &req->rl_segments[i];
+   mw = seg->rl_mw;
+
+   list_add(&mw->r.fmr.fmr->list, &unmap_list);
+
+   i += seg->mr_nsegs;
+   }
+   rc = ib_unmap_fmr(&unmap_list);
+   if (rc)
+   pr_warn("%s: ib_unmap_fmr failed (%i)\n", __func__, rc);
+
+   /* ORDER: Now DMA unmap all of the req's MRs, and return
+* them to the free MW list.
+*/
+   for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) {
+   seg = &req->rl_segments[i];
+
+   __fmr_dma_unmap(r_xprt, seg);
+
+   i += seg->mr_nsegs;
+   seg->mr_nsegs = 0;
+   }
+
+   req->rl_nchunks = 0;
+}
+
 /* Use the ib_unmap_fmr() verb to prevent further remote
  * access via RDMA READ or RDMA WRITE.
  */
@@ -231,6 +294,7 @@ fmr_op_destroy(struct rpcrdma_buffer *buf)
 
 const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = {
.ro_map = fmr_op_map,
+   .ro_unmap_sync  = fmr_op_unmap_sync,
.ro_unmap   = fmr_op_unmap,
.ro_open= fmr_op_open,
.ro_maxpages= fmr_op_maxpages,



[PATCH v4 04/10] xprtrdma: Move struct ib_send_wr off the stack

2015-12-16 Thread Chuck Lever
For FRWR FASTREG and LOCAL_INV, move the ib_*_wr structure off
the stack. This allows frwr_op_map and frwr_op_unmap to chain
WRs together without limit to register or invalidate a set of MRs
with a single ib_post_send().

(This will be for chaining LOCAL_INV requests).

Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/frwr_ops.c  |   38 --
 net/sunrpc/xprtrdma/xprt_rdma.h |4 
 2 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index ae2a241..660d0b6 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -318,7 +318,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
struct rpcrdma_mw *mw;
struct rpcrdma_frmr *frmr;
struct ib_mr *mr;
-   struct ib_reg_wr reg_wr;
+   struct ib_reg_wr *reg_wr;
struct ib_send_wr *bad_wr;
int rc, i, n, dma_nents;
u8 key;
@@ -335,6 +335,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
frmr = &mw->r.frmr;
frmr->fr_state = FRMR_IS_VALID;
mr = frmr->fr_mr;
+   reg_wr = &frmr->fr_regwr;
 
if (nsegs > ia->ri_max_frmr_depth)
nsegs = ia->ri_max_frmr_depth;
@@ -380,19 +381,19 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
key = (u8)(mr->rkey & 0x00FF);
ib_update_fast_reg_key(mr, ++key);
 
-   reg_wr.wr.next = NULL;
-   reg_wr.wr.opcode = IB_WR_REG_MR;
-   reg_wr.wr.wr_id = (uintptr_t)mw;
-   reg_wr.wr.num_sge = 0;
-   reg_wr.wr.send_flags = 0;
-   reg_wr.mr = mr;
-   reg_wr.key = mr->rkey;
-   reg_wr.access = writing ?
-   IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
-   IB_ACCESS_REMOTE_READ;
+   reg_wr->wr.next = NULL;
+   reg_wr->wr.opcode = IB_WR_REG_MR;
+   reg_wr->wr.wr_id = (uintptr_t)mw;
+   reg_wr->wr.num_sge = 0;
+   reg_wr->wr.send_flags = 0;
+   reg_wr->mr = mr;
+   reg_wr->key = mr->rkey;
+   reg_wr->access = writing ?
+IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+IB_ACCESS_REMOTE_READ;
 
DECR_CQCOUNT(&r_xprt->rx_ep);
-   rc = ib_post_send(ia->ri_id->qp, &reg_wr.wr, &bad_wr);
+   rc = ib_post_send(ia->ri_id->qp, &reg_wr->wr, &bad_wr);
if (rc)
goto out_senderr;
 
@@ -422,23 +423,24 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_mw *mw = seg1->rl_mw;
struct rpcrdma_frmr *frmr = &mw->r.frmr;
-   struct ib_send_wr invalidate_wr, *bad_wr;
+   struct ib_send_wr *invalidate_wr, *bad_wr;
int rc, nsegs = seg->mr_nsegs;
 
dprintk("RPC:   %s: FRMR %p\n", __func__, mw);
 
seg1->rl_mw = NULL;
frmr->fr_state = FRMR_IS_INVALID;
+   invalidate_wr = &mw->r.frmr.fr_invwr;
 
-   memset(&invalidate_wr, 0, sizeof(invalidate_wr));
-   invalidate_wr.wr_id = (unsigned long)(void *)mw;
-   invalidate_wr.opcode = IB_WR_LOCAL_INV;
-   invalidate_wr.ex.invalidate_rkey = frmr->fr_mr->rkey;
+   memset(invalidate_wr, 0, sizeof(*invalidate_wr));
+   invalidate_wr->wr_id = (uintptr_t)mw;
+   invalidate_wr->opcode = IB_WR_LOCAL_INV;
+   invalidate_wr->ex.invalidate_rkey = frmr->fr_mr->rkey;
DECR_CQCOUNT(&r_xprt->rx_ep);
 
ib_dma_unmap_sg(ia->ri_device, frmr->sg, frmr->sg_nents, seg1->mr_dir);
read_lock(&ia->ri_qplock);
-   rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
+   rc = ib_post_send(ia->ri_id->qp, invalidate_wr, &bad_wr);
read_unlock(&ia->ri_qplock);
if (rc)
goto out_err;
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 4197191..b1065ca 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -206,6 +206,10 @@ struct rpcrdma_frmr {
enum rpcrdma_frmr_state fr_state;
struct work_struct  fr_work;
struct rpcrdma_xprt *fr_xprt;
+   union {
+   struct ib_reg_wrfr_regwr;
+   struct ib_send_wr   fr_invwr;
+   };
 };
 
 struct rpcrdma_fmr {



[PATCH v4 02/10] xprtrdma: xprt_rdma_free() must not release backchannel reqs

2015-12-16 Thread Chuck Lever
Preserve any rpcrdma_req that is attached to rpc_rqst's allocated
for the backchannel. Otherwise, after all the pre-allocated
backchannel req's are consumed, incoming backward calls start
writing on freed memory.

Somehow this hunk got lost.

Fixes: f531a5dbc451 ('xprtrdma: Pre-allocate backward rpc_rqst')
Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/transport.c |3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 8c545f7..740bddc 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -576,6 +576,9 @@ xprt_rdma_free(void *buffer)
 
rb = container_of(buffer, struct rpcrdma_regbuf, rg_base[0]);
req = rb->rg_owner;
+   if (req->rl_backchannel)
+   return;
+
r_xprt = container_of(req->rl_buffer, struct rpcrdma_xprt, rx_buf);
 
dprintk("RPC:   %s: called on 0x%p\n", __func__, req->rl_reply);



[PATCH v4 03/10] xprtrdma: Disable RPC/RDMA backchannel debugging messages

2015-12-16 Thread Chuck Lever
Clean up.

Fixes: 63cae47005af ('xprtrdma: Handle incoming backward direction')
Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/backchannel.c |   16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/sunrpc/xprtrdma/backchannel.c 
b/net/sunrpc/xprtrdma/backchannel.c
index 11d2cfb..cd31181 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -15,7 +15,7 @@
 # define RPCDBG_FACILITY   RPCDBG_TRANS
 #endif
 
-#define RPCRDMA_BACKCHANNEL_DEBUG
+#undef RPCRDMA_BACKCHANNEL_DEBUG
 
 static void rpcrdma_bc_free_rqst(struct rpcrdma_xprt *r_xprt,
 struct rpc_rqst *rqst)
@@ -136,6 +136,7 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int 
reqs)
   __func__);
goto out_free;
}
+   dprintk("RPC:   %s: new rqst %p\n", __func__, rqst);
 
rqst->rq_xprt = &r_xprt->rx_xprt;
INIT_LIST_HEAD(&rqst->rq_list);
@@ -216,12 +217,14 @@ int rpcrdma_bc_marshal_reply(struct rpc_rqst *rqst)
 
rpclen = rqst->rq_svec[0].iov_len;
 
+#ifdef RPCRDMA_BACKCHANNEL_DEBUG
pr_info("RPC:   %s: rpclen %zd headerp 0x%p lkey 0x%x\n",
__func__, rpclen, headerp, rdmab_lkey(req->rl_rdmabuf));
pr_info("RPC:   %s: RPC/RDMA: %*ph\n",
__func__, (int)RPCRDMA_HDRLEN_MIN, headerp);
pr_info("RPC:   %s:  RPC: %*ph\n",
__func__, (int)rpclen, rqst->rq_svec[0].iov_base);
+#endif
 
req->rl_send_iov[0].addr = rdmab_addr(req->rl_rdmabuf);
req->rl_send_iov[0].length = RPCRDMA_HDRLEN_MIN;
@@ -265,6 +268,9 @@ void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst)
 {
struct rpc_xprt *xprt = rqst->rq_xprt;
 
+   dprintk("RPC:   %s: freeing rqst %p (req %p)\n",
+   __func__, rqst, rpcr_to_rdmar(rqst));
+
smp_mb__before_atomic();
WARN_ON_ONCE(!test_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state));
clear_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state);
@@ -329,9 +335,7 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt,
struct rpc_rqst, rq_bc_pa_list);
list_del(&rqst->rq_bc_pa_list);
spin_unlock(&xprt->bc_pa_lock);
-#ifdef RPCRDMA_BACKCHANNEL_DEBUG
-   pr_info("RPC:   %s: using rqst %p\n", __func__, rqst);
-#endif
+   dprintk("RPC:   %s: using rqst %p\n", __func__, rqst);
 
/* Prepare rqst */
rqst->rq_reply_bytes_recvd = 0;
@@ -351,10 +355,8 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt,
 * direction reply.
 */
req = rpcr_to_rdmar(rqst);
-#ifdef RPCRDMA_BACKCHANNEL_DEBUG
-   pr_info("RPC:   %s: attaching rep %p to req %p\n",
+   dprintk("RPC:   %s: attaching rep %p to req %p\n",
__func__, rep, req);
-#endif
req->rl_reply = rep;
 
/* Defeat the retransmit detection logic in send_request */



[PATCH v4 05/10] xprtrdma: Introduce ro_unmap_sync method

2015-12-16 Thread Chuck Lever
In the current xprtrdma implementation, some memreg strategies
implement ro_unmap synchronously (the MR is knocked down before the
method returns) and some asynchronously (the MR will be knocked down
and returned to the pool in the background).

To guarantee the MR is truly invalid before the RPC consumer is
allowed to resume execution, we need an unmap method that is
always synchronous, invoked from the RPC/RDMA reply handler.

The new method unmaps all MRs for an RPC. The existing ro_unmap
method unmaps only one MR at a time.

Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/xprt_rdma.h |2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index b1065ca..512184d 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -367,6 +367,8 @@ struct rpcrdma_xprt;
 struct rpcrdma_memreg_ops {
int (*ro_map)(struct rpcrdma_xprt *,
  struct rpcrdma_mr_seg *, int, bool);
+   void(*ro_unmap_sync)(struct rpcrdma_xprt *,
+struct rpcrdma_req *);
int (*ro_unmap)(struct rpcrdma_xprt *,
struct rpcrdma_mr_seg *);
int (*ro_open)(struct rpcrdma_ia *,



[PATCH v4 01/10] xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)

2015-12-16 Thread Chuck Lever
Clean up.

rb_lock critical sections added in rpcrdma_ep_post_extra_recv()
should have first been converted to use normal spin_lock now that
the reply handler is a work queue.

The backchannel set up code should use the appropriate helper
instead of open-coding a rb_recv_bufs list add.

Problem introduced by glib patch re-ordering on my part.

Fixes: f531a5dbc451 ('xprtrdma: Pre-allocate backward rpc_rqst')
Signed-off-by: Chuck Lever 
Tested-by: Devesh Sharma 
---
 net/sunrpc/xprtrdma/backchannel.c |6 +-
 net/sunrpc/xprtrdma/verbs.c   |7 +++
 2 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/net/sunrpc/xprtrdma/backchannel.c 
b/net/sunrpc/xprtrdma/backchannel.c
index 2dcb44f..11d2cfb 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -84,9 +84,7 @@ out_fail:
 static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
 unsigned int count)
 {
-   struct rpcrdma_buffer *buffers = &r_xprt->rx_buf;
struct rpcrdma_rep *rep;
-   unsigned long flags;
int rc = 0;
 
while (count--) {
@@ -98,9 +96,7 @@ static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt,
break;
}
 
-   spin_lock_irqsave(&buffers->rb_lock, flags);
-   list_add(&rep->rr_list, &buffers->rb_recv_bufs);
-   spin_unlock_irqrestore(&buffers->rb_lock, flags);
+   rpcrdma_recv_buffer_put(rep);
}
 
return rc;
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 650034b..f23f3d6 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -1329,15 +1329,14 @@ rpcrdma_ep_post_extra_recv(struct rpcrdma_xprt *r_xprt, 
unsigned int count)
struct rpcrdma_ia *ia = &r_xprt->rx_ia;
struct rpcrdma_ep *ep = &r_xprt->rx_ep;
struct rpcrdma_rep *rep;
-   unsigned long flags;
int rc;
 
while (count--) {
-   spin_lock_irqsave(&buffers->rb_lock, flags);
+   spin_lock(&buffers->rb_lock);
if (list_empty(&buffers->rb_recv_bufs))
goto out_reqbuf;
rep = rpcrdma_buffer_get_rep_locked(buffers);
-   spin_unlock_irqrestore(&buffers->rb_lock, flags);
+   spin_unlock(&buffers->rb_lock);
 
rc = rpcrdma_ep_post_recv(ia, ep, rep);
if (rc)
@@ -1347,7 +1346,7 @@ rpcrdma_ep_post_extra_recv(struct rpcrdma_xprt *r_xprt, 
unsigned int count)
return 0;
 
 out_reqbuf:
-   spin_unlock_irqrestore(&buffers->rb_lock, flags);
+   spin_unlock(&buffers->rb_lock);
pr_warn("%s: no extra receive buffers\n", __func__);
return -ENOMEM;
 



[PATCH v4 00/10] NFS/RDMA client patches for 4.5

2015-12-16 Thread Chuck Lever
For 4.5, I'd like to address the send queue accounting and
invalidation/unmap ordering issues Jason brought up a couple of
months ago.

Also available in the "nfs-rdma-for-4.5" topic branch of this git repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git

Or for browsing:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.5


Changes since v3:
- Dropped xprt_commit_rqst()
- __frmr_dma_unmap now uses ib_dma_unmap_sg()
- Use transparent union in struct rpcrdma_frmr


Changes since v2:
- Rebased on Christoph's ib_device_attr branch


Changes since v1:

- Rebased on v4.4-rc3
- Receive buffer safety margin patch dropped
- Backchannel pr_err and pr_info converted to dprintk
- Backchannel spin locks converted to work queue-safe locks
- Fixed premature release of backchannel request buffer
- NFSv4.1 callbacks tested with for-4.5 server

---

Chuck Lever (10):
  xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)
  xprtrdma: xprt_rdma_free() must not release backchannel reqs
  xprtrdma: Disable RPC/RDMA backchannel debugging messages
  xprtrdma: Move struct ib_send_wr off the stack
  xprtrdma: Introduce ro_unmap_sync method
  xprtrdma: Add ro_unmap_sync method for FRWR
  xprtrdma: Add ro_unmap_sync method for FMR
  xprtrdma: Add ro_unmap_sync method for all-physical registration
  xprtrdma: Invalidate in the RPC reply handler
  xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').


 net/sunrpc/xprtrdma/backchannel.c  |   22 ++---
 net/sunrpc/xprtrdma/fmr_ops.c  |   64 +
 net/sunrpc/xprtrdma/frwr_ops.c |  174 +++-
 net/sunrpc/xprtrdma/physical_ops.c |   13 +++
 net/sunrpc/xprtrdma/rpc_rdma.c |   16 +++
 net/sunrpc/xprtrdma/transport.c|3 +
 net/sunrpc/xprtrdma/verbs.c|   13 +--
 net/sunrpc/xprtrdma/xprt_rdma.h|   14 ++-
 8 files changed, 271 insertions(+), 48 deletions(-)

--
Chuck Lever


Re: [PATCH 00/15] add Intel(R) X722 iWARP driver

2015-12-16 Thread Joe Perches
On Wed, 2015-12-16 at 13:58 -0600, Faisal Latif wrote:
> This series contains the addition of the i40iw.ko driver.

This series should probably be respun against -next
instead of linus' tree.



Re: [PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-16 Thread Jason Gunthorpe
On Wed, Dec 16, 2015 at 03:39:16PM -0500, Doug Ledford wrote:

> These patches add the concept of duplicate GIDs that are differentiated
> by their RoCE version (also called network type).

and by vlan, and smac, and ... Basically everything network-unique about
a namespace has to be encapsulated in the gid index now. Each namespace
thus has a subset of gid indexes that are valid for it to use for
outbound and to receive packets on.

roce didn't really have a way to work with net namespaces, AFAIK (?)
so it gets a pass.

But rocev2 very clearly does. It needs to address the issue
outlined in commit b8cab5dab15ff5c2acc3faefdde28919b0341c11 (IB/cma:
Accept connection without a valid netdev on RoCE)

That means cma.c needs to get the gid index every single CMA packet it
processes and confirm that the associated net device is permitted to
talk to the matching CM ID. 

It is no mistake there is a hole in cma.c waiting for this code, when
Haggai did that work it was very clear in my mind that rocev2 would
need to slot into here as well.

> Jason's objections are this:
> 
> 1)  The lazy resolution is wrong.

Wrong in the sense it doesn't actually exist in a usable form
anyplace.

cma.c does not do it, and absolutely must as discussed above.

init_ah_from_wc needs to do it, and maybe does. It is hard to tell,
perhaps a 'rdma_wc_to_dgid_index()' is actually open coded in there
now. Just from a code readability perspective that is ugly.

Then we get into the missing route handling in all places that
construct a rocev2 AH...

> Jason's preference would be that the above issues be resolved by
> skipping the lazy resolution and instead doing proactive resolution
> on

I am happy with lazy resolution, that is a fine compromise.

I just want to see a kAPI that makes sense here. It is very clear to me
no kernel user can possibly correctly touch a rocev2 UD packet without
retrieving the gid index, so we must have a kAPI for this.

> namespace.  Or, at a minimum, at least make the information added to the
> core API not something vendor specific like network_type, which is a
> detail of the Mellanox implementation.

I keep suggesting a rdma_wc_to_dgid_index() API call.

Perhaps most of the code for this already exists in
init_ah_from_wc.

> 1 - Actually, for any received packet with associated IP address
> information.  We've only enabled net namespaces for IP connections
> between user space applications, for direct IB connections or for kernel
> connections there is not yet any namespace support.

IMHO, this is actually a problem for rocev2.

IB needs more work to create an rdma namespace, but rocev2 does not.

The kernel software side should certainly be completed as a quick
follow on to this series, that means the use of gid_indexes at all
uAPI access points needs to be checked for rocev2.

HW support is needed to complete rocev2 containment, as the hw must
check the gid_index on all directly posted WCs and *ALL* rx'd packets
for a QP to ensure it is allowed.

Some kind of warn-on until that support is available would also be
great.

Jason


Re: warning in ext4 with nfs/rdma server

2015-12-16 Thread J. Bruce Fields
On Tue, Dec 08, 2015 at 07:31:56AM -0600, Steve Wise wrote:
> 
> 
> > -Original Message-
> > From: Chuck Lever [mailto:chuck.le...@oracle.com]
> > Sent: Monday, December 07, 2015 9:45 AM
> > To: Steve Wise
> > Cc: linux-rdma@vger.kernel.org; Veeresh U. Kokatnur; Linux NFS Mailing List
> > Subject: Re: warning in ext4 with nfs/rdma server
> > 
> > Hi Steve-
> > 
> > > On Dec 7, 2015, at 10:38 AM, Steve Wise  
> > > wrote:
> > >
> > > Hey Chuck/NFS developers,
> > >
> > > We're hitting this warning in ext4 on the linux-4.3 nfs server running 
> > > over RDMA/cxgb4.  We're still gathering data, like if it
> > > happens with NFS/TCP.  But has anyone seen this warning on 4.3?  Is it 
> > > likely to indicate some bug in the xprtrdma transport or
> > > above it in NFS?
> > 
> > Yes, please confirm with NFS/TCP. Thanks!
> >
> 
> The same thing happens with NFS/TCP, so this isn't related to xprtrdma.
>  
> > 
> > > We can hit this running cthon tests over 2 mount points:
> > >
> > > -
> > > #!/bin/bash
> > > rm -rf /root/cthon04/loop_iter.txt
> > > while [ 1 ]
> > > do
> > > {
> > >
> > > ./server -s -m /mnt/share1 -o rdma,port=20049,vers=4 -p /mnt/share1 -N 100
> > > 102.1.1.162 &
> > > ./server -s -m /mnt/share2 -o 
> > > rdma,port=20049,vers=3,rsize=65535,wsize=65535 -p
> > > /mnt/share2 -N 100 102.2.2.162 &
> > > wait
> > > echo "iteration $i" >>/root/cthon04/loop_iter.txt
> > > date >>/root/cthon04/loop_iter.txt
> > > }
> > > done
> > > --
> > >
> > > Thanks,
> > >
> > > Steve.
> > >
> > > [ cut here ]
> > > WARNING: CPU: 14 PID: 6689 at fs/ext4/inode.c:231 
> > > ext4_evict_inode+0x41e/0x490

Looks like this is the

WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));

in ext4_evict_inode?  Ext4 developers, any idea how that could happen?

--b.


> > > [ext4]()
> > > Modules linked in: nfsd(E) lockd(E) grace(E) nfs_acl(E) exportfs(E)
> > > auth_rpcgss(E) rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_uverbs(E) rdma_cm(E)
> > > ib_cm(E) ib_sa(E) ib_mad(E) iw_cxgb4(E) iw_cm(E) ib_core(E) ib_addr(E) 
> > > cxgb4(E)
> > > autofs4(E) target_core_iblock(E) target_core_file(E) target_core_pscsi(E)
> > > target_core_mod(E) configfs(E) bnx2fc(E) cnic(E) uio(E) fcoe(E) libfcoe(E)
> > > 8021q(E) libfc(E) garp(E) stp(E) llc(E) cpufreq_ondemand(E) cachefiles(E)
> > > fscache(E) ipv6(E) dm_mirror(E) dm_region_hash(E) dm_log(E) vhost_net(E)
> > > macvtap(E) macvlan(E) vhost(E) tun(E) kvm(E) uinput(E) microcode(E) sg(E)
> > > pcspkr(E) serio_raw(E) fam15h_power(E) k10temp(E) amd64_edac_mod(E)
> > > edac_core(E) edac_mce_amd(E) i2c_piix4(E) igb(E) dca(E) i2c_algo_bit(E)
> > > i2c_core(E) ptp(E) pps_core(E) scsi_transport_fc(E) acpi_cpufreq(E) 
> > > dm_mod(E)
> > > ext4(E) jbd2(E) mbcache(E) sr_mod(E) cdrom(E) sd_mod(E) ahci(E) libahci(E)
> > > [last unloaded: cxgb4]
> > > CPU: 14 PID: 6689 Comm: nfsd Tainted: GE   4.3.0 #1
> > > Hardware name: Supermicro H8QGL/H8QGL, BIOS 3.512/19/2013
> > > 00e7 88400634fad8 812a4084 a00c96eb
> > >  88400634fb18 81059fd5 88400634fbd8
> > > 880fd1a460c8 880fd1a461d8 880fd1a46008 88400634fbd8
> > > Call Trace:
> > > [] dump_stack+0x48/0x64
> > > [] warn_slowpath_common+0x95/0xe0
> > > [] warn_slowpath_null+0x1a/0x20
> > > [] ext4_evict_inode+0x41e/0x490 [ext4]
> > > [] evict+0xae/0x1a0
> > > [] iput_final+0xe5/0x170
> > > [] iput+0xa3/0xf0
> > > [] ? fsnotify_destroy_marks+0x64/0x80
> > > [] dentry_unlink_inode+0xa9/0xe0
> > > [] d_delete+0xa6/0xb0
> > > [] vfs_unlink+0x138/0x140
> > > [] nfsd_unlink+0x165/0x200 [nfsd]
> > > [] ? lru_put_end+0x5c/0x70 [nfsd]
> > > [] nfsd3_proc_remove+0x83/0x120 [nfsd]
> > > [] nfsd_dispatch+0xdc/0x210 [nfsd]
> > > [] svc_process_common+0x311/0x620 [sunrpc]
> > > [] ? nfsd_set_nrthreads+0x1b0/0x1b0 [nfsd]
> > > [] svc_process+0x128/0x1b0 [sunrpc]
> > > [] nfsd+0xf3/0x160 [nfsd]
> > > [] kthread+0xcc/0xf0
> > > [] ? schedule_tail+0x1e/0xc0
> > > [] ? kthread_freezable_should_stop+0x70/0x70
> > > [] ret_from_fork+0x3f/0x70
> > > [] ? kthread_freezable_should_stop+0x70/0x70
> > > ---[ end trace 39afe9aeef2cfb34 ]---
> > > [ cut here ]
> > 
> > --
> > Chuck Lever
> > 
> > 
> 
> 


Re: [PATCH 15/15] i40iw: changes for build of i40iw module

2015-12-16 Thread Christoph Hellwig
> --- a/include/uapi/rdma/rdma_netlink.h
> +++ b/include/uapi/rdma/rdma_netlink.h
> @@ -5,6 +5,7 @@
>  
>  enum {
>   RDMA_NL_RDMA_CM = 1,
> + RDMA_NL_I40IW,
>   RDMA_NL_NES,
>   RDMA_NL_C4IW,
>   RDMA_NL_LS, /* RDMA Local Services */

This changes the values of the existing RDMA_NL_NES, RDMA_NL_C4IW and
RDMA_NL_LS symbols.  Please add your new value at the end.  And it
should probably be a separate patch, as it's not related to the build
system and is referenced by the earlier patches.


Re: [PATCH 01/15] i40e: Add support for client interface for IWARP driver

2015-12-16 Thread Joe Perches
On Wed, 2015-12-16 at 13:58 -0600, Faisal Latif wrote:
> From: Anjali Singhai Jain 
> 
> This patch adds a Client interface for i40iw driver
> support. Also expands the Virtchannel to support messages
> from i40evf driver on behalf of i40iwvf driver.
[]
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_client.c 
> b/drivers/net/ethernet/intel/i40e/i40e_client.c
[]
> + * Contact Information:
> + * e1000-devel Mailing List 

trivia:

This should probably be: intel-wired-...@lists.osuosl.org



Re: [PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-16 Thread Doug Ledford
On 12/16/2015 01:56 AM, Moni Shoua wrote:
>> The part that bothers me about this is that this statement makes sense
>> when just thinking about the spec, as you say.  However, once you
>> consider namespaces, security implications make this statement spec
>> compliant, but still unacceptable.  The spec itself is silent on
>> namespaces.  But, you guys wanted, and you got, namespace support.
>> Since that's beyond spec, and carries security requirements, I think
>> it's fair to say that from now on, the Linux kernel RDMA stack can no
>> longer *just* be spec compliant.  There are additional concerns that
>> must always be addressed with new changes, and those are the namespace
>> constraint preservation concerns.
> 
> I can't object to that but I really would like to get an example of a
> security risk.

*This* is exactly the conversation to be having right now.  The
namespace support has been added to the core, and so now we need to
define exactly what the impact of that is for new feature submissions
like this one.  More on that below...

> So far, besides hearing that the way we choose to handle completions
> is wrong, I didn't get a convincing example of how or where it doesn't
> work.

Work is too fuzzy of a word to use here.  It could mean "applications
keep running", but that could be contrary to the namespace restrictions
as it may be that the application should *not* have continued to run
when namespace considerations were taken into account.

> Moreover, regarding security, all we wanted is for HW to report
> the L3 protocol (IB, IPv4, or IPv6) in the packet. This is data that
> with some extra CPU cycles can be obtained from the 40 bytes that are
> scattered to the receive bufs anyway. So, if there is a security hole
> it exists from day one of the IB stack and this is not the time we
> should insist on fixing it.

No, not true.  You are implementing RoCEv2 support, which is an entirely
new feature.  So this feature can't have had a security hole since
forever as it has never been in the kernel before now.  The objections
are arising because of the ordering of events.  Specifically, we added
the core namespace support (even though it isn't complete, so far it's
the infrastructure ready for various upper portions of the stack to
start using, but it isn't a complete stack wide solution yet) first, and
so this new feature, which will need to be a part of that namespace
infrastructure that other parts of the IB stack can use, should have its
namespace support already enabled (ideally, but if it didn't, it should
at least have a clear plan for how to enable it in the future).  Jason's
objection is based upon this premise and the fact that a technical
review of the code makes it look like the core namespace infrastructure
becomes less complete, not more, with the inclusion of these patches.

As I understand it, prior to these patches there would always be a 1:1
mapping of GID to gid_index because you would never have duplicate GIDs
in the GID table.  That allowed an easy, definitive 1:1 mapping of GID
to namespace via the existing infrastructure for any received packet [1].

These patches add the concept of duplicate GIDs that are differentiated
by their RoCE version (also called network type).  So, now, an incoming
packet could match a couple different gid_indexes and we need additional
information to get back to the definitive 1:1 mapping.  The submitted
patches are designed around a lazy resolution of the namespace,
preferring to defer the work of mapping the incoming packet to a unique
namespace until that information is actually needed.  To enable this
lazy resolution, it provides the network_type so that the resolution can
be done.

This is a fair assessment of the current state of things and what these
patches do, yes?

Jason's objections are this:

1)  The lazy resolution is wrong.
2)  The use of network_type as the additional information to get to the
unique namespace is vendor specific cruft that shouldn't be part of the
core kernel API.

Jason's preference would be that the above issues be resolved by
skipping the lazy resolution and instead doing proactive resolution on
receipt of a packet and then probably just pass the namespace around
instead of passing around the information needed to resolve the
namespace.  Or, at a minimum, at least make the information added to the
core API not something vendor specific like network_type, which is a
detail of the Mellanox implementation.

Jason, is this accurate for your position?

If everyone agrees that this is a fair statement of where we stand, then
I'll continue my response.  If not, please correct anything I have wrong
above and I'll take that into my continued response.

1 - Actually, for any received packet with associated IP address
information.  We've only enabled net namespaces for IP connections
between user space applications, for direct IB connections or for kernel
connections there is not yet any namespace support.

-- 
Doug Ledford 

Re: [PATCH 15/15] i40iw: changes for build of i40iw module

2015-12-16 Thread kbuild test robot
Hi Faisal,

[auto build test WARNING on net/master]
[also build test WARNING on v4.4-rc5 next-20151216]
[cannot apply to net-next/master]

url:
https://github.com/0day-ci/linux/commits/Faisal-Latif/add-Intel-R-X722-iWARP-driver/20151217-040340
config: arm-allyesconfig (attached as .config)
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm

All warnings (new ones prefixed by >>):

   In file included from include/linux/byteorder/big_endian.h:4:0,
from arch/arm/include/uapi/asm/byteorder.h:19,
from include/asm-generic/bitops/le.h:5,
from arch/arm/include/asm/bitops.h:340,
from include/linux/bitops.h:36,
from include/linux/kernel.h:10,
from include/linux/skbuff.h:17,
from include/linux/ip.h:20,
from drivers/infiniband/hw/i40iw/i40iw_cm.c:36:
   drivers/infiniband/hw/i40iw/i40iw_cm.c: In function 'i40iw_init_tcp_ctx':
   include/uapi/linux/byteorder/big_endian.h:32:26: warning: large integer implicitly truncated to unsigned type [-Woverflow]
    #define __cpu_to_le32(x) ((__force __le32)__swab32((x)))
                             ^
   include/linux/byteorder/generic.h:87:21: note: in expansion of macro '__cpu_to_le32'
    #define cpu_to_le32 __cpu_to_le32
                        ^
>> drivers/infiniband/hw/i40iw/i40iw_cm.c:3513:18: note: in expansion of macro 'cpu_to_le32'
     tcp_info->ttl = cpu_to_le32(I40IW_DEFAULT_TTL);
                     ^

vim +/cpu_to_le32 +3513 drivers/infiniband/hw/i40iw/i40iw_cm.c

2d207efd Faisal Latif 2015-12-16  3497   * i40iw_init_tcp_ctx - setup qp context
2d207efd Faisal Latif 2015-12-16  3498   * @cm_node: connection's node
2d207efd Faisal Latif 2015-12-16  3499   * @tcp_info: offload info for tcp
2d207efd Faisal Latif 2015-12-16  3500   * @iwqp: associate qp for the connection
2d207efd Faisal Latif 2015-12-16  3501   */
2d207efd Faisal Latif 2015-12-16  3502  static void i40iw_init_tcp_ctx(struct i40iw_cm_node *cm_node,
2d207efd Faisal Latif 2015-12-16  3503                                 struct i40iw_tcp_offload_info *tcp_info,
2d207efd Faisal Latif 2015-12-16  3504                                 struct i40iw_qp *iwqp)
2d207efd Faisal Latif 2015-12-16  3505  {
2d207efd Faisal Latif 2015-12-16  3506          tcp_info->ipv4 = cm_node->ipv4;
2d207efd Faisal Latif 2015-12-16  3507          tcp_info->drop_ooo_seg = true;
2d207efd Faisal Latif 2015-12-16  3508          tcp_info->wscale = true;
2d207efd Faisal Latif 2015-12-16  3509          tcp_info->ignore_tcp_opt = true;
2d207efd Faisal Latif 2015-12-16  3510          tcp_info->ignore_tcp_uns_opt = true;
2d207efd Faisal Latif 2015-12-16  3511          tcp_info->no_nagle = false;
2d207efd Faisal Latif 2015-12-16  3512  
2d207efd Faisal Latif 2015-12-16 @3513          tcp_info->ttl = cpu_to_le32(I40IW_DEFAULT_TTL);
2d207efd Faisal Latif 2015-12-16  3514          tcp_info->rtt_var = cpu_to_le32(I40IW_DEFAULT_RTT_VAR);
2d207efd Faisal Latif 2015-12-16  3515          tcp_info->ss_thresh = cpu_to_le32(I40IW_DEFAULT_SS_THRESH);
2d207efd Faisal Latif 2015-12-16  3516          tcp_info->rexmit_thresh = I40IW_DEFAULT_REXMIT_THRESH;
2d207efd Faisal Latif 2015-12-16  3517  
2d207efd Faisal Latif 2015-12-16  3518          tcp_info->tcp_state = I40IW_TCP_STATE_ESTABLISHED;
2d207efd Faisal Latif 2015-12-16  3519          tcp_info->snd_wscale = cm_node->tcp_cntxt.snd_wscale;
2d207efd Faisal Latif 2015-12-16  3520          tcp_info->rcv_wscale = cm_node->tcp_cntxt.rcv_wscale;
2d207efd Faisal Latif 2015-12-16  3521  

:: The code at line 3513 was first introduced by commit
:: 2d207efd7fd9e5a190b2ebd6f077139412b0343f i40iw: add connection management code

:: TO: Faisal Latif 
:: CC: 0day robot 

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH 02/15] i40iw: add main, hdr, status

2015-12-16 Thread Joe Perches
On Wed, 2015-12-16 at 13:58 -0600, Faisal Latif wrote:
> i40iw_main.c contains routines for i40e <=> i40iw interface and setup.
> i40iw.h is header file for main device data structures.
> i40iw_status.h is for return status codes.
[]
> diff --git a/drivers/infiniband/hw/i40iw/i40iw.h 
> b/drivers/infiniband/hw/i40iw/i40iw.h
[]
> +#define i40iw_pr_err(fmt, args ...) pr_err("%s: error " fmt, __func__, ## 
> args)
> +
> +#define i40iw_pr_info(fmt, args ...) pr_info("%s: " fmt, __func__, ## args)
> +
> +#define i40iw_pr_warn(fmt, args ...) pr_warn("%s: " fmt, __func__, ## args)

Using "error " in the output doesn't really add much
as there's already a KERN_ERR with the output.

Using __func__ hardly adds anything.

Using netdev_ is generally preferred

> +
> +struct i40iw_cqp_request {
> + struct cqp_commands_info info;
> + wait_queue_head_t waitq;
> + struct list_head list;
> + atomic_t refcount;
> + void (*callback_fcn)(struct i40iw_cqp_request*, u32);
> + void *param;
> + struct i40iw_cqp_compl_info compl_info;
> + u8 waiting:1;
> + u8 request_done:1;
> + u8 dynamic:1;
> + u8 polling:1;

These bitfields might be better as bool



[PATCH 10/15] i40iw: add hardware related header files

2015-12-16 Thread Faisal Latif
header files for hardware accesses

Signed-off-by: Faisal Latif 
---
 drivers/infiniband/hw/i40iw/i40iw_d.h| 1713 ++
 drivers/infiniband/hw/i40iw/i40iw_p.h|  106 ++
 drivers/infiniband/hw/i40iw/i40iw_type.h | 1308 +++
 3 files changed, 3127 insertions(+)
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_d.h
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_p.h
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_type.h

diff --git a/drivers/infiniband/hw/i40iw/i40iw_d.h 
b/drivers/infiniband/hw/i40iw/i40iw_d.h
new file mode 100644
index 000..f6668d7
--- /dev/null
+++ b/drivers/infiniband/hw/i40iw/i40iw_d.h
@@ -0,0 +1,1713 @@
+/***
+*
+* Copyright (c) 2015 Intel Corporation.  All rights reserved.
+*
+* This software is available to you under a choice of one of two
+* licenses.  You may choose to be licensed under the terms of the GNU
+* General Public License (GPL) Version 2, available from the file
+* COPYING in the main directory of this source tree, or the
+* OpenFabrics.org BSD license below:
+*
+*   Redistribution and use in source and binary forms, with or
+*   without modification, are permitted provided that the following
+*   conditions are met:
+*
+*- Redistributions of source code must retain the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer.
+*
+*- Redistributions in binary form must reproduce the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer in the documentation and/or other materials
+*  provided with the distribution.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+* SOFTWARE.
+*
+***/
+
+#ifndef I40IW_D_H
+#define I40IW_D_H
+
+#define I40IW_DB_ADDR_OFFSET(4 * 1024 * 1024 - 64 * 1024)
+#define I40IW_VF_DB_ADDR_OFFSET (64 * 1024)
+
+#define I40IW_PUSH_OFFSET   (4 * 1024 * 1024)
+#define I40IW_PF_FIRST_PUSH_PAGE_INDEX 16
+#define I40IW_VF_PUSH_OFFSET((8 + 64) * 1024)
+#define I40IW_VF_FIRST_PUSH_PAGE_INDEX 2
+
+#define I40IW_PE_DB_SIZE_4M 1
+#define I40IW_PE_DB_SIZE_8M 2
+
+#define I40IW_DDP_VER 1
+#define I40IW_RDMAP_VER 1
+
+#define I40IW_RDMA_MODE_RDMAC 0
+#define I40IW_RDMA_MODE_IETF  1
+
+#define I40IW_QP_STATE_INVALID 0
+#define I40IW_QP_STATE_IDLE 1
+#define I40IW_QP_STATE_RTS 2
+#define I40IW_QP_STATE_CLOSING 3
+#define I40IW_QP_STATE_RESERVED 4
+#define I40IW_QP_STATE_TERMINATE 5
+#define I40IW_QP_STATE_ERROR 6
+
+#define I40IW_STAG_STATE_INVALID 0
+#define I40IW_STAG_STATE_VALID 1
+
+#define I40IW_STAG_TYPE_SHARED 0
+#define I40IW_STAG_TYPE_NONSHARED 1
+
+#define I40IW_MAX_USER_PRIORITY 8
+
+#define LS_64_1(val, bits)  ((u64)(uintptr_t)val << bits)
+#define RS_64_1(val, bits)  ((u64)(uintptr_t)val >> bits)
+#define LS_32_1(val, bits)  (u32)(val << bits)
+#define RS_32_1(val, bits)  (u32)(val >> bits)
+#define I40E_HI_DWORD(x)((u32)((((x) >> 16) >> 16) & 0xFFFF))
+
+#define LS_64(val, field) (((u64)val << field ## _SHIFT) & (field ## _MASK))
+
+#define RS_64(val, field) ((u64)(u64)(val & field ## _MASK) >> field ## _SHIFT)
+#define LS_32(val, field) ((val << field ## _SHIFT) & (field ## _MASK))
+#define RS_32(val, field) ((val & field ## _MASK) >> field ## _SHIFT)
+
+#define TERM_DDP_LEN_TAGGED 14
+#define TERM_DDP_LEN_UNTAGGED   18
+#define TERM_RDMA_LEN   28
+#define RDMA_OPCODE_MASK0x0f
+#define RDMA_READ_REQ_OPCODE1
+#define Q2_BAD_FRAME_OFFSET 72
+#define CQE_MAJOR_DRV   0x8000
+
+#define I40IW_TERM_SENT 0x01
+#define I40IW_TERM_RCVD 0x02
+#define I40IW_TERM_DONE 0x04
+#define I40IW_MAC_HLEN  14
+
+#define I40IW_INVALID_WQE_INDEX 0x
+
+#define I40IW_CQP_WAIT_POLL_REGS 1
+#define I40IW_CQP_WAIT_POLL_CQ 2
+#define I40IW_CQP_WAIT_EVENT 3
+
+#define I40IW_CQP_INIT_WQE(wqe) memset(wqe, 0, 64)
+
+#define I40IW_GET_CURRENT_CQ_ELEMENT(_cq) \
+   ( \
+   &((_cq)->cq_base[I40IW_RING_GETCURRENT_HEAD((_cq)->cq_ring)])  \
+   )
+#define I40IW_GET_CURRENT_EXTENDED_CQ_ELEMENT(_cq) \
+   ( \
+   &(((struct i40iw_extended_cqe *)\
+  
((_cq)->cq_base))[I40IW_RING_GETCURRENT_HEAD((_cq)->cq_ring)]) \
+   )
+
+#define I40IW_GET_CURRENT_AEQ_ELEMENT(_aeq) \
+   ( \
+   &_aeq->aeqe_base[I40IW_RING_GETCURRENT_TAIL(_aeq->aeq_ring)]   \
+   )
+
+#define I40IW_GET_CURRENT_CEQ_ELEMENT(_ceq) \
+  

[PATCH 06/15] i40iw: add hmc resource files

2015-12-16 Thread Faisal Latif
i40iw_hmc.[ch] are to manage hmc for the device.

Acked-by: Anjali Singhai Jain 
Acked-by: Shannon Nelson 
Signed-off-by: Faisal Latif 
---
 drivers/infiniband/hw/i40iw/i40iw_hmc.c | 823 
 drivers/infiniband/hw/i40iw/i40iw_hmc.h | 241 ++
 2 files changed, 1064 insertions(+)
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_hmc.c
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_hmc.h

diff --git a/drivers/infiniband/hw/i40iw/i40iw_hmc.c 
b/drivers/infiniband/hw/i40iw/i40iw_hmc.c
new file mode 100644
index 000..f4f4055
--- /dev/null
+++ b/drivers/infiniband/hw/i40iw/i40iw_hmc.c
@@ -0,0 +1,823 @@
+/***
+*
+* Copyright (c) 2015 Intel Corporation.  All rights reserved.
+*
+* This software is available to you under a choice of one of two
+* licenses.  You may choose to be licensed under the terms of the GNU
+* General Public License (GPL) Version 2, available from the file
+* COPYING in the main directory of this source tree, or the
+* OpenFabrics.org BSD license below:
+*
+*   Redistribution and use in source and binary forms, with or
+*   without modification, are permitted provided that the following
+*   conditions are met:
+*
+*- Redistributions of source code must retain the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer.
+*
+*- Redistributions in binary form must reproduce the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer in the documentation and/or other materials
+*  provided with the distribution.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+* SOFTWARE.
+*
+***/
+
+#include "i40iw_osdep.h"
+#include "i40iw_register.h"
+#include "i40iw_status.h"
+#include "i40iw_hmc.h"
+#include "i40iw_d.h"
+#include "i40iw_type.h"
+#include "i40iw_p.h"
+#include "i40iw_vf.h"
+#include "i40iw_virtchnl.h"
+
+/**
+ * i40iw_find_sd_index_limit - finds segment descriptor index limit
+ * @hmc_info: pointer to the HMC configuration information structure
+ * @type: type of HMC resources we're searching
+ * @index: starting index for the object
+ * @cnt: number of objects we're trying to create
+ * @sd_idx: pointer to return index of the segment descriptor in question
+ * @sd_limit: pointer to return the maximum number of segment descriptors
+ *
+ * This function calculates the segment descriptor index and index limit
+ * for the resource defined by i40iw_hmc_rsrc_type.
+ */
+
+static inline void i40iw_find_sd_index_limit(struct i40iw_hmc_info *hmc_info,
+u32 type,
+u32 idx,
+u32 cnt,
+u32 *sd_idx,
+u32 *sd_limit)
+{
+   u64 fpm_addr, fpm_limit;
+
+   fpm_addr = hmc_info->hmc_obj[(type)].base +
+   hmc_info->hmc_obj[type].size * idx;
+   fpm_limit = fpm_addr + hmc_info->hmc_obj[type].size * cnt;
+   *sd_idx = (u32)(fpm_addr / I40IW_HMC_DIRECT_BP_SIZE);
+   *sd_limit = (u32)((fpm_limit - 1) / I40IW_HMC_DIRECT_BP_SIZE);
+   *sd_limit += 1;
+}
+
+/**
+ * i40iw_find_pd_index_limit - finds page descriptor index limit
+ * @hmc_info: pointer to the HMC configuration information struct
+ * @type: HMC resource type we're examining
+ * @idx: starting index for the object
+ * @cnt: number of objects we're trying to create
+ * @pd_index: pointer to return page descriptor index
+ * @pd_limit: pointer to return page descriptor index limit
+ *
+ * Calculates the page descriptor index and index limit for the resource
+ * defined by i40iw_hmc_rsrc_type.
+ */
+
+static inline void i40iw_find_pd_index_limit(struct i40iw_hmc_info *hmc_info,
+u32 type,
+u32 idx,
+u32 cnt,
+u32 *pd_idx,
+u32 *pd_limit)
+{
+   u64 fpm_adr, fpm_limit;
+
+   fpm_adr = hmc_info->hmc_obj[type].base +
+   hmc_info->hmc_obj[type].size * idx;
+   fpm_limit = fpm_adr + (hmc_info)->hmc_obj[(type)].size * (cnt);
+   *(pd_idx) = (u32)(fpm_adr / I40IW_HMC_PAGED_BP_SIZE);
+   *(pd_limit) = (u32)((fpm_limit - 1) / I40IW_HMC_PAGED_BP_SIZE);

[PATCH 11/15] i40iw: add X722 register file

2015-12-16 Thread Faisal Latif
X722 Hardware registers defines for iWARP component.

Acked-by: Anjali Singhai Jain 
Acked-by: Shannon Nelson 
Signed-off-by: Faisal Latif 
---
 drivers/infiniband/hw/i40iw/i40iw_register.h | 1027 ++
 1 file changed, 1027 insertions(+)
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_register.h

diff --git a/drivers/infiniband/hw/i40iw/i40iw_register.h 
b/drivers/infiniband/hw/i40iw/i40iw_register.h
new file mode 100644
index 000..01da7c5
--- /dev/null
+++ b/drivers/infiniband/hw/i40iw/i40iw_register.h
@@ -0,0 +1,1027 @@
+/***
+*
+* Copyright (c) 2015 Intel Corporation.  All rights reserved.
+*
+* This software is available to you under a choice of one of two
+* licenses.  You may choose to be licensed under the terms of the GNU
+* General Public License (GPL) Version 2, available from the file
+* COPYING in the main directory of this source tree, or the
+* OpenFabrics.org BSD license below:
+*
+*   Redistribution and use in source and binary forms, with or
+*   without modification, are permitted provided that the following
+*   conditions are met:
+*
+*- Redistributions of source code must retain the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer.
+*
+*- Redistributions in binary form must reproduce the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer in the documentation and/or other materials
+*  provided with the distribution.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+* SOFTWARE.
+*
+***/
+
+#ifndef I40IW_REGISTER_H
+#define I40IW_REGISTER_H
+
+#define I40E_GLGEN_STAT   0x000B612C /* Reset: POR */
+
+#define I40E_PFHMC_PDINV   0x000C0300 /* Reset: PFR */
+#define I40E_PFHMC_PDINV_PMSDIDX_SHIFT 0
+#define I40E_PFHMC_PDINV_PMSDIDX_MASK  I40E_MASK(0xFFF, I40E_PFHMC_PDINV_PMSDIDX_SHIFT)
+#define I40E_PFHMC_PDINV_PMPDIDX_SHIFT 16
+#define I40E_PFHMC_PDINV_PMPDIDX_MASK  I40E_MASK(0x1FF, I40E_PFHMC_PDINV_PMPDIDX_SHIFT)
+#define I40E_PFHMC_SDCMD_PMSDWR_SHIFT  31
+#define I40E_PFHMC_SDCMD_PMSDWR_MASK   I40E_MASK(0x1, I40E_PFHMC_SDCMD_PMSDWR_SHIFT)
+#define I40E_PFHMC_SDDATALOW_PMSDVALID_SHIFT   0
+#define I40E_PFHMC_SDDATALOW_PMSDVALID_MASKI40E_MASK(0x1, I40E_PFHMC_SDDATALOW_PMSDVALID_SHIFT)
+#define I40E_PFHMC_SDDATALOW_PMSDTYPE_SHIFT1
+#define I40E_PFHMC_SDDATALOW_PMSDTYPE_MASK I40E_MASK(0x1, I40E_PFHMC_SDDATALOW_PMSDTYPE_SHIFT)
+#define I40E_PFHMC_SDDATALOW_PMSDBPCOUNT_SHIFT 2
+#define I40E_PFHMC_SDDATALOW_PMSDBPCOUNT_MASK  I40E_MASK(0x3FF, I40E_PFHMC_SDDATALOW_PMSDBPCOUNT_SHIFT)
+
+#define I40E_PFINT_DYN_CTLN(_INTPF) (0x00034800 + ((_INTPF) * 4)) /* _i=0...511 */ /* Reset: PFR */
+#define I40E_PFINT_DYN_CTLN_INTENA_SHIFT  0
+#define I40E_PFINT_DYN_CTLN_INTENA_MASK   I40E_MASK(0x1, I40E_PFINT_DYN_CTLN_INTENA_SHIFT)
+#define I40E_PFINT_DYN_CTLN_CLEARPBA_SHIFT1
+#define I40E_PFINT_DYN_CTLN_CLEARPBA_MASK I40E_MASK(0x1, I40E_PFINT_DYN_CTLN_CLEARPBA_SHIFT)
+#define I40E_PFINT_DYN_CTLN_ITR_INDX_SHIFT3
+#define I40E_PFINT_DYN_CTLN_ITR_INDX_MASK I40E_MASK(0x3, I40E_PFINT_DYN_CTLN_ITR_INDX_SHIFT)
+
+#define I40E_VFINT_DYN_CTLN1(_INTVF)   (0x3800 + ((_INTVF) * 4)) /* _i=0...15 */ /* Reset: VFR */
+#define I40E_GLHMC_VFPDINV(_i)   (0x000C8300 + ((_i) * 4)) /* _i=0...31 */ /* Reset: CORER */
+
+#define I40E_PFHMC_PDINV_PMSDPARTSEL_SHIFT 15
+#define I40E_PFHMC_PDINV_PMSDPARTSEL_MASK  I40E_MASK(0x1, I40E_PFHMC_PDINV_PMSDPARTSEL_SHIFT)
+#define I40E_GLPCI_LBARCTRL0x000BE484 /* Reset: POR */
+#define I40E_GLPCI_LBARCTRL_PE_DB_SIZE_SHIFT4
+#define I40E_GLPCI_LBARCTRL_PE_DB_SIZE_MASK I40E_MASK(0x3, I40E_GLPCI_LBARCTRL_PE_DB_SIZE_SHIFT)
+
+#define I40E_PFPE_AEQALLOC   0x00131180 /* Reset: PFR */
+#define I40E_PFPE_AEQALLOC_AECOUNT_SHIFT 0
+#define I40E_PFPE_AEQALLOC_AECOUNT_MASK  I40E_MASK(0x, I40E_PFPE_AEQALLOC_AECOUNT_SHIFT)
+#define I40E_PFPE_CCQPHIGH  0x8200 /* Reset: PFR */
+#define I40E_PFPE_CCQPHIGH_PECCQPHIGH_SHIFT 0
+#define I40E_PFPE_CCQPHIGH_PECCQPHIGH_MASK  I40E_MASK(0x, I40E_PFPE_CCQPHIGH_PECCQPHIGH_SHIFT)
+#define I40E_PFPE_CCQPLOW 0x8180 /* Reset: PFR */
+#define I40E_PFPE_CCQPLOW_PECCQPLOW_SHIFT 0
+#define I40E_PFPE_CCQPLOW_PECCQPLOW_MASK  I40E_MASK(0x, I40E_PFPE_CCQPLOW

[PATCH 03/15] i40iw: add connection management code

2015-12-16 Thread Faisal Latif
i40iw_cm.c and i40iw_cm.h are used for connection management.

Acked-by: Anjali Singhai Jain 
Acked-by: Shannon Nelson 
Signed-off-by: Faisal Latif 
---
 drivers/infiniband/hw/i40iw/i40iw_cm.c | 4447 
 drivers/infiniband/hw/i40iw/i40iw_cm.h |  456 
 2 files changed, 4903 insertions(+)
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_cm.c
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_cm.h

diff --git a/drivers/infiniband/hw/i40iw/i40iw_cm.c b/drivers/infiniband/hw/i40iw/i40iw_cm.c
new file mode 100644
index 000..aa6263f
--- /dev/null
+++ b/drivers/infiniband/hw/i40iw/i40iw_cm.c
@@ -0,0 +1,4447 @@
+/***
+*
+* Copyright (c) 2015 Intel Corporation.  All rights reserved.
+*
+* This software is available to you under a choice of one of two
+* licenses.  You may choose to be licensed under the terms of the GNU
+* General Public License (GPL) Version 2, available from the file
+* COPYING in the main directory of this source tree, or the
+* OpenFabrics.org BSD license below:
+*
+*   Redistribution and use in source and binary forms, with or
+*   without modification, are permitted provided that the following
+*   conditions are met:
+*
+*- Redistributions of source code must retain the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer.
+*
+*- Redistributions in binary form must reproduce the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer in the documentation and/or other materials
+*  provided with the distribution.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+* SOFTWARE.
+*
+***/
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "i40iw.h"
+
+static void i40iw_rem_ref_cm_node(struct i40iw_cm_node *);
+static void i40iw_cm_post_event(struct i40iw_cm_event *event);
+static void i40iw_disconnect_worker(struct work_struct *work);
+
+/**
+ * i40iw_free_sqbuf - put back puda buffer if refcount is 0
+ * @dev: FPK device
+ * @bufp: puda buffer to free
+ */
+void i40iw_free_sqbuf(struct i40iw_sc_dev *dev, void *bufp)
+{
+   struct i40iw_puda_buf *buf = (struct i40iw_puda_buf *)bufp;
+   struct i40iw_puda_rsrc *ilq = dev->ilq;
+
+   if (!atomic_dec_return(&buf->refcount))
+   i40iw_puda_ret_bufpool(ilq, buf);
+}
+
+/**
+ * i40iw_derive_hw_ird_setting - Calculate IRD
+ *
+ * @cm_ird: IRD of connection's node
+ *
+ * The ird from the connection is rounded to a supported HW
+ * setting (2,8,32,64) and then encoded for ird_size field of
+ * qp_ctx
+ */
+static u8 i40iw_derive_hw_ird_setting(u16 cm_ird)
+{
+   u8 encoded_ird_size = 0;
+   u8 pof2_cm_ird = 1;
+
+   /* round-off to next powerof2 */
+   while (pof2_cm_ird < cm_ird)
+   pof2_cm_ird *= 2;
+
+   /* ird_size field is encoded in qp_ctx */
+   switch (pof2_cm_ird) {
+   case I40IW_HW_IRD_SETTING_64:
+   encoded_ird_size = 3;
+   break;
+   case I40IW_HW_IRD_SETTING_32:
+   case I40IW_HW_IRD_SETTING_16:
+   encoded_ird_size = 2;
+   break;
+   case I40IW_HW_IRD_SETTING_8:
+   case I40IW_HW_IRD_SETTING_4:
+   encoded_ird_size = 1;
+   break;
+   case I40IW_HW_IRD_SETTING_2:
+   default:
+   encoded_ird_size = 0;
+   break;
+   }
+   return encoded_ird_size;
+}
+
+/**
+ * i40iw_record_ird_ord - Record IRD/ORD passed in
+ * @cm_node: connection's node
+ * @conn_ird: connection IRD
+ * @conn_ord: connection ORD
+ */
+static void i40iw_record_ird_ord(struct i40iw_cm_node *cm_node, u16 conn_ird, u16 conn_ord)
+{
+   if (conn_ird > I40IW_MAX_IRD_SIZE)
+   conn_ird = I40IW_MAX_IRD_SIZE;
+
+   if (conn_ord > I40IW_MAX_ORD_SIZE)
+   conn_ord = I40IW_MAX_ORD_SIZE;
+
+   cm_node->ird_size = conn_ird;
+   cm_node->ord_size = conn_ord;
+}
+
+/**
+ * i40iw_copy_ip_ntohl - change network to host ip
+ * @dst: host ip
+ * @src: big endian
+ */
+void i40iw_copy_ip_ntohl(u32 *dst, __be32 *src)
+{
+   *dst++ = ntohl(*src++);
+   *dst++ = ntohl(*src++);
+   *dst++ = ntohl(*src++);

[PATCH 05/15] i40iw: add pble resource files

2015-12-16 Thread Faisal Latif
i40iw_pble.[ch] manage the pble resource for iwarp clients.

Acked-by: Anjali Singhai Jain 
Acked-by: Shannon Nelson 
Signed-off-by: Faisal Latif 
---
 drivers/infiniband/hw/i40iw/i40iw_pble.c | 618 +++
 drivers/infiniband/hw/i40iw/i40iw_pble.h | 131 +++
 2 files changed, 749 insertions(+)
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_pble.c
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_pble.h

diff --git a/drivers/infiniband/hw/i40iw/i40iw_pble.c b/drivers/infiniband/hw/i40iw/i40iw_pble.c
new file mode 100644
index 000..217997e
--- /dev/null
+++ b/drivers/infiniband/hw/i40iw/i40iw_pble.c
@@ -0,0 +1,618 @@
+/***
+*
+* Copyright (c) 2015 Intel Corporation.  All rights reserved.
+*
+* This software is available to you under a choice of one of two
+* licenses.  You may choose to be licensed under the terms of the GNU
+* General Public License (GPL) Version 2, available from the file
+* COPYING in the main directory of this source tree, or the
+* OpenFabrics.org BSD license below:
+*
+*   Redistribution and use in source and binary forms, with or
+*   without modification, are permitted provided that the following
+*   conditions are met:
+*
+*- Redistributions of source code must retain the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer.
+*
+*- Redistributions in binary form must reproduce the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer in the documentation and/or other materials
+*  provided with the distribution.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+* SOFTWARE.
+*
+***/
+
+#include "i40iw_status.h"
+#include "i40iw_osdep.h"
+#include "i40iw_register.h"
+#include "i40iw_hmc.h"
+
+#include "i40iw_d.h"
+#include "i40iw_type.h"
+#include "i40iw_p.h"
+
+#include 
+#include 
+#include 
+#include "i40iw_pble.h"
+#include "i40iw.h"
+
+struct i40iw_device;
+static enum i40iw_status_code add_pble_pool(struct i40iw_sc_dev *dev,
+   struct i40iw_hmc_pble_rsrc *pble_rsrc);
+static void i40iw_free_vmalloc_mem(struct i40iw_hw *hw, struct i40iw_chunk *chunk);
+
+/**
+ * i40iw_destroy_pble_pool - destroy pool during module unload
+ * @dev: i40iw_sc_dev struct
+ * @pble_rsrc: pble resources
+ */
+void i40iw_destroy_pble_pool(struct i40iw_sc_dev *dev, struct i40iw_hmc_pble_rsrc *pble_rsrc)
+{
+   struct list_head *clist;
+   struct list_head *tlist;
+   struct i40iw_chunk *chunk;
+   struct i40iw_pble_pool *pinfo = &pble_rsrc->pinfo;
+
+   if (pinfo->pool) {
+   list_for_each_safe(clist, tlist, &pinfo->clist) {
+   chunk = list_entry(clist, struct i40iw_chunk, list);
+   if (chunk->type == I40IW_VMALLOC)
+   i40iw_free_vmalloc_mem(dev->hw, chunk);
+   kfree(chunk);
+   }
+   gen_pool_destroy(pinfo->pool);
+   }
+}
+
+/**
+ * i40iw_hmc_init_pble - Initialize pble resources during module load
+ * @dev: i40iw_sc_dev struct
+ * @pble_rsrc: pble resources
+ */
+enum i40iw_status_code i40iw_hmc_init_pble(struct i40iw_sc_dev *dev,
+  struct i40iw_hmc_pble_rsrc *pble_rsrc)
+{
+   struct i40iw_hmc_info *hmc_info;
+   u32 fpm_idx = 0;
+
+   hmc_info = dev->hmc_info;
+   pble_rsrc->fpm_base_addr = hmc_info->hmc_obj[I40IW_HMC_IW_PBLE].base;
+   /* Now start the pble' on 4k boundary */
+   if (pble_rsrc->fpm_base_addr & 0xfff)
+   fpm_idx = (PAGE_SIZE - (pble_rsrc->fpm_base_addr & 0xfff)) >> 3;
+
+   pble_rsrc->unallocated_pble =
+   hmc_info->hmc_obj[I40IW_HMC_IW_PBLE].cnt - fpm_idx;
+   pble_rsrc->next_fpm_addr = pble_rsrc->fpm_base_addr + (fpm_idx << 3);
+
+   pble_rsrc->pinfo.pool_shift = POOL_SHIFT;
+   pble_rsrc->pinfo.pool = gen_pool_create(pble_rsrc->pinfo.pool_shift, -1);
+   INIT_LIST_HEAD(&pble_rsrc->pinfo.clist);
+   if (!pble_rsrc->pinfo.pool)
+   goto error;
+
+   if (add_pble_pool(dev, pble_rsrc))
+   goto error;
+
+   return 0;
+
+ error:
+   i40iw_destroy_pble_pool(dev, pble_rsrc);
+   return I40IW_ERR_NO_MEMORY;
+}
+
+/**
+ * get_sd_pd_idx - Returns sd index, pd index and rel_pd_idx from fpm address
+ * @pble_rsrc: structure containing fpm address
+ * @idx: where to return indexes

[PATCH 12/15] i40iw: user kernel shared files

2015-12-16 Thread Faisal Latif
i40iw_user.h and i40iw_uk.c are used by both the user library and kernel requests.

Acked-by: Anjali Singhai Jain 
Acked-by: Shannon Nelson 
Signed-off-by: Faisal Latif 
---
 drivers/infiniband/hw/i40iw/i40iw_uk.c   | 1213 ++
 drivers/infiniband/hw/i40iw/i40iw_user.h |  438 +++
 2 files changed, 1651 insertions(+)
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_uk.c
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_user.h

diff --git a/drivers/infiniband/hw/i40iw/i40iw_uk.c b/drivers/infiniband/hw/i40iw/i40iw_uk.c
new file mode 100644
index 000..d7ae9e6
--- /dev/null
+++ b/drivers/infiniband/hw/i40iw/i40iw_uk.c
@@ -0,0 +1,1213 @@
+/***
+*
+* Copyright (c) 2015 Intel Corporation.  All rights reserved.
+*
+* This software is available to you under a choice of one of two
+* licenses.  You may choose to be licensed under the terms of the GNU
+* General Public License (GPL) Version 2, available from the file
+* COPYING in the main directory of this source tree, or the
+* OpenFabrics.org BSD license below:
+*
+*   Redistribution and use in source and binary forms, with or
+*   without modification, are permitted provided that the following
+*   conditions are met:
+*
+*- Redistributions of source code must retain the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer.
+*
+*- Redistributions in binary form must reproduce the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer in the documentation and/or other materials
+*  provided with the distribution.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+* SOFTWARE.
+*
+***/
+
+#include "i40iw_osdep.h"
+#include "i40iw_status.h"
+#include "i40iw_d.h"
+#include "i40iw_user.h"
+#include "i40iw_register.h"
+
+static u32 nop_signature = 0x;
+
+/**
+ * i40iw_nop_1 - insert a nop wqe and move head. no post work
+ * @qp: hw qp ptr
+ */
+static enum i40iw_status_code i40iw_nop_1(struct i40iw_qp_uk *qp)
+{
+   u64 header, *wqe;
+   u64 *wqe_0 = NULL;
+   u32 wqe_idx, peek_head;
+   bool signaled = false;
+
+   if (!qp->sq_ring.head)
+   return I40IW_ERR_PARAM;
+
+   wqe_idx = I40IW_RING_GETCURRENT_HEAD(qp->sq_ring);
+   wqe = &qp->sq_base[wqe_idx << 2];
+   peek_head = (qp->sq_ring.head + 1) % qp->sq_ring.size;
+   wqe_0 = &qp->sq_base[peek_head << 2];
+   if (peek_head)
+   wqe_0[3] = LS_64(!qp->swqe_polarity, I40IWQPSQ_VALID);
+   else
+   wqe_0[3] = LS_64(qp->swqe_polarity, I40IWQPSQ_VALID);
+
+   set_64bit_val(wqe, 0, 0);
+   set_64bit_val(wqe, 8, 0);
+   set_64bit_val(wqe, 16, 0);
+
+   header = LS_64(I40IWQP_OP_NOP, I40IWQPSQ_OPCODE) |
+   LS_64(signaled, I40IWQPSQ_SIGCOMPL) |
+   LS_64(qp->swqe_polarity, I40IWQPSQ_VALID) | nop_signature++;
+
+   wmb();  /* Memory barrier to ensure data is written before valid bit is set */
+
+   set_64bit_val(wqe, 24, header);
+   return 0;
+}
+
+/**
+ * i40iw_qp_post_wr - post wr to hardware
+ * @qp: hw qp ptr
+ */
+void i40iw_qp_post_wr(struct i40iw_qp_uk *qp)
+{
+   u64 temp;
+   u32 hw_sq_tail;
+   u32 sw_sq_head;
+
+   wmb(); /* make sure valid bit is written */
+
+   /* read the doorbell shadow area */
+   get_64bit_val(qp->shadow_area, 0, &temp);
+
+   rmb(); /* make sure read is finished */
+
+   hw_sq_tail = (u32)RS_64(temp, I40IW_QP_DBSA_HW_SQ_TAIL);
+   sw_sq_head = I40IW_RING_GETCURRENT_HEAD(qp->sq_ring);
+   if (sw_sq_head != hw_sq_tail) {
+   if (sw_sq_head > qp->initial_ring.head) {
+   if ((hw_sq_tail >= qp->initial_ring.head) &&
+   (hw_sq_tail < sw_sq_head)) {
+   db_wr32(qp->wqe_alloc_reg, qp->qp_id);
+   }
+   } else if (sw_sq_head != qp->initial_ring.head) {
+   if ((hw_sq_tail >= qp->initial_ring.head) ||
+   (hw_sq_tail < sw_sq_head)) {
+   db_wr32(qp->wqe_alloc_reg, qp->qp_id);
+   }
+   }
+   }
+
+   qp->initial_ring.head = qp->sq_ring.head;
+}
+
+/**
+ * i40iw_qp_ring_push_db -  ring qp doorbell
+ * @qp: hw qp ptr
+ * @wqe_idx: wqe index
+ */
+static void i40iw_qp_ring_push_db(struct i40iw_qp_uk *qp, u

[PATCH 07/15] i40iw: add hw and utils files

2015-12-16 Thread Faisal Latif
i40iw_hw.c, i40iw_utils.c and i40iw_osdep.h are files to handle
interrupts and processing.

Acked-by: Anjali Singhai Jain 
Acked-by: Shannon Nelson 
Signed-off-by: Faisal Latif 
---
 drivers/infiniband/hw/i40iw/i40iw_hw.c|  705 +
 drivers/infiniband/hw/i40iw/i40iw_osdep.h |  235 ++
 drivers/infiniband/hw/i40iw/i40iw_utils.c | 1233 +
 3 files changed, 2173 insertions(+)
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_hw.c
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_osdep.h
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_utils.c

diff --git a/drivers/infiniband/hw/i40iw/i40iw_hw.c b/drivers/infiniband/hw/i40iw/i40iw_hw.c
new file mode 100644
index 000..13d0d9e
--- /dev/null
+++ b/drivers/infiniband/hw/i40iw/i40iw_hw.c
@@ -0,0 +1,705 @@
+/***
+*
+* Copyright (c) 2015 Intel Corporation.  All rights reserved.
+*
+* This software is available to you under a choice of one of two
+* licenses.  You may choose to be licensed under the terms of the GNU
+* General Public License (GPL) Version 2, available from the file
+* COPYING in the main directory of this source tree, or the
+* OpenFabrics.org BSD license below:
+*
+*   Redistribution and use in source and binary forms, with or
+*   without modification, are permitted provided that the following
+*   conditions are met:
+*
+*- Redistributions of source code must retain the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer.
+*
+*- Redistributions in binary form must reproduce the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer in the documentation and/or other materials
+*  provided with the distribution.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+* SOFTWARE.
+*
+***/
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "i40iw.h"
+
+/**
+ * i40iw_initialize_hw_resources - initialize hw resource during open
+ * @iwdev: iwarp device
+ */
+u32 i40iw_initialize_hw_resources(struct i40iw_device *iwdev)
+{
+   unsigned long num_pds;
+   u32 resources_size;
+   u32 max_mr;
+   u32 max_qp;
+   u32 max_cq;
+   u32 arp_table_size;
+   u32 mrdrvbits;
+   void *resource_ptr;
+
+   max_qp = iwdev->sc_dev.hmc_info->hmc_obj[I40IW_HMC_IW_QP].cnt;
+   max_cq = iwdev->sc_dev.hmc_info->hmc_obj[I40IW_HMC_IW_CQ].cnt;
+   max_mr = iwdev->sc_dev.hmc_info->hmc_obj[I40IW_HMC_IW_MR].cnt;
+   arp_table_size = iwdev->sc_dev.hmc_info->hmc_obj[I40IW_HMC_IW_ARP].cnt;
+   iwdev->max_cqe = 0xF;
+   num_pds = max_qp * 4;
+   resources_size = sizeof(struct i40iw_arp_entry) * arp_table_size;
+   resources_size += sizeof(unsigned long) * BITS_TO_LONGS(max_qp);
+   resources_size += sizeof(unsigned long) * BITS_TO_LONGS(max_mr);
+   resources_size += sizeof(unsigned long) * BITS_TO_LONGS(max_cq);
+   resources_size += sizeof(unsigned long) * BITS_TO_LONGS(num_pds);
+   resources_size += sizeof(unsigned long) * BITS_TO_LONGS(arp_table_size);
+   resources_size += sizeof(struct i40iw_qp **) * max_qp;
+   iwdev->mem_resources = kzalloc(resources_size, GFP_KERNEL);
+
+   if (!iwdev->mem_resources)
+   return -ENOMEM;
+
+   iwdev->max_qp = max_qp;
+   iwdev->max_mr = max_mr;
+   iwdev->max_cq = max_cq;
+   iwdev->max_pd = num_pds;
+   iwdev->arp_table_size = arp_table_size;
+   iwdev->arp_table = (struct i40iw_arp_entry *)iwdev->mem_resources;
+   resource_ptr = iwdev->mem_resources + (sizeof(struct i40iw_arp_entry) * arp_table_size);
+
+   iwdev->device_cap_flags = IB_DEVICE_LOCAL_DMA_LKEY |
+   IB_DEVICE_MEM_WINDOW | IB_DEVICE_MEM_MGT_EXTENSIONS;
+
+   iwdev->allocated_qps = resource_ptr;
+   iwdev->allocated_cqs = &iwdev->allocated_qps[BITS_TO_LONGS(max_qp)];
+   iwdev->allocated_mrs = &iwdev->allocated_cqs[BITS_TO_LONGS(max_cq)];
+   iwdev->allocated_pds = &iwdev->allocated_mrs[BITS_TO_LONGS(max_mr)];
+   iwdev->allocated_arps = &iwdev->allocated_pds[BITS_TO_LONGS(num_pds)];
+   iwdev->qp_table = (struct i40iw_qp **)(&iwdev->allocated_arps[BITS_TO_LONGS(arp_table_size)]);
+   set_bit(0, iwdev->allocated_mrs);
+   set_bit(0, iwdev->allocated_qps);
+   set_bit(0, iwdev->allocated_cqs);
+   set_bit(0, iwdev->allocated_pds);
+   s

[PATCH 02/15] i40iw: add main, hdr, status

2015-12-16 Thread Faisal Latif
i40iw_main.c contains routines for i40e <=> i40iw interface and setup.
i40iw.h is header file for main device data structures.
i40iw_status.h is for return status codes.

Acked-by: Anjali Singhai Jain 
Acked-by: Shannon Nelson 
Signed-off-by: Faisal Latif 
---
 drivers/infiniband/hw/i40iw/i40iw.h|  573 +
 drivers/infiniband/hw/i40iw/i40iw_main.c   | 1905 
 drivers/infiniband/hw/i40iw/i40iw_status.h |  100 ++
 3 files changed, 2578 insertions(+)
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw.h
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_main.c
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_status.h

diff --git a/drivers/infiniband/hw/i40iw/i40iw.h b/drivers/infiniband/hw/i40iw/i40iw.h
new file mode 100644
index 000..c048f06b
--- /dev/null
+++ b/drivers/infiniband/hw/i40iw/i40iw.h
@@ -0,0 +1,573 @@
+/***
+*
+* Copyright (c) 2015 Intel Corporation.  All rights reserved.
+*
+* This software is available to you under a choice of one of two
+* licenses.  You may choose to be licensed under the terms of the GNU
+* General Public License (GPL) Version 2, available from the file
+* COPYING in the main directory of this source tree, or the
+* OpenFabrics.org BSD license below:
+*
+*   Redistribution and use in source and binary forms, with or
+*   without modification, are permitted provided that the following
+*   conditions are met:
+*
+*- Redistributions of source code must retain the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer.
+*
+*- Redistributions in binary form must reproduce the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer in the documentation and/or other materials
+*  provided with the distribution.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+* SOFTWARE.
+*
+***/
+
+#ifndef I40IW_IW_H
+#define I40IW_IW_H
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "i40iw_status.h"
+#include "i40iw_osdep.h"
+#include "i40iw_d.h"
+#include "i40iw_hmc.h"
+
+#include 
+#include "i40iw_type.h"
+#include "i40iw_p.h"
+#include "i40iw_ucontext.h"
+#include "i40iw_pble.h"
+#include "i40iw_verbs.h"
+#include "i40iw_cm.h"
+#include "i40iw_user.h"
+#include "i40iw_puda.h"
+
+#define I40IW_FW_VERSION  2
+#define I40IW_HW_VERSION  2
+
+#define I40IW_ARP_ADD 1
+#define I40IW_ARP_DELETE  2
+#define I40IW_ARP_RESOLVE 3
+
+#define I40IW_MACIP_ADD 1
+#define I40IW_MACIP_DELETE  2
+
+#define IW_CCQ_SIZE (I40IW_CQP_SW_SQSIZE_2048 + 1)
+#define IW_CEQ_SIZE 2048
+#define IW_AEQ_SIZE 2048
+
+#define RX_BUF_SIZE(1536 + 8)
+#define IW_REG0_SIZE   (4 * 1024)
+#define IW_TX_TIMEOUT  (6 * HZ)
+#define IW_FIRST_QPN   1
+#define IW_SW_CONTEXT_ALIGN1024
+
+#define MAX_DPC_ITERATIONS 128
+
+#define I40IW_EVENT_TIMEOUT10
+#define I40IW_VCHNL_EVENT_TIMEOUT  10
+
+#defineI40IW_NO_VLAN   0x
+#defineI40IW_NO_QSET   0x
+
+/* access to mcast filter list */
+#define IW_ADD_MCAST false
+#define IW_DEL_MCAST true
+
+#define I40IW_DRV_OPT_ENABLE_MPA_VER_0 0x0001
+#define I40IW_DRV_OPT_DISABLE_MPA_CRC  0x0002
+#define I40IW_DRV_OPT_DISABLE_FIRST_WRITE  0x0004
+#define I40IW_DRV_OPT_DISABLE_INTF 0x0008
+#define I40IW_DRV_OPT_ENABLE_MSI   0x0010
+#define I40IW_DRV_OPT_DUAL_LOGICAL_PORT0x0020
+#define I40IW_DRV_OPT_NO_INLINE_DATA   0x0080
+#define I40IW_DRV_OPT_DISABLE_INT_MOD  0x0100
+#define I40IW_DRV_OPT_DISABLE_VIRT_WQ  0x0200
+#define I40IW_DRV_OPT_ENABLE_PAU   0x0400
+#define I40IW_DRV_OPT_MCAST_LOGPORT_MAP0x0800
+
+#define IW_HMC_OBJ_TYPE_NUM ARRAY_SIZE(iw_hmc_obj_types)
+#define IW_CFG_FPM_QP_COUNT32768
+
+#define I40IW_MTU_TO_MSS   40
+#define I40IW_DEFAULT_MSS  1460
+
+struct i40iw_cqp_compl_info {
+   u32 op_ret_val;
+   u16 maj_err_code;
+   u16 min_err_code;
+   bool error;
+   u8 op_code;
+};
+
+#define CHECK_CQP_REQ(cqp_request) \
+{  \
+   if (!cqp_request) { \
+

[PATCH 04/15] i40iw: add puda code

2015-12-16 Thread Faisal Latif
i40iw_puda.[ch] are files to handle iwarp connection packets as
well as exception packets over multiple privilege mode uda queues.

Acked-by: Anjali Singhai Jain 
Acked-by: Shannon Nelson 
Signed-off-by: Faisal Latif 
---
 drivers/infiniband/hw/i40iw/i40iw_puda.c | 1443 ++
 drivers/infiniband/hw/i40iw/i40iw_puda.h |  183 
 2 files changed, 1626 insertions(+)
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_puda.c
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_puda.h

diff --git a/drivers/infiniband/hw/i40iw/i40iw_puda.c b/drivers/infiniband/hw/i40iw/i40iw_puda.c
new file mode 100644
index 000..8e628af
--- /dev/null
+++ b/drivers/infiniband/hw/i40iw/i40iw_puda.c
@@ -0,0 +1,1443 @@
+/***
+*
+* Copyright (c) 2015 Intel Corporation.  All rights reserved.
+*
+* This software is available to you under a choice of one of two
+* licenses.  You may choose to be licensed under the terms of the GNU
+* General Public License (GPL) Version 2, available from the file
+* COPYING in the main directory of this source tree, or the
+* OpenFabrics.org BSD license below:
+*
+*   Redistribution and use in source and binary forms, with or
+*   without modification, are permitted provided that the following
+*   conditions are met:
+*
+*- Redistributions of source code must retain the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer.
+*
+*- Redistributions in binary form must reproduce the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer in the documentation and/or other materials
+*  provided with the distribution.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+* SOFTWARE.
+*
+***/
+
+#include "i40iw_osdep.h"
+#include "i40iw_register.h"
+#include "i40iw_status.h"
+#include "i40iw_hmc.h"
+
+#include "i40iw_d.h"
+#include "i40iw_type.h"
+#include "i40iw_p.h"
+#include "i40iw_puda.h"
+
+static void i40iw_ieq_receive(struct i40iw_sc_dev *dev,
+ struct i40iw_puda_buf *buf);
+static void i40iw_ieq_tx_compl(struct i40iw_sc_dev *dev, void *sqwrid);
+static void i40iw_ilq_putback_rcvbuf(struct i40iw_sc_qp *qp, u32 wqe_idx);
+static enum i40iw_status_code i40iw_puda_replenish_rq(struct i40iw_puda_rsrc
+ *rsrc, bool initial);
+/**
+ * i40iw_puda_get_listbuf - get buffer from puda list
+ * @list: list to use for buffers (ILQ or IEQ)
+ */
+static struct i40iw_puda_buf *i40iw_puda_get_listbuf(struct list_head *list)
+{
+   struct i40iw_puda_buf *buf = NULL;
+
+   if (!list_empty(list)) {
+   buf = (struct i40iw_puda_buf *)list->next;
+   list_del((struct list_head *)&buf->list);
+   }
+   return buf;
+}
+
+/**
+ * i40iw_puda_get_bufpool - return buffer from resource
+ * @rsrc: resource to use for buffer
+ */
+struct i40iw_puda_buf *i40iw_puda_get_bufpool(struct i40iw_puda_rsrc *rsrc)
+{
+   struct i40iw_puda_buf *buf = NULL;
+   struct list_head *list = &rsrc->bufpool;
+   unsigned long   flags;
+
+   spin_lock_irqsave(&rsrc->bufpool_lock, flags);
+   buf = i40iw_puda_get_listbuf(list);
+   if (buf)
+   rsrc->avail_buf_count--;
+   else
+   rsrc->stats_buf_alloc_fail++;
+   spin_unlock_irqrestore(&rsrc->bufpool_lock, flags);
+   return buf;
+}
+
+/**
+ * i40iw_puda_ret_bufpool - return buffer to rsrc list
+ * @rsrc: resource to use for buffer
+ * @buf: buffer to return to resource
+ */
+void i40iw_puda_ret_bufpool(struct i40iw_puda_rsrc *rsrc,
+   struct i40iw_puda_buf *buf)
+{
+   unsigned long   flags;
+
+   spin_lock_irqsave(&rsrc->bufpool_lock, flags);
+   list_add(&buf->list, &rsrc->bufpool);
+   spin_unlock_irqrestore(&rsrc->bufpool_lock, flags);
+   rsrc->avail_buf_count++;
+}
+
+/**
+ * i40iw_puda_post_recvbuf - set wqe for rcv buffer
+ * @rsrc: resource ptr
+ * @wqe_idx: wqe index to use
+ * @buf: puda buffer for rcv q
+ * @initial: flag if during init time
+ */
+static void i40iw_puda_post_recvbuf(struct i40iw_puda_rsrc *rsrc, u32 wqe_idx,
+   struct i40iw_puda_buf *buf, bool initial)
+{
+   u64 *wqe;
+   struct i40iw_sc_qp *qp = &rsrc->qp;
+   u64 offset24 = 0;
+
+   qp->qp_uk.rq_wrid_array[wqe_idx] = (uintptr_t)buf;
+   wqe = &qp->qp_uk.rq_base[

[PATCH 01/15] i40e: Add support for client interface for IWARP driver

2015-12-16 Thread Faisal Latif
From: Anjali Singhai Jain 

This patch adds a Client interface for i40iw driver
support. Also expands the Virtchannel to support messages
from i40evf driver on behalf of i40iwvf driver.

This client API is used by the i40iw and i40iwvf driver
to access the core driver resources brokered by the i40e driver.

Signed-off-by: Anjali Singhai Jain 
---
 drivers/net/ethernet/intel/i40e/Makefile   |1 +
 drivers/net/ethernet/intel/i40e/i40e.h |   22 +
 drivers/net/ethernet/intel/i40e/i40e_client.c  | 1012 
 drivers/net/ethernet/intel/i40e/i40e_client.h  |  232 +
 drivers/net/ethernet/intel/i40e/i40e_main.c|  115 ++-
 drivers/net/ethernet/intel/i40e/i40e_type.h|3 +-
 drivers/net/ethernet/intel/i40e/i40e_virtchnl.h|   34 +
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |  247 -
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.h |4 +
 9 files changed, 1657 insertions(+), 13 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_client.c
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_client.h

diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
index b4729ba..3b3c63e 100644
--- a/drivers/net/ethernet/intel/i40e/Makefile
+++ b/drivers/net/ethernet/intel/i40e/Makefile
@@ -41,6 +41,7 @@ i40e-objs := i40e_main.o \
i40e_diag.o \
i40e_txrx.o \
i40e_ptp.o  \
+   i40e_client.o   \
i40e_virtchnl_pf.o
 
 i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 4dd3e26..1417ae8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -59,6 +59,7 @@
 #ifdef I40E_FCOE
 #include "i40e_fcoe.h"
 #endif
+#include "i40e_client.h"
 #include "i40e_virtchnl.h"
 #include "i40e_virtchnl_pf.h"
 #include "i40e_txrx.h"
@@ -178,6 +179,7 @@ struct i40e_lump_tracking {
u16 search_hint;
u16 list[0];
 #define I40E_PILE_VALID_BIT  0x8000
+#define I40E_IWARP_IRQ_PILE_ID  (I40E_PILE_VALID_BIT - 2)
 };
 
 #define I40E_DEFAULT_ATR_SAMPLE_RATE   20
@@ -264,6 +266,8 @@ struct i40e_pf {
 #endif /* I40E_FCOE */
u16 num_lan_qps;   /* num lan queues this PF has set up */
u16 num_lan_msix;  /* num queue vectors for the base PF vsi */
+   u16 num_iwarp_msix;/* num of iwarp vectors for this PF */
+   int iwarp_base_vector;
int queues_left;   /* queues left unclaimed */
u16 rss_size;  /* num queues in the RSS array */
u16 rss_size_max;  /* HW defined max RSS queues */
@@ -313,6 +317,7 @@ struct i40e_pf {
 #define I40E_FLAG_16BYTE_RX_DESC_ENABLED   BIT_ULL(13)
 #define I40E_FLAG_CLEAN_ADMINQ BIT_ULL(14)
 #define I40E_FLAG_FILTER_SYNC  BIT_ULL(15)
+#define I40E_FLAG_SERVICE_CLIENT_REQUESTED BIT_ULL(16)
 #define I40E_FLAG_PROCESS_MDD_EVENTBIT_ULL(17)
 #define I40E_FLAG_PROCESS_VFLR_EVENT   BIT_ULL(18)
 #define I40E_FLAG_SRIOV_ENABLEDBIT_ULL(19)
@@ -550,6 +555,8 @@ struct i40e_vsi {
struct kobject *kobj;  /* sysfs object */
bool current_isup; /* Sync 'link up' logging */
 
+   void *priv; /* client driver data reference. */
+
/* VSI specific handlers */
irqreturn_t (*irq_handler)(int irq, void *data);
 
@@ -702,6 +709,10 @@ void i40e_vsi_setup_queue_map(struct i40e_vsi *vsi,
  struct i40e_vsi_context *ctxt,
  u8 enabled_tc, bool is_add);
 #endif
+void i40e_service_event_schedule(struct i40e_pf *pf);
+void i40e_notify_client_of_vf_msg(struct i40e_vsi *vsi, u32 vf_id,
+ u8 *msg, u16 len);
+
 int i40e_vsi_control_rings(struct i40e_vsi *vsi, bool enable);
 int i40e_reconfig_rss_queues(struct i40e_pf *pf, int queue_count);
 struct i40e_veb *i40e_veb_setup(struct i40e_pf *pf, u16 flags, u16 uplink_seid,
@@ -724,6 +735,17 @@ static inline void i40e_dbg_pf_exit(struct i40e_pf *pf) {}
 static inline void i40e_dbg_init(void) {}
 static inline void i40e_dbg_exit(void) {}
 #endif /* CONFIG_DEBUG_FS*/
+/* needed by client drivers */
+int i40e_lan_add_device(struct i40e_pf *pf);
+int i40e_lan_del_device(struct i40e_pf *pf);
+void i40e_client_subtask(struct i40e_pf *pf);
+void i40e_notify_client_of_l2_param_changes(struct i40e_vsi *vsi);
+void i40e_notify_client_of_netdev_open(struct i40e_vsi *vsi);
+void i40e_notify_client_of_netdev_close(struct i40e_vsi *vsi, bool reset);
+void i40e_notify_client_of_vf_enable(struct i40e_pf *pf, u32 num_vfs);
+void i40e_notify_client_of_vf_reset(struct i40e_pf *pf, u32 vf_id);
+int i40e_vf_client_capable(struct i40e_pf *pf, u32 vf_id,
+  enum i40e_client_type type);
 /**
  * i40e_irq_dynamic_enable - Enable default interrupt generation settings
  * @vsi:

[PATCH 09/15] i40iw: add file to handle cqp calls

2015-12-16 Thread Faisal Latif
i40iw_ctrl.c provides hardware wqe support and cqp handling.

Acked-by: Anjali Singhai Jain 
Acked-by: Shannon Nelson 
Signed-off-by: Faisal Latif 
---
 drivers/infiniband/hw/i40iw/i40iw_ctrl.c | 4774 ++
 1 file changed, 4774 insertions(+)
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_ctrl.c

diff --git a/drivers/infiniband/hw/i40iw/i40iw_ctrl.c b/drivers/infiniband/hw/i40iw/i40iw_ctrl.c
new file mode 100644
index 000..d0f2a23
--- /dev/null
+++ b/drivers/infiniband/hw/i40iw/i40iw_ctrl.c
@@ -0,0 +1,4774 @@
+/***
+*
+* Copyright (c) 2015 Intel Corporation.  All rights reserved.
+*
+* This software is available to you under a choice of one of two
+* licenses.  You may choose to be licensed under the terms of the GNU
+* General Public License (GPL) Version 2, available from the file
+* COPYING in the main directory of this source tree, or the
+* OpenFabrics.org BSD license below:
+*
+*   Redistribution and use in source and binary forms, with or
+*   without modification, are permitted provided that the following
+*   conditions are met:
+*
+*- Redistributions of source code must retain the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer.
+*
+*- Redistributions in binary form must reproduce the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer in the documentation and/or other materials
+*  provided with the distribution.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+* SOFTWARE.
+*
+***/
+
+#include "i40iw_osdep.h"
+#include "i40iw_register.h"
+#include "i40iw_status.h"
+#include "i40iw_hmc.h"
+
+#include "i40iw_d.h"
+#include "i40iw_type.h"
+#include "i40iw_p.h"
+#include "i40iw_vf.h"
+#include "i40iw_virtchnl.h"
+
+/**
+ * i40iw_insert_wqe_hdr - write wqe header
+ * @wqe: cqp wqe for header
+ * @header: header for the cqp wqe
+ */
+static inline void i40iw_insert_wqe_hdr(u64 *wqe, u64 header)
+{
+   wmb();/* make sure WQE is populated before valid bit is set */
+   set_64bit_val(wqe, 24, header);
+}
+
+/**
+ * i40iw_get_cqp_reg_info - get head and tail for cqp using registers
+ * @cqp: struct for cqp hw
+ * @val: cqp tail register value
+ * @tail:wqtail register value
+ * @error: cqp processing err
+ */
+static inline void i40iw_get_cqp_reg_info(struct i40iw_sc_cqp *cqp,
+ u32 *val,
+ u32 *tail,
+ u32 *error)
+{
+   if (cqp->dev->is_pf) {
+   *val = rd32(cqp->dev->hw, I40E_PFPE_CQPTAIL);
+   *tail = RS_32(*val, I40E_PFPE_CQPTAIL_WQTAIL);
+   *error = RS_32(*val, I40E_PFPE_CQPTAIL_CQP_OP_ERR);
+   } else {
+   *val = rd32(cqp->dev->hw, I40E_VFPE_CQPTAIL1);
+   *tail = RS_32(*val, I40E_VFPE_CQPTAIL_WQTAIL);
+   *error = RS_32(*val, I40E_VFPE_CQPTAIL_CQP_OP_ERR);
+   }
+}
+
+/**
+ * i40iw_cqp_poll_registers - poll cqp registers
+ * @cqp: struct for cqp hw
+ * @tail:wqtail register value
+ * @count: how many times to try for completion
+ */
+static enum i40iw_status_code i40iw_cqp_poll_registers(
+   struct i40iw_sc_cqp *cqp,
+   u32 tail,
+   u32 count)
+{
+   u32 i = 0;
+   u32 newtail, error, val;
+
+   while (i < count) {
+   i++;
+   i40iw_get_cqp_reg_info(cqp, &val, &newtail, &error);
+   if (error) {
+   error = (cqp->dev->is_pf) ?
+rd32(cqp->dev->hw, I40E_PFPE_CQPERRCODES) :
+rd32(cqp->dev->hw, I40E_VFPE_CQPERRCODES1);
+   return I40IW_ERR_CQP_COMPL_ERROR;
+   }
+   if (newtail != tail) {
+   /* SUCCESS */
+   I40IW_RING_MOVE_TAIL(cqp->sq_ring);
+   return 0;
+   }
+   udelay(I40IW_SLEEP_COUNT);
+   }
+   return I40IW_ERR_TIMEOUT;
+}
+
+/**
+ * i40iw_sc_parse_fpm_commit_buf - parse fpm commit buffer
+ * @buf: ptr to fpm commit buffer
+ * @info: ptr to i40iw_hmc_obj_info struct
+ *
+ * parses fpm commit info and copy base value
+ * of hmc objects in hmc_info
+ */
+static enum i40iw_statu

[PATCH 08/15] i40iw: add files for iwarp interface

2015-12-16 Thread Faisal Latif
i40iw_verbs.[ch] are to handle iwarp interface.

Acked-by: Anjali Singhai Jain 
Acked-by: Shannon Nelson 
Signed-off-by: Faisal Latif 
---
 drivers/infiniband/hw/i40iw/i40iw_ucontext.h |  110 ++
 drivers/infiniband/hw/i40iw/i40iw_verbs.c| 2492 ++
 drivers/infiniband/hw/i40iw/i40iw_verbs.h|  173 ++
 3 files changed, 2775 insertions(+)
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_ucontext.h
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_verbs.c
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_verbs.h

diff --git a/drivers/infiniband/hw/i40iw/i40iw_ucontext.h b/drivers/infiniband/hw/i40iw/i40iw_ucontext.h
new file mode 100644
index 000..5c65c25
--- /dev/null
+++ b/drivers/infiniband/hw/i40iw/i40iw_ucontext.h
@@ -0,0 +1,110 @@
+/*
+ * Copyright (c) 2006 - 2015 Intel Corporation.  All rights reserved.
+ * Copyright (c) 2005 Topspin Communications.  All rights reserved.
+ * Copyright (c) 2005 Cisco Systems.  All rights reserved.
+ * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ */
+
+#ifndef I40IW_USER_CONTEXT_H
+#define I40IW_USER_CONTEXT_H
+
+#include 
+
+#define I40IW_ABI_USERSPACE_VER 4
+#define I40IW_ABI_KERNEL_VER4
+struct i40iw_alloc_ucontext_req {
+   __u32 reserved32;
+   __u8 userspace_ver;
+   __u8 reserved8[3];
+};
+
+struct i40iw_alloc_ucontext_resp {
+   __u32 max_pds;  /* maximum pds allowed for this user process */
+   __u32 max_qps;  /* maximum qps allowed for this user process */
+   __u32 wq_size;  /* size of the WQs (sq+rq) allocated to the mmaped area */
+   __u8 kernel_ver;
+   __u8 reserved[3];
+};
+
+struct i40iw_alloc_pd_resp {
+   __u32 pd_id;
+   __u8 reserved[4];
+};
+
+struct i40iw_create_cq_req {
+   __u64 user_cq_buffer;
+   __u64 user_shadow_area;
+};
+
+struct i40iw_create_qp_req {
+   __u64 user_wqe_buffers;
+   __u64 user_compl_ctx;
+
+   /* UDA QP PHB */
+   __u64 user_sq_phb;  /* place for VA of the sq phb buff */
+   __u64 user_rq_phb;  /* place for VA of the rq phb buff */
+};
+
+enum i40iw_memreg_type {
+   IW_MEMREG_TYPE_MEM = 0x,
+   IW_MEMREG_TYPE_QP = 0x0001,
+   IW_MEMREG_TYPE_CQ = 0x0002,
+   IW_MEMREG_TYPE_MW = 0x0003,
+   IW_MEMREG_TYPE_FMR = 0x0004,
+   IW_MEMREG_TYPE_FMEM = 0x0005,
+};
+
+struct i40iw_mem_reg_req {
+   __u16 reg_type; /* Memory, QP or CQ */
+   __u16 cq_pages;
+   __u16 rq_pages;
+   __u16 sq_pages;
+};
+
+struct i40iw_create_cq_resp {
+   __u32 cq_id;
+   __u32 cq_size;
+   __u32 mmap_db_index;
+   __u32 reserved;
+};
+
+struct i40iw_create_qp_resp {
+   __u32 qp_id;
+   __u32 actual_sq_size;
+   __u32 actual_rq_size;
+   __u32 i40iw_drv_opt;
+   __u16 push_idx;
+   __u8  lsmm;
+   __u8  rsvd2;
+};
+
+#endif
diff --git a/drivers/infiniband/hw/i40iw/i40iw_verbs.c b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
new file mode 100644
index 000..9bdd95f
--- /dev/null
+++ b/drivers/infiniband/hw/i40iw/i40iw_verbs.c
@@ -0,0 +1,2492 @@
+/***
+*
+* Copyright (c) 2015 Intel Corporation.  All rights reserved.
+*
+* This software is available to you under a choice of one of two
+* licenses.  You may choose to be licensed under the terms of the GNU
+* General Public License (GPL) Version 2, available from the file
+* COPYING in the main directory of this source tree, or the
+* OpenFabrics.org BSD license below:
+*
+*   Redistribution and use 

[PATCH 15/15] i40iw: changes for build of i40iw module

2015-12-16 Thread Faisal Latif
Update MAINTAINERS, Kconfig, Makefile and rdma_netlink.h to include
i40iw.

Signed-off-by: Faisal Latif 
---
 MAINTAINERS  | 10 ++
 drivers/infiniband/Kconfig   |  1 +
 drivers/infiniband/hw/Makefile   |  1 +
 include/uapi/rdma/rdma_netlink.h |  1 +
 4 files changed, 13 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 69c8a9c..fc0ee30 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5600,6 +5600,16 @@ F:   Documentation/networking/i40evf.txt
 F: drivers/net/ethernet/intel/
 F: drivers/net/ethernet/intel/*/
 
+INTEL RDMA RNIC DRIVER
+M: Faisal Latif 
+R: Chien Tin Tung 
+R: Mustafa Ismail 
+R: Shiraz Saleem 
+R: Tatyana Nikolova 
+L: linux-rdma@vger.kernel.org
+S: Supported
+F: drivers/infiniband/hw/i40iw/
+
 INTEL-MID GPIO DRIVER
 M: David Cohen 
 L: linux-g...@vger.kernel.org
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index aa26f3c..7ddd81f 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -58,6 +58,7 @@ source "drivers/infiniband/hw/mthca/Kconfig"
 source "drivers/infiniband/hw/qib/Kconfig"
 source "drivers/infiniband/hw/cxgb3/Kconfig"
 source "drivers/infiniband/hw/cxgb4/Kconfig"
+source "drivers/infiniband/hw/i40iw/Kconfig"
 source "drivers/infiniband/hw/mlx4/Kconfig"
 source "drivers/infiniband/hw/mlx5/Kconfig"
 source "drivers/infiniband/hw/nes/Kconfig"
diff --git a/drivers/infiniband/hw/Makefile b/drivers/infiniband/hw/Makefile
index aded2a5..c7ad0a4 100644
--- a/drivers/infiniband/hw/Makefile
+++ b/drivers/infiniband/hw/Makefile
@@ -2,6 +2,7 @@ obj-$(CONFIG_INFINIBAND_MTHCA)  += mthca/
 obj-$(CONFIG_INFINIBAND_QIB)   += qib/
 obj-$(CONFIG_INFINIBAND_CXGB3) += cxgb3/
 obj-$(CONFIG_INFINIBAND_CXGB4) += cxgb4/
+obj-$(CONFIG_INFINIBAND_I40IW) += i40iw/
 obj-$(CONFIG_MLX4_INFINIBAND)  += mlx4/
 obj-$(CONFIG_MLX5_INFINIBAND)  += mlx5/
 obj-$(CONFIG_INFINIBAND_NES)   += nes/
diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
index c19a5dc..56bafbe 100644
--- a/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -5,6 +5,7 @@
 
 enum {
RDMA_NL_RDMA_CM = 1,
+   RDMA_NL_I40IW,
RDMA_NL_NES,
RDMA_NL_C4IW,
RDMA_NL_LS, /* RDMA Local Services */
-- 
2.5.3

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 13/15] i40iw: virtual channel handling files

2015-12-16 Thread Faisal Latif
i40iw_vf.[ch] and i40iw_virtchnl[ch] are used for virtual
channel support for iWARP VF module.

Acked-by: Anjali Singhai Jain 
Acked-by: Shannon Nelson 
Signed-off-by: Faisal Latif 
---
 drivers/infiniband/hw/i40iw/i40iw_vf.c   |  85 +++
 drivers/infiniband/hw/i40iw/i40iw_vf.h   |  62 +++
 drivers/infiniband/hw/i40iw/i40iw_virtchnl.c | 750 +++
 drivers/infiniband/hw/i40iw/i40iw_virtchnl.h | 124 +
 4 files changed, 1021 insertions(+)
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_vf.c
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_vf.h
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_virtchnl.c
 create mode 100644 drivers/infiniband/hw/i40iw/i40iw_virtchnl.h

diff --git a/drivers/infiniband/hw/i40iw/i40iw_vf.c b/drivers/infiniband/hw/i40iw/i40iw_vf.c
new file mode 100644
index 000..39bb0ca
--- /dev/null
+++ b/drivers/infiniband/hw/i40iw/i40iw_vf.c
@@ -0,0 +1,85 @@
+/***
+*
+* Copyright (c) 2015 Intel Corporation.  All rights reserved.
+*
+* This software is available to you under a choice of one of two
+* licenses.  You may choose to be licensed under the terms of the GNU
+* General Public License (GPL) Version 2, available from the file
+* COPYING in the main directory of this source tree, or the
+* OpenFabrics.org BSD license below:
+*
+*   Redistribution and use in source and binary forms, with or
+*   without modification, are permitted provided that the following
+*   conditions are met:
+*
+*- Redistributions of source code must retain the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer.
+*
+*- Redistributions in binary form must reproduce the above
+*  copyright notice, this list of conditions and the following
+*  disclaimer in the documentation and/or other materials
+*  provided with the distribution.
+*
+* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+* SOFTWARE.
+*
+***/
+
+#include "i40iw_osdep.h"
+#include "i40iw_register.h"
+#include "i40iw_status.h"
+#include "i40iw_hmc.h"
+#include "i40iw_d.h"
+#include "i40iw_type.h"
+#include "i40iw_p.h"
+#include "i40iw_vf.h"
+
+/**
+ * i40iw_manage_vf_pble_bp - manage vf pble
+ * @cqp: cqp for cqp' sq wqe
+ * @info: pble info
+ * @scratch: pointer for completion
+ * @post_sq: to post and ring
+ */
+enum i40iw_status_code i40iw_manage_vf_pble_bp(struct i40iw_sc_cqp *cqp,
+  struct i40iw_manage_vf_pble_info *info,
+  u64 scratch,
+  bool post_sq)
+{
+   u64 *wqe;
+   u64 temp, header, pd_pl_pba = 0;
+
+   wqe = i40iw_sc_cqp_get_next_send_wqe(cqp, scratch);
+   if (!wqe)
+   return I40IW_ERR_RING_FULL;
+
+   temp = LS_64((info->pd_entry_cnt), I40IW_CQPSQ_MVPBP_PD_ENTRY_CNT) |
+   LS_64((info->first_pd_index), I40IW_CQPSQ_MVPBP_FIRST_PD_INX) |
+   LS_64((info->sd_index), I40IW_CQPSQ_MVPBP_SD_INX);
+   set_64bit_val(wqe, 16, temp);
+
+   header = LS_64((info->inv_pd_ent ? 1 : 0), I40IW_CQPSQ_MVPBP_INV_PD_ENT) |
+   LS_64(I40IW_CQP_OP_MANAGE_VF_PBLE_BP, I40IW_CQPSQ_OPCODE) |
+   LS_64(cqp->polarity, I40IW_CQPSQ_WQEVALID);
+   set_64bit_val(wqe, 24, header);
+
+   pd_pl_pba = LS_64(info->pd_pl_pba >> 3, I40IW_CQPSQ_MVPBP_PD_PLPBA);
+   set_64bit_val(wqe, 32, pd_pl_pba);
+
+   i40iw_debug_buf(cqp->dev, I40IW_DEBUG_WQE, "MANAGE VF_PBLE_BP WQE", wqe, I40IW_CQP_WQE_SIZE * 8);
+
+   if (post_sq)
+   i40iw_sc_cqp_post_sq(cqp);
+   return 0;
+}
+
+struct i40iw_vf_cqp_ops iw_vf_cqp_ops = {
+   i40iw_manage_vf_pble_bp
+};
diff --git a/drivers/infiniband/hw/i40iw/i40iw_vf.h b/drivers/infiniband/hw/i40iw/i40iw_vf.h
new file mode 100644
index 000..cfe112d
--- /dev/null
+++ b/drivers/infiniband/hw/i40iw/i40iw_vf.h
@@ -0,0 +1,62 @@
+/***
+*
+* Copyright (c) 2015 Intel Corporation.  All rights reserved.
+*
+* This software is available to you under a choice of one of two
+* licenses.  You may choose to be licensed under the terms of the GNU
+* General Public License (GPL) Version 2, available from the file
+* COPYING in the main directory of this source tree, or the
+* OpenFabrics.org BSD license below:
+*
+*   Redistribution and use in source and binary forms, with

[PATCH 14/15] i40iw: Kconfig and Kbuild for iwarp module

2015-12-16 Thread Faisal Latif
Kconfig and Kbuild needed to build iwarp module.

Signed-off-by: Faisal Latif 
---
 drivers/infiniband/hw/i40iw/Kbuild  | 43 +
 drivers/infiniband/hw/i40iw/Kconfig |  7 ++
 2 files changed, 50 insertions(+)
 create mode 100644 drivers/infiniband/hw/i40iw/Kbuild
 create mode 100644 drivers/infiniband/hw/i40iw/Kconfig

diff --git a/drivers/infiniband/hw/i40iw/Kbuild b/drivers/infiniband/hw/i40iw/Kbuild
new file mode 100644
index 000..ba84a78
--- /dev/null
+++ b/drivers/infiniband/hw/i40iw/Kbuild
@@ -0,0 +1,43 @@
+
+#
+# * Copyright (c) 2015 Intel Corporation.  All rights reserved.
+# *
+# * This software is available to you under a choice of one of two
+# * licenses.  You may choose to be licensed under the terms of the GNU
+# * General Public License (GPL) Version 2, available from the file
+# * COPYING in the main directory of this source tree, or the
+# * OpenFabrics.org BSD license below:
+# *
+# *   Redistribution and use in source and binary forms, with or
+# *   without modification, are permitted provided that the following
+# *   conditions are met:
+# *
+# *- Redistributions of source code must retain the above
+# *copyright notice, this list of conditions and the following
+# *disclaimer.
+# *
+# *- Redistributions in binary form must reproduce the above
+# *copyright notice, this list of conditions and the following
+# *disclaimer in the documentation and/or other materials
+# *provided with the distribution.
+# *
+# * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+# * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+# * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+# * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+# * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+# * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+# * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# * SOFTWARE.
+#
+
+
+ccflags-y :=  -Idrivers/net/ethernet/intel/i40e
+
+obj-m += i40iw.o
+
+i40iw-objs :=\
+   i40iw_cm.o i40iw_ctrl.o \
+   i40iw_hmc.o i40iw_hw.o i40iw_main.o  \
+   i40iw_pble.o i40iw_puda.o i40iw_uk.o i40iw_utils.o \
+   i40iw_verbs.o i40iw_virtchnl.o i40iw_vf.o
diff --git a/drivers/infiniband/hw/i40iw/Kconfig b/drivers/infiniband/hw/i40iw/Kconfig
new file mode 100644
index 000..6e7d27a
--- /dev/null
+++ b/drivers/infiniband/hw/i40iw/Kconfig
@@ -0,0 +1,7 @@
+config INFINIBAND_I40IW
+   tristate "Intel(R) Ethernet X722 iWARP Driver"
+   depends on INET && I40E
+   select GENERIC_ALLOCATOR
+   ---help---
+   Intel(R) Ethernet X722 iWARP Driver
+   INET && I40IW && INFINIBAND && I40E
-- 
2.5.3



[PATCH 00/15] add Intel(R) X722 iWARP driver

2015-12-16 Thread Faisal Latif
This series contains the addition of the i40iw.ko driver.

This driver provides iWARP RDMA functionality for the Intel(R) X722
Ethernet controller for PCI Physical Functions. It also has support
for the Virtual Function driver (i40iwvf.ko) that will be part of a separate
patch series.

It cooperates with the Intel(R) X722 base driver (i40e.ko) to allocate
resources and program the controller.

This series include 1 patch to i40e.ko to provide interface support
to i40iw.ko. The interface provides a driver registration mechanism,
resource allocations, and device reset coordination mechanisms.

This patch series is based on Doug Ledford's
/github.com/dledford/linux.git


Anjali Singhai Jain (1):
net/ethernet/intel/i40e: Add support for client interface for IWARP driver

Faisal Latif (14):
infiniband/hw/i40iw: add main, hdr, status
infiniband/hw/i40iw: add connection management code
infiniband/hw/i40iw: add puda code
infiniband/hw/i40iw: add pble resource files
infiniband/hw/i40iw: add hmc resource files
infiniband/hw/i40iw: add hw and utils files
infiniband/hw/i40iw: add files for iwarp interface
infiniband/hw/i40iw: add file to handle cqp calls
infiniband/hw/i40iw: add hardware related header files
infiniband/hw/i40iw: add X722 register file
infiniband/hw/i40iw: user kernel shared files
infiniband/hw/i40iw: virtual channel handling files
infiniband/hw/i40iw: Kconfig and Kbuild for iwarp module
infiniband/hw/i40iw: changes for build of i40iw module



Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-16 Thread Christoph Lameter
On Wed, 16 Dec 2015, Christoph Lameter wrote:

> DRAFT: This is missing the check if this device supports
> extended counters.

Found some time, so here is the patch with detection of the extended
attributes through a MAD request. Untested. The details of how to form
the proper MAD request come from an earlier patch by Or in 2011.


Subject: IB Core: Display extended counter set if available V2

Check if the extended counters are available and if so
create the proper extended and additional counters.

Signed-off-by: Christoph Lameter 

Index: linux/drivers/infiniband/core/sysfs.c
===
--- linux.orig/drivers/infiniband/core/sysfs.c
+++ linux/drivers/infiniband/core/sysfs.c
@@ -39,6 +39,7 @@
 #include 

 #include 
+#include 

 struct ib_port {
struct kobject kobj;
@@ -65,6 +66,7 @@ struct port_table_attribute {
struct port_attribute   attr;
charname[8];
int index;
+   int attr_id;
 };

 static ssize_t port_attr_show(struct kobject *kobj,
@@ -314,24 +316,33 @@ static ssize_t show_port_pkey(struct ib_
 #define PORT_PMA_ATTR(_name, _counter, _width, _offset)
\
 struct port_table_attribute port_pma_attr_##_name = {  \
.attr  = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),\
-   .index = (_offset) | ((_width) << 16) | ((_counter) << 24)  \
+   .index = (_offset) | ((_width) << 16) | ((_counter) << 24), \
+   .attr_id = IB_PMA_PORT_COUNTERS ,   \
 }

-static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
-   char *buf)
+#define PORT_PMA_ATTR_EXT(_name, _width, _offset)  \
+struct port_table_attribute port_pma_attr_ext_##_name = {  \
+   .attr  = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),\
+   .index = (_offset) | ((_width) << 16),  \
+   .attr_id = IB_PMA_PORT_COUNTERS_EXT ,   \
+}
+
+
+/*
+ * Get a MAD block of data.
+ * Returns error code or the number of bytes retrieved.
+ */
+static int get_mad(struct ib_device *dev, int port_num, int attr,
+   void *data, int offset, size_t size)
 {
-   struct port_table_attribute *tab_attr =
-   container_of(attr, struct port_table_attribute, attr);
-   int offset = tab_attr->index & 0x;
-   int width  = (tab_attr->index >> 16) & 0xff;
-   struct ib_mad *in_mad  = NULL;
-   struct ib_mad *out_mad = NULL;
+   struct ib_mad *in_mad;
+   struct ib_mad *out_mad;
size_t mad_size = sizeof(*out_mad);
u16 out_mad_pkey_index = 0;
ssize_t ret;

-   if (!p->ibdev->process_mad)
-   return sprintf(buf, "N/A (no PMA)\n");
+   if (!dev->process_mad)
+   return -ENOSYS;

in_mad  = kzalloc(sizeof *in_mad, GFP_KERNEL);
out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL);
@@ -344,12 +355,12 @@ static ssize_t show_pma_counter(struct i
in_mad->mad_hdr.mgmt_class= IB_MGMT_CLASS_PERF_MGMT;
in_mad->mad_hdr.class_version = 1;
in_mad->mad_hdr.method= IB_MGMT_METHOD_GET;
-   in_mad->mad_hdr.attr_id   = cpu_to_be16(0x12); /* PortCounters */
+   in_mad->mad_hdr.attr_id   = attr;

-   in_mad->data[41] = p->port_num; /* PortSelect field */
+   in_mad->data[41] = port_num;/* PortSelect field */

-   if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY,
-p->port_num, NULL, NULL,
+   if ((dev->process_mad(dev, IB_MAD_IGNORE_MKEY,
+port_num, NULL, NULL,
 (const struct ib_mad_hdr *)in_mad, mad_size,
 (struct ib_mad_hdr *)out_mad, &mad_size,
 &out_mad_pkey_index) &
@@ -358,31 +369,54 @@ static ssize_t show_pma_counter(struct i
ret = -EINVAL;
goto out;
}
+   memcpy(data, out_mad->data + offset, size);
+   ret = size;
+out:
+   kfree(in_mad);
+   kfree(out_mad);
+   return ret;
+}
+
+static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
+   char *buf)
+{
+   struct port_table_attribute *tab_attr =
+   container_of(attr, struct port_table_attribute, attr);
+   int offset = tab_attr->index & 0x;
+   int width  = (tab_attr->index >> 16) & 0xff;
+   ssize_t ret;
+   u8 data[8];
+
+   ret = get_mad(p->ibdev, p->port_num, tab_attr->attr_id, &data,
+   40 + offset / 8, sizeof(data));
+   if (ret < 0)
+   return sprintf(buf, "N/A (no PMA)\n");

switch (width) {
case 4:
-   ret = sprintf(buf, "%u\n", (out_mad->data[40 + offset / 8] >>
+   ret = sprintf(buf, "%u\n", (*data >>
 

Re: [PATCH] svc_rdma: use local_dma_lkey

2015-12-16 Thread Jason Gunthorpe
On Wed, Dec 16, 2015 at 04:11:04PM +0100, Christoph Hellwig wrote:
> We now always have a per-PD local_dma_lkey available.  Make use of that
> fact in svc_rdma and stop registering our own MR.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Jason Gunthorpe 

> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -144,6 +144,7 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
>  
>   head->arg.pages[pg_no] = rqstp->rq_arg.pages[pg_no];
>   head->arg.page_len += len;
> +
>   head->arg.len += len;
>   if (!pg_off)
>   head->count++;

Was this hunk deliberate?

Jason


Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path

2015-12-16 Thread Jason Gunthorpe
On Wed, Dec 16, 2015 at 11:26:39AM +0100, Michael Wang wrote:
> 
> On 12/15/2015 06:30 PM, Jason Gunthorpe wrote:
> > On Tue, Dec 15, 2015 at 05:38:34PM +0100, Michael Wang wrote:
> >> The hop_limit is only suggest that the package allowed to be
> >> routed, not have to, correct?
> > 
> > If the hop limit is >= 2 (?) then the GRH is mandatory. The
> > SM will return this information in the PathRecord if it determines a
> > GRH is required. The whole stack follows this protocol.
> > 
> > The GRH is optional for in-subnet communications.
> 
> Thanks for the explain :-)
> 
> I've rechecked the ib_init_ah_from_path() again, and found it
> still set IB_AH_GRH when the GID cache missing, but with:

How do you mean?

ah_attr->ah_flags = IB_AH_GRH;
ah_attr->grh.dgid = rec->dgid;

ret = ib_find_cached_gid(device, &rec->sgid, ndev, &port_num,
 &gid_index);
if (ret) {
if (ndev)
dev_put(ndev);
return ret;
}

If find_cached_gid fails then ib_init_ah_from_path also fails.

Is there a case where ib_find_cached_gid can succeed but not return
good data?

I agree it would read nicer if the ah_flags and gr.dgid was moved
after the ib_find_cached_gid

> BTW, cma_sidr_rep_handler() also call ib_init_ah_from_path() with out
> a check on return.

That sounds like a problem.

Jason


Re: [PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-16 Thread Jason Gunthorpe
On Wed, Dec 16, 2015 at 08:56:01AM +0200, Moni Shoua wrote:

> I can't object to that but I really would like to get an example of a
> security risk.

How can anyone give you an example when nobody knows exactly how mlx
hardware works in this area?

From a kapi perspective, the security design is very simple.

Every single UD packet the kapi side has to process must be
unambiguously associated with a gid_index or dropped. Period full
stop. I would think that is an obvious conclusion based on the design
of the gid cache.

This is why we need a clear API to get this critical information. It
should not be open coded in init_ah_from_wc, it should not be done
some other way in the CMA code.

This is a simple matter of sane kapi design, nothing more.

I have no idea why this is so objectionable.

> scattered to the receive bufs anyway. So, if there is a security hole
> it exists from day one of the IB stack and this is not the time we
> should insist on fixing it.

IB isn't interacting with the net stack in the same way rocev2 is, so
this is not a pre-existing problem.

Jason


Re: [PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-16 Thread Jason Gunthorpe
On Wed, Dec 16, 2015 at 09:57:02AM +, Liran Liss wrote:
> Currently, namespaces are not supported for RoCE.

IMHO, we should not be accepting rocev2 without at least basic
namespace support too, since it is fairly trivial to do based on the
work that is already done for verbs. An obvious missing piece is the
'wc to gid index' API I keep asking for.

> That said, we have everything we need for RoCE namespace support when we get 
> there.

Then there is no problem with the 'wc to gid index' stuff, so stop
complaining about it.

> All of this has nothing to do with "broken" and enshrining anything in the 
> kapi.
> That's just bullshit.

No, it is a critique of the bad kAPI choices in this patch that mean
it broadly doesn't use namespaces, net devices or IP routing
correctly.

> The design of the RDMA stack is that Verbs are used by core IB
> services, such as addressing.  For these services, as the
> specification requires, all relevant fields must be reported in the
> CQE, period.  All spec-compliant HW devices follow this.

Wrong, the kapi needs to meet the needs of the kernel, and is
influenced but not set by the various standards.

That means we get to make better choices in the kapi than exposing
wc.network_type.

> If a ULP wants to create an address handle from a completion, there
> are service routines to accomplish that, based on the reported
> fields.  If it doesn't care, there is no reason to sacrifice
> performance.

I have no idea why you think there would be a performance sacrifice,
maybe you should review the patches and my remarks again.

Jason


Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-16 Thread Doug Ledford
On 12/15/2015 04:46 PM, Doug Ledford wrote:
> On 12/15/2015 04:42 PM, Hal Rosenstock wrote:
>> On 12/15/2015 4:20 PM, Jason Gunthorpe wrote:
 The unicast/multicast extended counters are not always supported -
> depends on setting of PerfMgt ClassPortInfo
> CapabilityMask.IsExtendedWidthSupportedNoIETF (bit 10).
>>
>>> Yes.. certainly this proposed patch needs to account for that and
>>> continue to use the 32 bit ones in that case.
>>
>> There are no 32 bit equivalents of those 4 "IETF" counters ([uni
>> multi]cast [xmit rcv] pkts).
>>
>> When not supported, perhaps it is best not to populate these counters in
>> sysfs so one can discern between counter not supported and 0 value.
>>
>> I'm still working on definitive mthca answer but think the attribute is
>> not supported there. Does anyone out there have an mthca setup where
>> they can try this ?
> 
> Yes.
> 
> 

From my mthca machine:

[root@rdma-dev-04 ~]$ lspci | grep Mellanox
01:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex
(Tavor compatibility mode) (rev 20)
[root@rdma-dev-04 ~]$ perfquery
# Port counters: Lid 36 port 1 (CapMask: 0x00)
PortSelect:..1
CounterSelect:...0x
SymbolErrorCounter:..0
LinkErrorRecoveryCounter:0
LinkDownedCounter:...0
PortRcvErrors:...0
PortRcvRemotePhysicalErrors:.0
PortRcvSwitchRelayErrors:0
PortXmitDiscards:0
PortXmitConstraintErrors:0
PortRcvConstraintErrors:.0
CounterSelect2:..0x00
LocalLinkIntegrityErrors:0
ExcessiveBufferOverrunErrors:0
VL15Dropped:.1
PortXmitData:2470620192
PortRcvData:.2401094563
PortXmitPkts:6363544
PortRcvPkts:.6321251
[root@rdma-dev-04 ~]$ perfquery -x
ibwarn: [29831] dump_perfcounters: PerfMgt ClassPortInfo 0x0; No
extended counter support indicated

perfquery: iberror: failed: perfextquery

So, no extended counters on this device.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD




signature.asc
Description: OpenPGP digital signature


Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-16 Thread Christoph Lameter
On Tue, 15 Dec 2015, Jason Gunthorpe wrote:

> > The unicast/multicast extended counters are not always supported -
> > depends on setting of PerfMgt ClassPortInfo
> > CapabilityMask.IsExtendedWidthSupportedNoIETF (bit 10).
>
> Yes.. certainly this proposed patch needs to account for that and
> continue to use the 32 bit ones in that case.

So this is in struct ib_class_port_info the capability_mask? This does not
seem to be used anywhere in the IB core.

Here is a draft patch to change the counters depending on a bit (which I
do not know how to determine). So this would hopefully work if someone
would insert the proper check. Note that this patch no longer needs the
earlier 2 patches.

From: Christoph Lameter 
Subject: IB Core: Display extended counter set if available

Check if the extended counters are available and if so
create the proper extended and additional counters.

DRAFT: This is missing the check if this device supports
extended counters.

Signed-off-by: Christoph Lameter 

Index: linux/drivers/infiniband/core/sysfs.c
===
--- linux.orig/drivers/infiniband/core/sysfs.c
+++ linux/drivers/infiniband/core/sysfs.c
@@ -39,6 +39,7 @@
 #include 

 #include 
+#include 

 struct ib_port {
struct kobject kobj;
@@ -65,6 +66,7 @@ struct port_table_attribute {
struct port_attribute   attr;
charname[8];
int index;
+   int attr_id;
 };

 static ssize_t port_attr_show(struct kobject *kobj,
@@ -314,7 +316,15 @@ static ssize_t show_port_pkey(struct ib_
 #define PORT_PMA_ATTR(_name, _counter, _width, _offset)
\
 struct port_table_attribute port_pma_attr_##_name = {  \
.attr  = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),\
-   .index = (_offset) | ((_width) << 16) | ((_counter) << 24)  \
+   .index = (_offset) | ((_width) << 16) | ((_counter) << 24), \
+   .attr_id = IB_PMA_PORT_COUNTERS ,   \
+}
+
+#define PORT_PMA_ATTR_EXT(_name, _width, _offset)  \
+struct port_table_attribute port_pma_attr_ext_##_name = {  \
+   .attr  = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),\
+   .index = (_offset) | ((_width) << 16),  \
+   .attr_id = IB_PMA_PORT_COUNTERS_EXT ,   \
 }

 static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr,
@@ -344,7 +354,7 @@ static ssize_t show_pma_counter(struct i
in_mad->mad_hdr.mgmt_class= IB_MGMT_CLASS_PERF_MGMT;
in_mad->mad_hdr.class_version = 1;
in_mad->mad_hdr.method= IB_MGMT_METHOD_GET;
-   in_mad->mad_hdr.attr_id   = cpu_to_be16(0x12); /* PortCounters */
+   in_mad->mad_hdr.attr_id   = tab_attr->attr_id;

in_mad->data[41] = p->port_num; /* PortSelect field */

@@ -375,6 +385,11 @@ static ssize_t show_pma_counter(struct i
ret = sprintf(buf, "%u\n",
  be32_to_cpup((__be32 *)(out_mad->data + 40 + 
offset / 8)));
break;
+   case 64:
+   ret = sprintf(buf, "%llu\n",
+   be64_to_cpup((__be64 *)(out_mad->data + 40 + 
offset / 8)));
+   break;
+
default:
ret = 0;
}
@@ -403,6 +418,18 @@ static PORT_PMA_ATTR(port_rcv_data
 static PORT_PMA_ATTR(port_xmit_packets , 14, 32, 256);
 static PORT_PMA_ATTR(port_rcv_packets  , 15, 32, 288);

+/*
+ * Counters added by extended set
+ */
+static PORT_PMA_ATTR_EXT(port_xmit_data, 64,  64);
+static PORT_PMA_ATTR_EXT(port_rcv_data , 64, 128);
+static PORT_PMA_ATTR_EXT(port_xmit_packets , 64, 192);
+static PORT_PMA_ATTR_EXT(port_rcv_packets  , 64, 256);
+static PORT_PMA_ATTR_EXT(unicast_xmit_packets  , 64, 320);
+static PORT_PMA_ATTR_EXT(unicast_rcv_packets   , 64, 384);
+static PORT_PMA_ATTR_EXT(multicast_xmit_packets, 64, 448);
+static PORT_PMA_ATTR_EXT(multicast_rcv_packets , 64, 512);
+
 static struct attribute *pma_attrs[] = {
&port_pma_attr_symbol_error.attr.attr,
&port_pma_attr_link_error_recovery.attr.attr,
@@ -423,11 +450,40 @@ static struct attribute *pma_attrs[] = {
NULL
 };

+static struct attribute *pma_attrs_ext[] = {
+   &port_pma_attr_symbol_error.attr.attr,
+   &port_pma_attr_link_error_recovery.attr.attr,
+   &port_pma_attr_link_downed.attr.attr,
+   &port_pma_attr_port_rcv_errors.attr.attr,
+   &port_pma_attr_port_rcv_remote_physical_errors.attr.attr,
+   &port_pma_attr_port_rcv_switch_relay_errors.attr.attr,
+   &port_pma_attr_port_xmit_discards.attr.attr,
+   &port_pma_attr_port_xmit_constraint_errors.attr.attr,
+   &port_pma_attr_port_rcv_constraint_errors.attr.attr,
+   &port_pma_attr_loc

Re: [PATCH] svc_rdma: use local_dma_lkey

2015-12-16 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set

2015-12-16 Thread Christoph Lameter
On Tue, 15 Dec 2015, Doug Ledford wrote:

> On 12/15/2015 04:42 PM, Hal Rosenstock wrote:
> > On 12/15/2015 4:20 PM, Jason Gunthorpe wrote:
> >>> The unicast/multicast extended counters are not always supported -
>  depends on setting of PerfMgt ClassPortInfo
>  CapabilityMask.IsExtendedWidthSupportedNoIETF (bit 10).
> >
> >> Yes.. certainly this proposed patch needs to account for that and
> >> continue to use the 32 bit ones in that case.
> >
> > There are no 32 bit equivalents of those 4 "IETF" counters ([uni
> > multi]cast [xmit rcv] pkts).
> >
> > When not supported, perhaps it is best not to populate these counters in
> > sysfs so one can discern between counter not supported and 0 value.
> >
> > I'm still working on definitive mthca answer but think the attribute is
> > not supported there. Does anyone out there have an mthca setup where
> > they can try this ?
>
> Yes.

We can return ENOSYS for the counters not supported.

Or simply not create the sysfs files when the device is instantiated as
well as fall back to the 32 bit counters on instantiation for those
devices not supporting the extended set.





Re: [PATCH] svc_rdma: use local_dma_lkey

2015-12-16 Thread Chuck Lever

> On Dec 16, 2015, at 10:11 AM, Christoph Hellwig  wrote:
> 
> We now always have a per-PD local_dma_lkey available.  Make use of that
> fact in svc_rdma and stop registering our own MR.
> 
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Chuck Lever 

> ---
> include/linux/sunrpc/svc_rdma.h|  2 --
> net/sunrpc/xprtrdma/svc_rdma_backchannel.c |  2 +-
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c|  4 ++--
> net/sunrpc/xprtrdma/svc_rdma_sendto.c  |  6 ++---
> net/sunrpc/xprtrdma/svc_rdma_transport.c   | 36 --
> 5 files changed, 10 insertions(+), 40 deletions(-)
> 
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index b13513a..5322fea 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -156,13 +156,11 @@ struct svcxprt_rdma {
>   struct ib_qp *sc_qp;
>   struct ib_cq *sc_rq_cq;
>   struct ib_cq *sc_sq_cq;
> - struct ib_mr *sc_phys_mr;   /* MR for server memory */
>   int  (*sc_reader)(struct svcxprt_rdma *,
> struct svc_rqst *,
> struct svc_rdma_op_ctxt *,
> int *, u32 *, u32, u32, u64, bool);
>   u32  sc_dev_caps;   /* distilled device caps */
> - u32  sc_dma_lkey;   /* local dma key */
>   unsigned int sc_frmr_pg_list_len;
>   struct list_head sc_frmr_q;
>   spinlock_t   sc_frmr_q_lock;
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c 
> b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> index 417cec1..c428734 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> @@ -128,7 +128,7 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
> 
>   ctxt->wr_op = IB_WR_SEND;
>   ctxt->direction = DMA_TO_DEVICE;
> - ctxt->sge[0].lkey = rdma->sc_dma_lkey;
> + ctxt->sge[0].lkey = rdma->sc_pd->local_dma_lkey;
>   ctxt->sge[0].length = sndbuf->len;
>   ctxt->sge[0].addr =
>   ib_dma_map_page(rdma->sc_cm_id->device, ctxt->pages[0], 0,
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
> b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index 3dfe464..c8b8a8b 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -144,6 +144,7 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
> 
>   head->arg.pages[pg_no] = rqstp->rq_arg.pages[pg_no];
>   head->arg.page_len += len;
> +
>   head->arg.len += len;
>   if (!pg_off)
>   head->count++;
> @@ -160,8 +161,7 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
>   goto err;
>   atomic_inc(&xprt->sc_dma_used);
> 
> - /* The lkey here is either a local dma lkey or a dma_mr lkey */
> - ctxt->sge[pno].lkey = xprt->sc_dma_lkey;
> + ctxt->sge[pno].lkey = xprt->sc_pd->local_dma_lkey;
>   ctxt->sge[pno].length = len;
>   ctxt->count++;
> 
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
> b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index ced3151..20bd5d4 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -265,7 +265,7 @@ static int send_write(struct svcxprt_rdma *xprt, struct 
> svc_rqst *rqstp,
>sge[sge_no].addr))
>   goto err;
>   atomic_inc(&xprt->sc_dma_used);
> - sge[sge_no].lkey = xprt->sc_dma_lkey;
> + sge[sge_no].lkey = xprt->sc_pd->local_dma_lkey;
>   ctxt->count++;
>   sge_off = 0;
>   sge_no++;
> @@ -487,7 +487,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
>   ctxt->count = 1;
> 
>   /* Prepare the SGE for the RPCRDMA Header */
> - ctxt->sge[0].lkey = rdma->sc_dma_lkey;
> + ctxt->sge[0].lkey = rdma->sc_pd->local_dma_lkey;
>   ctxt->sge[0].length = svc_rdma_xdr_get_reply_hdr_len(rdma_resp);
>   ctxt->sge[0].addr =
>   ib_dma_map_page(rdma->sc_cm_id->device, page, 0,
> @@ -511,7 +511,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
>ctxt->sge[sge_no].addr))
>   goto err;
>   atomic_inc(&rdma->sc_dma_used);
> - ctxt->sge[sge_no].lkey = rdma->sc_dma_lkey;
> + ctxt->sge[sge_no].lkey = rdma->sc_pd->local_dma_lkey;
>   ctxt->sge[sge_no].length = sge_bytes;
>   }
>   if (byte_count != 0) {
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
> b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index abfbd02..faf4c49 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -232,11 +232,11 @@ void svc_

Re: [PATCH v3 04/11] xprtrdma: Move struct ib_send_wr off the stack

2015-12-16 Thread Christoph Hellwig
On Wed, Dec 16, 2015 at 10:13:31AM -0500, Chuck Lever wrote:
> > Shouldn't be an issue with transparent unions these days:
> > 
> > union {
> > struct ib_reg_wrfr_regwr;
> > struct ib_send_wr   fr_invwr;
> > };
> 
> Right, but isn't that a gcc-ism that Al hates? If
> everyone is OK with that construction, I will use it.

It started out as a GNUism, but is now supported in C11.  We use it
a lot all over the kernel.


[PATCH] svc_rdma: use local_dma_lkey

2015-12-16 Thread Christoph Hellwig
We now always have a per-PD local_dma_lkey available.  Make use of that
fact in svc_rdma and stop registering our own MR.

Signed-off-by: Christoph Hellwig 
---
 include/linux/sunrpc/svc_rdma.h|  2 --
 net/sunrpc/xprtrdma/svc_rdma_backchannel.c |  2 +-
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c|  4 ++--
 net/sunrpc/xprtrdma/svc_rdma_sendto.c  |  6 ++---
 net/sunrpc/xprtrdma/svc_rdma_transport.c   | 36 --
 5 files changed, 10 insertions(+), 40 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index b13513a..5322fea 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -156,13 +156,11 @@ struct svcxprt_rdma {
struct ib_qp *sc_qp;
struct ib_cq *sc_rq_cq;
struct ib_cq *sc_sq_cq;
-   struct ib_mr *sc_phys_mr;   /* MR for server memory */
int  (*sc_reader)(struct svcxprt_rdma *,
  struct svc_rqst *,
  struct svc_rdma_op_ctxt *,
  int *, u32 *, u32, u32, u64, bool);
u32  sc_dev_caps;   /* distilled device caps */
-   u32  sc_dma_lkey;   /* local dma key */
unsigned int sc_frmr_pg_list_len;
struct list_head sc_frmr_q;
spinlock_t   sc_frmr_q_lock;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c 
b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
index 417cec1..c428734 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
@@ -128,7 +128,7 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
 
ctxt->wr_op = IB_WR_SEND;
ctxt->direction = DMA_TO_DEVICE;
-   ctxt->sge[0].lkey = rdma->sc_dma_lkey;
+   ctxt->sge[0].lkey = rdma->sc_pd->local_dma_lkey;
ctxt->sge[0].length = sndbuf->len;
ctxt->sge[0].addr =
ib_dma_map_page(rdma->sc_cm_id->device, ctxt->pages[0], 0,
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 3dfe464..c8b8a8b 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -144,6 +144,7 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
 
head->arg.pages[pg_no] = rqstp->rq_arg.pages[pg_no];
head->arg.page_len += len;
+
head->arg.len += len;
if (!pg_off)
head->count++;
@@ -160,8 +161,7 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
goto err;
atomic_inc(&xprt->sc_dma_used);
 
-   /* The lkey here is either a local dma lkey or a dma_mr lkey */
-   ctxt->sge[pno].lkey = xprt->sc_dma_lkey;
+   ctxt->sge[pno].lkey = xprt->sc_pd->local_dma_lkey;
ctxt->sge[pno].length = len;
ctxt->count++;
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index ced3151..20bd5d4 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -265,7 +265,7 @@ static int send_write(struct svcxprt_rdma *xprt, struct 
svc_rqst *rqstp,
 sge[sge_no].addr))
goto err;
atomic_inc(&xprt->sc_dma_used);
-   sge[sge_no].lkey = xprt->sc_dma_lkey;
+   sge[sge_no].lkey = xprt->sc_pd->local_dma_lkey;
ctxt->count++;
sge_off = 0;
sge_no++;
@@ -487,7 +487,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
ctxt->count = 1;
 
/* Prepare the SGE for the RPCRDMA Header */
-   ctxt->sge[0].lkey = rdma->sc_dma_lkey;
+   ctxt->sge[0].lkey = rdma->sc_pd->local_dma_lkey;
ctxt->sge[0].length = svc_rdma_xdr_get_reply_hdr_len(rdma_resp);
ctxt->sge[0].addr =
ib_dma_map_page(rdma->sc_cm_id->device, page, 0,
@@ -511,7 +511,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
 ctxt->sge[sge_no].addr))
goto err;
atomic_inc(&rdma->sc_dma_used);
-   ctxt->sge[sge_no].lkey = rdma->sc_dma_lkey;
+   ctxt->sge[sge_no].lkey = rdma->sc_pd->local_dma_lkey;
ctxt->sge[sge_no].length = sge_bytes;
}
if (byte_count != 0) {
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index abfbd02..faf4c49 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -232,11 +232,11 @@ void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt)
for (i = 0; i < ctxt->count && ctxt->sge[i].length; i++) {
/*
 * Unmap the DMA addr in the SGE 

small svc_rdma cleanup

2015-12-16 Thread Christoph Hellwig
This makes use of the now always available local_dma_lkey, and goes on top
of Chuck's "[PATCH v4 00/11] NFS/RDMA server patches for v4.5" series.



Re: [PATCH v3 04/11] xprtrdma: Move struct ib_send_wr off the stack

2015-12-16 Thread Chuck Lever

> On Dec 16, 2015, at 10:11 AM, Christoph Hellwig  wrote:
> 
> On Wed, Dec 16, 2015 at 10:06:33AM -0500, Chuck Lever wrote:
>>> Would it make sense to unionize these as they are guaranteed not to
>>> execute together? Some people don't like this sort of savings.
>> 
>> I dislike unions because they make the code that uses
>> them less readable. I can define macros to help that,
>> but sigh! OK.
> 
> Shouldn't be an issue with transparent unions these days:
> 
>   union {
>   struct ib_reg_wrfr_regwr;
>   struct ib_send_wr   fr_invwr;
>   };

Right, but isn't that a gcc-ism that Al hates? If
everyone is OK with that construction, I will use it.

--
Chuck Lever






Re: [PATCH v3 09/11] SUNRPC: Introduce xprt_commit_rqst()

2015-12-16 Thread Chuck Lever
Hi Anna-


> On Dec 16, 2015, at 8:48 AM, Anna Schumaker  wrote:
> 
> Hi Chuck,
> 
> Sorry for the last minute comment.
> 
> On 12/14/2015 04:19 PM, Chuck Lever wrote:
>> I'm about to add code in the RPC/RDMA reply handler between the
>> xprt_lookup_rqst() and xprt_complete_rqst() call site that needs
>> to execute outside of spinlock critical sections.
>> 
>> Add a hook to remove an rpc_rqst from the pending list once
>> the transport knows it's going to invoke xprt_complete_rqst().
>> 
>> Signed-off-by: Chuck Lever 
>> ---
>> include/linux/sunrpc/xprt.h|1 +
>> net/sunrpc/xprt.c  |   14 ++
>> net/sunrpc/xprtrdma/rpc_rdma.c |4 
>> 3 files changed, 19 insertions(+)
>> 
>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>> index 69ef5b3..ab6c3a5 100644
>> --- a/include/linux/sunrpc/xprt.h
>> +++ b/include/linux/sunrpc/xprt.h
>> @@ -366,6 +366,7 @@ void 
>> xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action);
>> void xprt_write_space(struct rpc_xprt *xprt);
>> void xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task 
>> *task, int result);
>> struct rpc_rqst *xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
>> +voidxprt_commit_rqst(struct rpc_task *task);
>> void xprt_complete_rqst(struct rpc_task *task, int copied);
>> void xprt_release_rqst_cong(struct rpc_task *task);
>> void xprt_disconnect_done(struct rpc_xprt *xprt);
>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>> index 2e98f4a..a5be4ab 100644
>> --- a/net/sunrpc/xprt.c
>> +++ b/net/sunrpc/xprt.c
>> @@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task)
>> }
>> 
>> /**
>> + * xprt_commit_rqst - remove rqst from pending list early
>> + * @task: RPC request to remove
>> + *
>> + * Caller holds transport lock.
>> + */
>> +void xprt_commit_rqst(struct rpc_task *task)
>> +{
>> +struct rpc_rqst *req = task->tk_rqstp;
>> +
>> +list_del_init(&req->rq_list);
>> +}
>> +EXPORT_SYMBOL_GPL(xprt_commit_rqst);
> 
> Can you move this function into the xprtrdma code, since it's not called 
> outside of there?

I think that's a layering violation, and the idea is
to allow other transports to use this API eventually.

But I'll include this change in the next version of
the series.


> Thanks,
> Anna
> 
>> +
>> +/**
>>  * xprt_complete_rqst - called when reply processing is complete
>>  * @task: RPC request that recently completed
>>  * @copied: actual number of bytes received from the transport
>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>> index c10d969..0bc8c39 100644
>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>> @@ -804,6 +804,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>>  if (req->rl_reply)
>>  goto out_duplicate;
>> 
>> +xprt_commit_rqst(rqst->rq_task);
>> +spin_unlock_bh(&xprt->transport_lock);
>> +
>>  dprintk("RPC:   %s: reply 0x%p completes request 0x%p\n"
>>  "   RPC request 0x%p xid 0x%08x\n",
>>  __func__, rep, req, rqst,
>> @@ -894,6 +897,7 @@ badheader:
>>  else if (credits > r_xprt->rx_buf.rb_max_requests)
>>  credits = r_xprt->rx_buf.rb_max_requests;
>> 
>> +spin_lock_bh(&xprt->transport_lock);
>>  cwnd = xprt->cwnd;
>>  xprt->cwnd = credits << RPC_CWNDSHIFT;
>>  if (xprt->cwnd > cwnd)
>> 
> 

--
Chuck Lever






Re: [PATCH v3 04/11] xprtrdma: Move struct ib_send_wr off the stack

2015-12-16 Thread Christoph Hellwig
On Wed, Dec 16, 2015 at 10:06:33AM -0500, Chuck Lever wrote:
> > Would it make sense to unionize these as they are guaranteed not to
> > execute together? Some people don't like this sort of savings.
> 
> I dislike unions because they make the code that uses
> them less readable. I can define macros to help that,
> but sigh! OK.

Shouldn't be an issue with transparent unions these days:

union {
struct ib_reg_wrfr_regwr;
struct ib_send_wr   fr_invwr;
};


Re: [PATCH v3 06/11] xprtrdma: Add ro_unmap_sync method for FRWR

2015-12-16 Thread Chuck Lever

> On Dec 16, 2015, at 8:57 AM, Sagi Grimberg  wrote:
> 
> 
>> +static void
>> +__frwr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
>> + int rc)
>> +{
>> +struct ib_device *device = r_xprt->rx_ia.ri_device;
>> +struct rpcrdma_mw *mw = seg->rl_mw;
>> +int nsegs = seg->mr_nsegs;
>> +
>> +seg->rl_mw = NULL;
>> +
>> +while (nsegs--)
>> +rpcrdma_unmap_one(device, seg++);
> 
> Chuck, shouldn't this be replaced with ib_dma_unmap_sg?

Looks like this was left over from before the conversion
to use ib_dma_unmap_sg. I'll have a look.

> Sorry for the late comment (Didn't find enough time to properly
> review this...)

--
Chuck Lever






Re: [PATCH v3 04/11] xprtrdma: Move struct ib_send_wr off the stack

2015-12-16 Thread Chuck Lever

> On Dec 16, 2015, at 9:00 AM, Sagi Grimberg  wrote:
> 
> 
>> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h 
>> b/net/sunrpc/xprtrdma/xprt_rdma.h
>> index 4197191..e60d817 100644
>> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
>> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
>> @@ -206,6 +206,8 @@ struct rpcrdma_frmr {
>>  enum rpcrdma_frmr_state fr_state;
>>  struct work_struct  fr_work;
>>  struct rpcrdma_xprt *fr_xprt;
>> +struct ib_reg_wrfr_regwr;
>> +struct ib_send_wr   fr_invwr;
> 
> Would it make sense to unionize these as they are guaranteed not to
> execute together? Some people don't like this sort of savings.

I dislike unions because they make the code that uses
them less readable. I can define macros to help that,
but sigh! OK.


--
Chuck Lever






Re: [PATCH for-next V2 5/5] IB/mlx5: Mmap the HCA's core clock register to user-space

2015-12-16 Thread Sagi Grimberg




  enum mlx5_ib_mmap_cmd {
MLX5_IB_MMAP_REGULAR_PAGE   = 0,
-   MLX5_IB_MMAP_GET_CONTIGUOUS_PAGES   = 1, /* always last */
+   MLX5_IB_MMAP_GET_CONTIGUOUS_PAGES   = 1,
+   /* 5 is chosen in order to be compatible with old versions of libmlx5 */
+   MLX5_IB_MMAP_CORE_CLOCK = 5,
  };


Overall the patches look good so I'd suggest not to apply atop of
the contig pages patchset from Yishai which obviously involves some
debate. Although if this bit is the only conflict then perhaps doug can
take care of it...


Re: [PATCH for-next V2 3/5] IB/mlx5: Add support for hca_core_clock and timestamp_mask

2015-12-16 Thread Matan Barak
On Wed, Dec 16, 2015 at 4:43 PM, Sagi Grimberg  wrote:
>
>> Reporting the hca_core_clock (in kHZ) and the timestamp_mask in
>> query_device extended verb. timestamp_mask is used by users in order
>> to know what is the valid range of the raw timestamps, while
>> hca_core_clock reports the clock frequency that is used for
>> timestamps.
>
>
> Hi Matan,
>
> Shouldn't this patch come last?
>

Not necessarily. In order to support completion timestamping (that's
what defined in this query_device patch), we only need create_cq_ex in
mlx5_ib.
The down stream patches adds support for reading the HCA core clock
(via query_values).
One could have completion timestamping support without having
ibv_query_values support.

Thanks for taking a look.



Re: [PATCH for-next V2 3/5] IB/mlx5: Add support for hca_core_clock and timestamp_mask

2015-12-16 Thread Sagi Grimberg



Reporting the hca_core_clock (in kHZ) and the timestamp_mask in
query_device extended verb. timestamp_mask is used by users in order
to know what is the valid range of the raw timestamps, while
hca_core_clock reports the clock frequency that is used for
timestamps.


Hi Matan,

Shouldn't this patch come last?


Re: [PATCH 37/37] IB/rdmavt: Add support for new memory registration API

2015-12-16 Thread Sagi Grimberg



This patch exists to provide parity for what is in qib. Should we not
have it?  If not, why do we have:

commit 38071a461f0a ("IB/qib: Support the new memory registration API")


That was done by me because I saw this in qib and assumed that it was
supported. Now that I found out that it isn't, I'd say it should be
removed altogether shouldn't it?



That doesn't mean it can't be added to rdmavt as a future enhancement
though if there is a need.


Well, given that we're trying to consolidate on post send registration
interface it's kind of a must I'd say.


Are you asking because soft-roce will need it?


I was asking in general, but in specific soft-roce as a consumer will
need to support that yes.


I think it makes sense to revisit when soft-roce comes in,


I agree.


since qib/hfi do not need IB_WR_LOCAL_INV.


Can you explain? Does qib/hfi have a magic way to invalidate memory
regions?


Re: [PATCH for-next V2 2/5] IB/core: Add ib_is_udata_cleared

2015-12-16 Thread Haggai Eran
On 15/12/2015 20:30, Matan Barak wrote:
> Extending core and vendor verb commands require us to check that the
> unknown part of the user's given command is all zeros.
> Adding ib_is_udata_cleared in order to do so.
> 
> Signed-off-by: Matan Barak 
Reviewed-by: Haggai Eran 



RE: [PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-16 Thread Liran Liss
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
> ow...@vger.kernel.org] On Behalf Of Doug Ledford

> In particular, Liran piped up with this comment:
> 
> "Also, I don't want to do any route resolution on the Rx path. A UD QP
> completion just reports the details of the packet it received.
> 
> Conceptually, an incoming packet may not even match an SGID index at all.
> Maybe, responses should be sent from another device. This should not be
> decided that the point that a packet was received."
> 
> The part that bothers me about this is that this statement makes sense when
> just thinking about the spec, as you say.  However, once you consider
> namespaces, security implications make this statement spec compliant, but
> still unacceptable.  The spec itself is silent on namespaces.  But, you guys
> wanted, and you got, namespace support.
> Since that's beyond spec, and carries security requirements, I think it's 
> fair to
> say that from now on, the Linux kernel RDMA stack can no longer *just* be
> spec compliant.  There are additional concerns that must always be
> addressed with new changes, and those are the namespace constraint
> preservation concerns.
> 

Hi Doug,

Currently, there is no namespace support for RoCE, so the RoCEv2 patches have 
*nothing* to do with this.
That said, the RoCE specification does not contradict or inhibit any future 
implementation for namespaces.
The CMA will get the  from ib_wc and resolve to a netdev (or 
sgid_index->netdev, whatever) and process the request accordingly.

We can have endless theoretical discussions on features that are not even 
implemented yet (e.g., RoCE namespace support) each time we add a minor 
straightforward, *spec-compliant* change that *all* RoCE vendors adhere to.
If someone wishes to introduce a new concept, API refactoring proposal, or 
similar for community review, please do so with a different RFC.

This is hindering progress of the whole RDMA stack development!
For example, the posted SoftRoCE patches are waiting just for this.

The RoCEv2 patches have been posted upstream for review for months (!) now.
I simply cannot understand why this is lagging for so long; let's start to get 
the wheels rolling.
--Liran



Re: [PATCH 37/37] IB/rdmavt: Add support for new memory registration API

2015-12-16 Thread Dennis Dalessandro

On Wed, Dec 16, 2015 at 03:21:02PM +0200, Sagi Grimberg wrote:



This question is not directly related to this patch, but given that
this is a copy-paste from the qib driver I'll go ahead and take it
anyway. How does qib (and rvt now) do memory key invalidation? I didn't
see any reference to IB_WR_LOCAL_INV anywhere in the qib driver...

What am I missing?


ping?


In short, it doesn't look like qib or hfi1 support this.


Oh, I'm surprised to learn that. At least I see that
qib is not exposing IB_DEVICE_MEM_MGT_EXTENSIONS. But whats
the point in doing something with a IB_WR_REG_MR at all?



Given that this is not supported anyway, why does this patch
exist?


This patch exists to provide parity for what is in qib. Should we not have 
it?  If not, why do we have:


commit 38071a461f0a ("IB/qib: Support the new memory registration API")


That doesn't mean it can't be added to rdmavt as a future enhancement
though if there is a need.


Well, given that we're trying to consolidate on post send registration
interface it's kind of a must I'd say.


Are you asking because soft-roce will need it?


I was asking in general, but specifically soft-roce, as a consumer, will
need to support that, yes.


I think it makes sense to revisit when soft-roce comes in, since qib/hfi do 
not need IB_WR_LOCAL_INV.


-Denny
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: device attr cleanup

2015-12-16 Thread Or Gerlitz

On 12/16/2015 3:40 PM, Sagi Grimberg wrote:

I really don't have a strong preference on either of the approaches. I
just want to see this included one way or the other. 


sure, agree, I will send my patches tomorrow


Re: [PATCH v3 04/11] xprtrdma: Move struct ib_send_wr off the stack

2015-12-16 Thread Sagi Grimberg



diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 4197191..e60d817 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -206,6 +206,8 @@ struct rpcrdma_frmr {
enum rpcrdma_frmr_state fr_state;
struct work_struct  fr_work;
struct rpcrdma_xprt *fr_xprt;
+   struct ib_reg_wrfr_regwr;
+   struct ib_send_wr   fr_invwr;


Would it make sense to unionize these as they are guaranteed not to
execute together? Some people don't like this sort of savings.


Re: [PATCH v3 06/11] xprtrdma: Add ro_unmap_sync method for FRWR

2015-12-16 Thread Sagi Grimberg



+static void
+__frwr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
+int rc)
+{
+   struct ib_device *device = r_xprt->rx_ia.ri_device;
+   struct rpcrdma_mw *mw = seg->rl_mw;
+   int nsegs = seg->mr_nsegs;
+
+   seg->rl_mw = NULL;
+
+   while (nsegs--)
+   rpcrdma_unmap_one(device, seg++);


Chuck, shouldn't this be replaced with ib_dma_unmap_sg?

Sorry for the late comment (Didn't find enough time to properly
review this...)


Re: [PATCH v3 09/11] SUNRPC: Introduce xprt_commit_rqst()

2015-12-16 Thread Anna Schumaker
Hi Chuck,

Sorry for the last minute comment.

On 12/14/2015 04:19 PM, Chuck Lever wrote:
> I'm about to add code in the RPC/RDMA reply handler between the
> xprt_lookup_rqst() and xprt_complete_rqst() call site that needs
> to execute outside of spinlock critical sections.
> 
> Add a hook to remove an rpc_rqst from the pending list once
> the transport knows it's going to invoke xprt_complete_rqst().
> 
> Signed-off-by: Chuck Lever 
> ---
>  include/linux/sunrpc/xprt.h|1 +
>  net/sunrpc/xprt.c  |   14 ++
>  net/sunrpc/xprtrdma/rpc_rdma.c |4 
>  3 files changed, 19 insertions(+)
> 
> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
> index 69ef5b3..ab6c3a5 100644
> --- a/include/linux/sunrpc/xprt.h
> +++ b/include/linux/sunrpc/xprt.h
> @@ -366,6 +366,7 @@ void  
> xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action);
>  void xprt_write_space(struct rpc_xprt *xprt);
>  void xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task 
> *task, int result);
>  struct rpc_rqst *xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
> +void xprt_commit_rqst(struct rpc_task *task);
>  void xprt_complete_rqst(struct rpc_task *task, int copied);
>  void xprt_release_rqst_cong(struct rpc_task *task);
>  void xprt_disconnect_done(struct rpc_xprt *xprt);
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index 2e98f4a..a5be4ab 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task)
>  }
>  
>  /**
> + * xprt_commit_rqst - remove rqst from pending list early
> + * @task: RPC request to remove
> + *
> + * Caller holds transport lock.
> + */
> +void xprt_commit_rqst(struct rpc_task *task)
> +{
> + struct rpc_rqst *req = task->tk_rqstp;
> +
> + list_del_init(&req->rq_list);
> +}
> +EXPORT_SYMBOL_GPL(xprt_commit_rqst);

Can you move this function into the xprtrdma code, since it's not called 
outside of there?

Thanks,
Anna

> +
> +/**
>   * xprt_complete_rqst - called when reply processing is complete
>   * @task: RPC request that recently completed
>   * @copied: actual number of bytes received from the transport
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index c10d969..0bc8c39 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -804,6 +804,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep)
>   if (req->rl_reply)
>   goto out_duplicate;
>  
> + xprt_commit_rqst(rqst->rq_task);
> + spin_unlock_bh(&xprt->transport_lock);
> +
>   dprintk("RPC:   %s: reply 0x%p completes request 0x%p\n"
>   "   RPC request 0x%p xid 0x%08x\n",
>   __func__, rep, req, rqst,
> @@ -894,6 +897,7 @@ badheader:
>   else if (credits > r_xprt->rx_buf.rb_max_requests)
>   credits = r_xprt->rx_buf.rb_max_requests;
>  
> + spin_lock_bh(&xprt->transport_lock);
>   cwnd = xprt->cwnd;
>   xprt->cwnd = credits << RPC_CWNDSHIFT;
>   if (xprt->cwnd > cwnd)
> 



Re: device attr cleanup

2015-12-16 Thread Sagi Grimberg




Hi Doug,

Let's stop beating, both horses and people.

I do understand that

1. you don't like the removal of the attr
2. you do like the removal of all the query calls

I am proposing to take the path of a patch that
does exactly #2 while avoiding #1.


I really don't have a strong preference on either of the approaches. I
just want to see this included one way or the other.


Re: [PATCH 37/37] IB/rdmavt: Add support for new memory registration API

2015-12-16 Thread Sagi Grimberg



This question is not directly related to this patch, but given that
this is a copy-paste from the qib driver I'll go ahead and take it
anyway. How does qib (and rvt now) do memory key invalidation? I didn't
see any reference to IB_WR_LOCAL_INV anywhere in the qib driver...

What am I missing?


ping?


In short, it doesn't look like qib or hfi1 support this.


Oh, I'm surprised to learn that. At least I see that
qib is not exposing IB_DEVICE_MEM_MGT_EXTENSIONS. But what's
the point in doing anything with an IB_WR_REG_MR at all?

Given that this is not supported anyway, why does this patch
exist?


That doesn't mean it can't be added to rdmavt as a future enhancement
though if there is a need.


Well, given that we're trying to consolidate on post send registration
interface it's kind of a must I'd say.


Are you asking because soft-roce will need it?


I was asking in general, but specifically soft-roce, as a consumer, will
need to support that, yes.


Re: [PATCH v3 00/11] NFS/RDMA client patches for 4.5

2015-12-16 Thread Devesh Sharma
Hi Chuck,

iozone passed on the ocrdma device. Link bounce fails to recover iozone
traffic; however, the failure is not related to this patch series. I am in
the process of finding out which patch broke it.

Tested-By: Devesh Sharma 

On Wed, Dec 16, 2015 at 1:07 AM, Anna Schumaker
 wrote:
> Thanks, Chuck!
>
> Everything looks okay to me, so I'll apply these patches and send them to 
> Trond before the holiday.
>
> On 12/14/2015 04:17 PM, Chuck Lever wrote:
>> For 4.5, I'd like to address the send queue accounting and
>> invalidation/unmap ordering issues Jason brought up a couple of
>> months ago.
>>
>> In preparation for Doug's final topic branch, Anna, I've rebased
>> these on Christoph's ib_device_attr branch, but there were no merge
>> conflicts or other changes needed. Could you begin preparing these
>> for linux-next and other final testing and review?
>
> No merge conflicts is nice, and we might not need to worry about ordering the 
> pull request.
>
> Thanks,
> Anna
>
>>
>> Also available in the "nfs-rdma-for-4.5" topic branch of this git repo:
>>
>> git://git.linux-nfs.org/projects/cel/cel-2.6.git
>>
>> Or for browsing:
>>
>> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.5
>>
>>
>> Changes since v2:
>> - Rebased on Christoph's ib_device_attr branch
>>
>>
>> Changes since v1:
>>
>> - Rebased on v4.4-rc3
>> - Receive buffer safety margin patch dropped
>> - Backchannel pr_err and pr_info converted to dprintk
>> - Backchannel spin locks converted to work queue-safe locks
>> - Fixed premature release of backchannel request buffer
>> - NFSv4.1 callbacks tested with for-4.5 server
>>
>> ---
>>
>> Chuck Lever (11):
>>   xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)
>>   xprtrdma: xprt_rdma_free() must not release backchannel reqs
>>   xprtrdma: Disable RPC/RDMA backchannel debugging messages
>>   xprtrdma: Move struct ib_send_wr off the stack
>>   xprtrdma: Introduce ro_unmap_sync method
>>   xprtrdma: Add ro_unmap_sync method for FRWR
>>   xprtrdma: Add ro_unmap_sync method for FMR
>>   xprtrdma: Add ro_unmap_sync method for all-physical registration
>>   SUNRPC: Introduce xprt_commit_rqst()
>>   xprtrdma: Invalidate in the RPC reply handler
>>   xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').
>>
>>
>>  include/linux/sunrpc/xprt.h|1
>>  net/sunrpc/xprt.c  |   14 +++
>>  net/sunrpc/xprtrdma/backchannel.c  |   22 ++---
>>  net/sunrpc/xprtrdma/fmr_ops.c  |   64 +
>>  net/sunrpc/xprtrdma/frwr_ops.c |  175 +++-
>>  net/sunrpc/xprtrdma/physical_ops.c |   13 +++
>>  net/sunrpc/xprtrdma/rpc_rdma.c |   14 +++
>>  net/sunrpc/xprtrdma/transport.c|3 +
>>  net/sunrpc/xprtrdma/verbs.c|   13 +--
>>  net/sunrpc/xprtrdma/xprt_rdma.h|   12 +-
>>  10 files changed, 283 insertions(+), 48 deletions(-)
>>
>> --
>> Chuck Lever
>>
>


Re: [PATCH v4 00/11] NFS/RDMA server patches for v4.5

2015-12-16 Thread Devesh Sharma
iozone passed on the ocrdma device. Link bounce fails to recover iozone
traffic; however, the failure is not related to this patch series. I am in
the process of finding out which patch broke it.

Tested-By: Devesh Sharma 

On Tue, Dec 15, 2015 at 3:00 AM, Chuck Lever  wrote:
> Here are patches to support server-side bi-directional RPC/RDMA
> operation (to enable NFSv4.1 on RPC/RDMA transports). Thanks to
> all who reviewed v1, v2, and v3. This version has some significant
> changes since the previous one.
>
> In preparation for Doug's final topic branch, Bruce, I've rebased
> these on Christoph's ib_device_attr branch. There were some merge
> conflicts which I've fixed and tested. These are ready for your
> review.
>
> Also available in the "nfsd-rdma-for-4.5" topic branch of this git repo:
>
> git://git.linux-nfs.org/projects/cel/cel-2.6.git
>
> Or for browsing:
>
> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-for-4.5
>
>
> Changes since v3:
> - Rebased on Christoph's ib_device_attr branch
> - Backchannel patches have been squashed together
> - Memory allocation overhaul to prevent blocking allocation
>   when sending backchannel calls
>
>
> Changes since v2:
> - Rebased on v4.4-rc4
> - Backchannel code in new source file to address dprintk issues
> - svc_rdma_get_context() now uses a pre-allocated cache
> - Dropped svc_rdma_send clean up
>
>
> Changes since v1:
>
> - Rebased on v4.4-rc3
> - Removed the use of CONFIG_SUNRPC_BACKCHANNEL
> - Fixed computation of forward and backward max_requests
> - Updated some comments and patch descriptions
> - pr_err and pr_info converted to dprintk
> - Simplified svc_rdma_get_context()
> - Dropped patch removing access_flags field
> - NFSv4.1 callbacks tested with for-4.5 client
>
> ---
>
> Chuck Lever (11):
>   svcrdma: Do not send XDR roundup bytes for a write chunk
>   svcrdma: Clean up rdma_create_xprt()
>   svcrdma: Clean up process_context()
>   svcrdma: Improve allocation of struct svc_rdma_op_ctxt
>   svcrdma: Improve allocation of struct svc_rdma_req_map
>   svcrdma: Remove unused req_map and ctxt kmem_caches
>   svcrdma: Add gfp flags to svc_rdma_post_recv()
>   svcrdma: Remove last two __GFP_NOFAIL call sites
>   svcrdma: Make map_xdr non-static
>   svcrdma: Define maximum number of backchannel requests
>   svcrdma: Add class for RDMA backwards direction transport
>
>
>  include/linux/sunrpc/svc_rdma.h|   37 ++-
>  net/sunrpc/xprt.c  |1
>  net/sunrpc/xprtrdma/Makefile   |2
>  net/sunrpc/xprtrdma/svc_rdma.c |   41 ---
>  net/sunrpc/xprtrdma/svc_rdma_backchannel.c |  371 
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c|   52 
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c  |   34 ++-
>  net/sunrpc/xprtrdma/svc_rdma_transport.c   |  284 -
>  net/sunrpc/xprtrdma/transport.c|   30 +-
>  net/sunrpc/xprtrdma/xprt_rdma.h|   20 +-
>  10 files changed, 730 insertions(+), 142 deletions(-)
>  create mode 100644 net/sunrpc/xprtrdma/svc_rdma_backchannel.c
>
> --
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-16 Thread Liran Liss

> > Since you and Jason did not reach a consensus, I have to dig in and
> > see if these patches make it possible to break namespace confinement,
> > either accidentally or with intentionally tricky behavior.  That's
> > going to take me some time.
> 
> Everything to do with parsing a wc and constructing an AH is wrong in this
> series, and the fixes require the API changes I mentioned ( add 'wc to gid
> index' API call, add 'route to AH' API call)
> 
> Every time you read 'route validation' - that is an error, the route should
> never just be validated, it is needed information to construct a rocev2 AH. 
> All
> the places that roughly hand parse the rocev2 WC should not be open coded.
> 
> Even if current HW is broken for namespaces we should not enshrine that in
> the kapi.
>

Currently, namespaces are not supported for RoCE.
So for these patches, this is irrelevant.
That said, we have everything we need for RoCE namespace support when we get 
there.

All of this has nothing to do with "broken" and enshrining anything in the kapi.
That's just bullshit.

The crux of the discussion is the meaning of the API.
The design of the RDMA stack is that Verbs are used by core IB services, such 
as addressing.
For these services, as the specification requires, all relevant fields must be 
reported in the CQE, period.
All spec-compliant HW devices follow this.

If a ULP wants to create an address handle from a completion, there are service 
routines to accomplish that, based on the reported fields.
If it doesn't care, there is no reason to sacrifice performance.

--Liran



Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path

2015-12-16 Thread Michael Wang

On 12/15/2015 06:30 PM, Jason Gunthorpe wrote:
> On Tue, Dec 15, 2015 at 05:38:34PM +0100, Michael Wang wrote:
>> The hop_limit only suggests that the packet is allowed to be
>> routed, not that it has to be, correct?
> 
> If the hop limit is >= 2 (?) then the GRH is mandatory. The
> SM will return this information in the PathRecord if it determines a
> GRH is required. The whole stack follows this protocol.
> 
> The GRH is optional for in-subnet communications.

Thanks for the explanation :-)

I've rechecked ib_init_ah_from_path() again, and found it
still sets IB_AH_GRH when the GID cache lookup misses, but with:

  grh.sgid_index = 0
  grh.flow_label = 0
  grh.hop_limit  = 0
  grh.traffic_class = 0

Not sure if it's just coincidence: hop_limit is 0, so a router
would discard the packet and the GRH won't be used, yet the
in-subnet transaction still works.

Could this be designed as an optimization for cases like when the
SM is reassigning the GID?

BTW, cma_sidr_rep_handler() also calls ib_init_ah_from_path() without
checking the return value.

Regards,
Michael Wang

> 
> Jason
> 