[PATCH 2/2] IB/mlx4: Convert kmalloc to kmalloc_array to fix checkpatch warnings
From: Leon Romanovsky

Convert kmalloc to kmalloc_array to fix the warnings below:

WARNING: Prefer kmalloc_array over kmalloc with multiply
+	qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof(u64),
WARNING: Prefer kmalloc_array over kmalloc with multiply
+	qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof(u64),
WARNING: Prefer kmalloc_array over kmalloc with multiply
+	srq->wrid = kmalloc(srq->msrq.max * sizeof(u64),

Signed-off-by: Leon Romanovsky
Reviewed-by: Or Gerlitz
---
 drivers/infiniband/hw/mlx4/qp.c  | 4 ++--
 drivers/infiniband/hw/mlx4/srq.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index dc86975fe1a9..70de13ed9da7 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -796,12 +796,12 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 		if (err)
 			goto err_mtt;
 
-		qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof(u64),
+		qp->sq.wrid = kmalloc_array(qp->sq.wqe_cnt, sizeof(u64),
 					gfp | __GFP_NOWARN);
 		if (!qp->sq.wrid)
 			qp->sq.wrid = __vmalloc(qp->sq.wqe_cnt * sizeof(u64),
 						gfp, PAGE_KERNEL);
-		qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof(u64),
+		qp->rq.wrid = kmalloc_array(qp->rq.wqe_cnt, sizeof(u64),
 					gfp | __GFP_NOWARN);
 		if (!qp->rq.wrid)
 			qp->rq.wrid = __vmalloc(qp->rq.wqe_cnt * sizeof(u64),
diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c
index f416c7463827..68d5a5fda271 100644
--- a/drivers/infiniband/hw/mlx4/srq.c
+++ b/drivers/infiniband/hw/mlx4/srq.c
@@ -171,7 +171,7 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd,
 		if (err)
 			goto err_mtt;
 
-		srq->wrid = kmalloc(srq->msrq.max * sizeof(u64),
+		srq->wrid = kmalloc_array(srq->msrq.max, sizeof(u64),
 					GFP_KERNEL | __GFP_NOWARN);
 		if (!srq->wrid) {
 			srq->wrid = __vmalloc(srq->msrq.max * sizeof(u64),
--
1.7.12.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] IB/mlx4: Suppress memory allocations warnings in kmalloc->__vmalloc flows
[PATCH 1/2] IB/mlx4: Suppress memory allocation warnings in kmalloc->__vmalloc flows

From: Leon Romanovsky

A failing kmalloc() allocation throws a warning about the failure.
Such warnings are no longer needed, since commit 0ef2f05c7e02
("IB/mlx4: Use vmalloc for WR buffers when needed") added a fallback
mechanism from kmalloc() to __vmalloc().

Signed-off-by: Leon Romanovsky
Reviewed-by: Or Gerlitz
---
 drivers/infiniband/hw/mlx4/qp.c  | 6 ++++--
 drivers/infiniband/hw/mlx4/srq.c | 3 ++-
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 13eaaf45288f..dc86975fe1a9 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -796,11 +796,13 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 		if (err)
 			goto err_mtt;
 
-		qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof(u64), gfp);
+		qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof(u64),
+				      gfp | __GFP_NOWARN);
 		if (!qp->sq.wrid)
 			qp->sq.wrid = __vmalloc(qp->sq.wqe_cnt * sizeof(u64),
 						gfp, PAGE_KERNEL);
-		qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof(u64), gfp);
+		qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof(u64),
+				      gfp | __GFP_NOWARN);
 		if (!qp->rq.wrid)
 			qp->rq.wrid = __vmalloc(qp->rq.wqe_cnt * sizeof(u64),
 						gfp, PAGE_KERNEL);
diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c
index 8d133c40fa0e..f416c7463827 100644
--- a/drivers/infiniband/hw/mlx4/srq.c
+++ b/drivers/infiniband/hw/mlx4/srq.c
@@ -171,7 +171,8 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd,
 		if (err)
 			goto err_mtt;
 
-	srq->wrid = kmalloc(srq->msrq.max * sizeof (u64), GFP_KERNEL);
+	srq->wrid = kmalloc(srq->msrq.max * sizeof(u64),
+			    GFP_KERNEL | __GFP_NOWARN);
 	if (!srq->wrid) {
 		srq->wrid = __vmalloc(srq->msrq.max * sizeof(u64),
 				      GFP_KERNEL, PAGE_KERNEL);
--
1.7.12.4
[PATCH 14/14] staging/rdma/hfi1: Enable TID caching feature
From: Mitko Haralanov

This commit "flips the switch" on the TID caching feature implemented
in this patch series. As well as enabling the new feature by tying the
new functions into the PSM API, it also cleans up the old, now
unneeded code, data structure members, and variables.

Due to differences in operation and information, the tracing functions
related to expected receives had to be changed. This patch includes
those changes. The tracing function changes could not be split into a
separate commit without including both tracing variants at the same
time, which would have caused other complications and ugliness.

Reviewed-by: Ira Weiny
Signed-off-by: Mitko Haralanov
---
 drivers/staging/rdma/hfi1/file_ops.c     | 448 +++
 drivers/staging/rdma/hfi1/hfi.h          | 14 -
 drivers/staging/rdma/hfi1/init.c         | 3 -
 drivers/staging/rdma/hfi1/trace.h        | 132 +
 drivers/staging/rdma/hfi1/user_exp_rcv.c | 12 +
 drivers/staging/rdma/hfi1/user_pages.c   | 14 -
 include/uapi/rdma/hfi/hfi1_user.h        | 7 +-
 7 files changed, 132 insertions(+), 498 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/file_ops.c b/drivers/staging/rdma/hfi1/file_ops.c
index b0348263b901..d36588934f99 100644
--- a/drivers/staging/rdma/hfi1/file_ops.c
+++ b/drivers/staging/rdma/hfi1/file_ops.c
@@ -96,9 +96,6 @@ static int user_event_ack(struct hfi1_ctxtdata *, int, unsigned long);
 static int set_ctxt_pkey(struct hfi1_ctxtdata *, unsigned, u16);
 static int manage_rcvq(struct hfi1_ctxtdata *, unsigned, int);
 static int vma_fault(struct vm_area_struct *, struct vm_fault *);
-static int exp_tid_setup(struct file *, struct hfi1_tid_info *);
-static int exp_tid_free(struct file *, struct hfi1_tid_info *);
-static void unlock_exp_tids(struct hfi1_ctxtdata *);
 
 static const struct file_operations hfi1_file_ops = {
 	.owner = THIS_MODULE,
@@ -188,6 +185,7 @@ static ssize_t hfi1_file_write(struct file *fp, const char __user *data,
 	struct hfi1_cmd cmd;
 	struct hfi1_user_info uinfo;
 	struct hfi1_tid_info tinfo;
+	unsigned long addr;
 	ssize_t consumed = 0, copy = 0, ret = 0;
 	void *dest = NULL;
 	__u64 user_val = 0;
@@ -219,6 +217,7 @@ static ssize_t hfi1_file_write(struct file *fp, const char __user *data,
 		break;
 	case HFI1_CMD_TID_UPDATE:
 	case HFI1_CMD_TID_FREE:
+	case HFI1_CMD_TID_INVAL_READ:
 		copy = sizeof(tinfo);
 		dest = &tinfo;
 		break;
@@ -241,7 +240,6 @@ static ssize_t hfi1_file_write(struct file *fp, const char __user *data,
 		must_be_root = 1;	/* validate user */
 		copy = 0;
 		break;
-	case HFI1_CMD_TID_INVAL_READ:
 	default:
 		ret = -EINVAL;
 		goto bail;
@@ -295,9 +293,8 @@ static ssize_t hfi1_file_write(struct file *fp, const char __user *data,
 		sc_return_credits(uctxt->sc);
 		break;
 	case HFI1_CMD_TID_UPDATE:
-		ret = exp_tid_setup(fp, &tinfo);
+		ret = hfi1_user_exp_rcv_setup(fp, &tinfo);
 		if (!ret) {
-			unsigned long addr;
 			/*
 			 * Copy the number of tidlist entries we used
 			 * and the length of the buffer we registered.
@@ -312,8 +309,25 @@ static ssize_t hfi1_file_write(struct file *fp, const char __user *data,
 				ret = -EFAULT;
 		}
 		break;
+	case HFI1_CMD_TID_INVAL_READ:
+		ret = hfi1_user_exp_rcv_invalid(fp, &tinfo);
+		if (ret)
+			break;
+		addr = (unsigned long)cmd.addr +
+			offsetof(struct hfi1_tid_info, tidcnt);
+		if (copy_to_user((void __user *)addr, &tinfo.tidcnt,
+				 sizeof(tinfo.tidcnt)))
+			ret = -EFAULT;
+		break;
 	case HFI1_CMD_TID_FREE:
-		ret = exp_tid_free(fp, &tinfo);
+		ret = hfi1_user_exp_rcv_clear(fp, &tinfo);
+		if (ret)
+			break;
+		addr = (unsigned long)cmd.addr +
+			offsetof(struct hfi1_tid_info, tidcnt);
+		if (copy_to_user((void __user *)addr, &tinfo.tidcnt,
+				 sizeof(tinfo.tidcnt)))
+			ret = -EFAULT;
 		break;
 	case HFI1_CMD_RECV_CTRL:
 		ret = manage_rcvq(uctxt, fd->subctxt, (int)user_val);
@@ -779,12 +793,9 @@ static int hfi1_file_close(struct inode *inode, struct file *fp)
 	uctxt->pionowait = 0;
 	uctxt->event_flags = 0;
 
-	hfi1_clear_tids(uctxt);
+	hfi1_user_exp_rcv_free(fdata);
 	hfi1_clear_ctxt_pkey(dd, uctxt->ctxt);
 
-	if (uctxt->tid_pg_list)
-		unlock_exp_tids(uctxt);
-	hfi
[PATCH 13/14] staging/rdma/hfi1: Add TID entry program function body
From: Mitko Haralanov

The previous patch in the series added the free/invalidate function
bodies. Now it's time for the programming side.

This large function takes the user's buffer, breaks it up into
manageable chunks, allocates enough RcvArray groups, and programs the
chunks into the RcvArray entries in the hardware.

With this function, the TID caching functionality is implemented.
However, it is still unused. The switch will come in a later patch in
the series, which will remove the old functionality and switch the
driver over to TID caching.

Reviewed-by: Ira Weiny
Signed-off-by: Mitko Haralanov
---
 drivers/staging/rdma/hfi1/user_exp_rcv.c | 263 ++-
 1 file changed, 259 insertions(+), 4 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/user_exp_rcv.c b/drivers/staging/rdma/hfi1/user_exp_rcv.c
index 91950d225da5..6d21c1349b77 100644
--- a/drivers/staging/rdma/hfi1/user_exp_rcv.c
+++ b/drivers/staging/rdma/hfi1/user_exp_rcv.c
@@ -97,8 +97,7 @@ struct tid_pageset {
 
 static void unlock_exp_tids(struct hfi1_ctxtdata *, struct exp_tid_set *,
 			    struct rb_root *);
-static u32 find_phys_blocks(struct page **, unsigned,
-			    struct tid_pageset *) __maybe_unused;
+static u32 find_phys_blocks(struct page **, unsigned, struct tid_pageset *);
 static int set_rcvarray_entry(struct file *, unsigned long, u32,
 			      struct tid_group *, struct page **, unsigned);
 static inline int mmu_addr_cmp(struct mmu_rb_node *, unsigned long,
@@ -119,7 +118,7 @@ static inline void mmu_notifier_range_start(struct mmu_notifier *,
 					    unsigned long, unsigned long);
 static int program_rcvarray(struct file *, unsigned long, struct tid_group *,
 			    struct tid_pageset *, unsigned, u16, struct page **,
-			    u32 *, unsigned *, unsigned *) __maybe_unused;
+			    u32 *, unsigned *, unsigned *);
 static int unprogram_rcvarray(struct file *, u32, struct tid_group **);
 static void clear_tid_node(struct hfi1_filedata *, u16, struct mmu_rb_node *);
 
@@ -339,9 +338,265 @@ static inline void rcv_array_wc_fill(struct hfi1_devdata *dd, u32 index)
 	writeq(0, dd->rcvarray_wc + (index * 8));
 }
 
+/*
+ * RcvArray entry allocation for Expected Receives is done by the
+ * following algorithm:
+ *
+ * The context keeps 3 lists of groups of RcvArray entries:
+ *   1. List of empty groups - tid_group_list
+ *      This list is created during user context creation and
+ *      contains elements which describe sets (of 8) of empty
+ *      RcvArray entries.
+ *   2. List of partially used groups - tid_used_list
+ *      This list contains sets of RcvArray entries which are
+ *      not completely used up. Another mapping request could
+ *      use some or all of the remaining entries.
+ *   3. List of full groups - tid_full_list
+ *      This is the list where sets that are completely used
+ *      up go.
+ *
+ * An attempt to optimize the usage of RcvArray entries is
+ * made by finding all sets of physically contiguous pages in a
+ * user's buffer.
+ * These physically contiguous sets are further split into
+ * sizes supported by the receive engine of the HFI. The
+ * resulting sets of pages are stored in struct tid_pageset,
+ * which describes the sets as:
+ *    * .count - number of pages in this set
+ *    * .idx - starting index into struct page ** array
+ *             of this set
+ *
+ * From this point on, the algorithm deals with the page sets
+ * described above. The number of pagesets is divided by the
+ * RcvArray group size to produce the number of full groups
+ * needed.
+ *
+ * Groups from the 3 lists are manipulated using the following
+ * rules:
+ *   1. For each set of 8 pagesets, a complete group from
+ *      tid_group_list is taken, programmed, and moved to
+ *      the tid_full_list list.
+ *   2. For all remaining pagesets:
+ *      2.1 If the tid_used_list is empty and the tid_group_list
+ *          is empty, stop processing pagesets and return only
+ *          what has been programmed up to this point.
+ *      2.2 If the tid_used_list is empty and the tid_group_list
+ *          is not empty, move a group from tid_group_list to
+ *          tid_used_list.
+ *      2.3 For each group in tid_used_list, program as much as
+ *          can fit into the group. If the group becomes fully
+ *          used, move it to tid_full_list.
+ */
 int hfi1_user_exp_rcv_setup(struct file *fp, struct hfi1_tid_info *tinfo)
 {
-	return -EINVAL;
+	int ret = 0, need_group = 0, pinned;
+	struct hfi1_filedata *fd = fp->private_data;
+	struct hfi1_ctxtdata *uctxt = fd->uctxt;
+	struct hfi1_devdata *dd = uctxt->dd;
+	unsigned npages, ngroups, pageidx = 0, pageset_count, npagesets,
+		tididx = 0, mapped, mapped_pages = 0;
+	unsigned long vaddr = tinfo->vaddr;
+
[PATCH 11/14] staging/rdma/hfi1: Add MMU notifier callback function
From: Mitko Haralanov

TID caching will rely on the MMU notifier to be told when memory is
being invalidated. When the callback is called, the driver will find
all RcvArray entries that span the invalidated buffer and "schedule"
them to be freed by the PSM library.

This function is currently unused and is being added in preparation
for the TID caching feature.

Reviewed-by: Ira Weiny
Signed-off-by: Mitko Haralanov
---
 drivers/staging/rdma/hfi1/user_exp_rcv.c | 67 +++-
 1 file changed, 65 insertions(+), 2 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/user_exp_rcv.c b/drivers/staging/rdma/hfi1/user_exp_rcv.c
index b303182be08a..7996ce763adf 100644
--- a/drivers/staging/rdma/hfi1/user_exp_rcv.c
+++ b/drivers/staging/rdma/hfi1/user_exp_rcv.c
@@ -104,7 +104,7 @@ static int set_rcvarray_entry(struct file *, unsigned long, u32,
 static inline int mmu_addr_cmp(struct mmu_rb_node *, unsigned long,
 			       unsigned long);
 static struct mmu_rb_node *mmu_rb_search_by_addr(struct rb_root *,
-						 unsigned long) __maybe_unused;
+						 unsigned long);
 static inline struct mmu_rb_node *mmu_rb_search_by_entry(struct rb_root *,
 							 u32);
 static int mmu_rb_insert_by_addr(struct rb_root *, struct mmu_rb_node *);
@@ -656,7 +656,70 @@ static void mmu_notifier_mem_invalidate(struct mmu_notifier *mn,
 					unsigned long start, unsigned long end,
 					enum mmu_call_types type)
 {
-	/* Stub for now */
+	struct hfi1_filedata *fd = container_of(mn, struct hfi1_filedata, mn);
+	struct hfi1_ctxtdata *uctxt = fd->uctxt;
+	struct rb_root *root = &fd->tid_rb_root;
+	struct mmu_rb_node *node;
+	unsigned long addr = start;
+
+	spin_lock(&fd->rb_lock);
+	while (addr < end) {
+		node = mmu_rb_search_by_addr(root, addr);
+
+		if (!node) {
+			/*
+			 * Didn't find a node at this address. However, the
+			 * range could be bigger than what we have registered
+			 * so we have to keep looking.
+			 */
+			addr += PAGE_SIZE;
+			continue;
+		}
+
+		/*
+		 * The next address to be looked up is computed based
+		 * on the node's starting address. This is due to the
+		 * fact that the range where we start might be in the
+		 * middle of the node's buffer so simply incrementing
+		 * the address by the node's size would result in a
+		 * bad address.
+		 */
+		addr = node->virt + (node->npages * PAGE_SIZE);
+		if (node->freed)
+			continue;
+
+		node->freed = true;
+
+		spin_lock(&fd->invalid_lock);
+		if (fd->invalid_tid_idx < uctxt->expected_count) {
+			fd->invalid_tids[fd->invalid_tid_idx] =
+				rcventry2tidinfo(node->rcventry -
+						 uctxt->expected_base);
+			fd->invalid_tids[fd->invalid_tid_idx] |=
+				EXP_TID_SET(LEN, node->npages);
+			if (!fd->invalid_tid_idx) {
+				unsigned long *ev;
+
+				/*
+				 * hfi1_set_uevent_bits() sets a user event flag
+				 * for all processes. Because calling into the
+				 * driver to process TID cache invalidations is
+				 * expensive and TID cache invalidations are
+				 * handled on a per-process basis, we can
+				 * optimize this to set the flag only for the
+				 * process in question.
+				 */
+				ev = uctxt->dd->events +
+					(((uctxt->ctxt -
+					   uctxt->dd->first_user_ctxt) *
+					  HFI1_MAX_SHARED_CTXTS) + fd->subctxt);
+				set_bit(_HFI1_EVENT_TID_MMU_NOTIFY_BIT, ev);
+			}
+			fd->invalid_tid_idx++;
+		}
+		spin_unlock(&fd->invalid_lock);
+	}
+	spin_unlock(&fd->rb_lock);
 }
 
 static inline int mmu_addr_cmp(struct mmu_rb_node *node, unsigned long addr,
--
1.8.2
[PATCH 10/14] staging/rdma/hfi1: Add Expected receive init and free functions
From: Mitko Haralanov

The upcoming TID caching feature requires different data structures
and, by extension, different initialization for each of the MPI
processes. The two new functions (currently unused) perform the
required initialization and freeing of resources and structures.

Reviewed-by: Ira Weiny
Signed-off-by: Mitko Haralanov
---
 drivers/staging/rdma/hfi1/user_exp_rcv.c | 154 +--
 1 file changed, 144 insertions(+), 10 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/user_exp_rcv.c b/drivers/staging/rdma/hfi1/user_exp_rcv.c
index 0eb888fcaf70..b303182be08a 100644
--- a/drivers/staging/rdma/hfi1/user_exp_rcv.c
+++ b/drivers/staging/rdma/hfi1/user_exp_rcv.c
@@ -90,23 +90,25 @@ struct tid_pageset {
 
 #define EXP_TID_SET_EMPTY(set) (set.count == 0 && list_empty(&set.list))
 
+#define num_user_pages(vaddr, len)					\
+	(1 + (((((unsigned long)(vaddr) +				\
+		 (unsigned long)(len) - 1) & PAGE_MASK) -		\
+	       ((unsigned long)vaddr & PAGE_MASK)) >> PAGE_SHIFT))
+
 static void unlock_exp_tids(struct hfi1_ctxtdata *, struct exp_tid_set *,
-			    struct rb_root *) __maybe_unused;
+			    struct rb_root *);
 static u32 find_phys_blocks(struct page **, unsigned,
 			    struct tid_pageset *) __maybe_unused;
 static int set_rcvarray_entry(struct file *, unsigned long, u32,
-			      struct tid_group *, struct page **,
-			      unsigned) __maybe_unused;
+			      struct tid_group *, struct page **, unsigned);
 static inline int mmu_addr_cmp(struct mmu_rb_node *, unsigned long,
 			       unsigned long);
 static struct mmu_rb_node *mmu_rb_search_by_addr(struct rb_root *,
 						 unsigned long) __maybe_unused;
 static inline struct mmu_rb_node *mmu_rb_search_by_entry(struct rb_root *,
 							 u32);
-static int mmu_rb_insert_by_addr(struct rb_root *,
-				 struct mmu_rb_node *) __maybe_unused;
-static int mmu_rb_insert_by_entry(struct rb_root *,
-				  struct mmu_rb_node *) __maybe_unused;
+static int mmu_rb_insert_by_addr(struct rb_root *, struct mmu_rb_node *);
+static int mmu_rb_insert_by_entry(struct rb_root *, struct mmu_rb_node *);
 static void mmu_notifier_mem_invalidate(struct mmu_notifier *,
 					unsigned long, unsigned long,
 					enum mmu_call_types);
@@ -168,7 +170,7 @@ static inline void tid_group_move(struct tid_group *group,
 	tid_group_add_tail(group, s2);
 }
 
-static struct mmu_notifier_ops __maybe_unused mn_opts = {
+static struct mmu_notifier_ops mn_opts = {
 	.invalidate_page = mmu_notifier_page,
 	.invalidate_range_start = mmu_notifier_range_start,
 };
@@ -180,12 +182,144 @@ static struct mmu_notifier_ops __maybe_unused mn_opts = {
  */
 int hfi1_user_exp_rcv_init(struct file *fp)
 {
-	return -EINVAL;
+	struct hfi1_filedata *fd = fp->private_data;
+	struct hfi1_ctxtdata *uctxt = fd->uctxt;
+	struct hfi1_devdata *dd = uctxt->dd;
+	unsigned tidbase;
+	int i, ret = 0;
+
+	INIT_HLIST_NODE(&fd->mn.hlist);
+	spin_lock_init(&fd->rb_lock);
+	spin_lock_init(&fd->tid_lock);
+	spin_lock_init(&fd->invalid_lock);
+	fd->mn.ops = &mn_opts;
+	fd->tid_rb_root = RB_ROOT;
+
+	if (!uctxt->subctxt_cnt || !fd->subctxt) {
+		exp_tid_group_init(&uctxt->tid_group_list);
+		exp_tid_group_init(&uctxt->tid_used_list);
+		exp_tid_group_init(&uctxt->tid_full_list);
+
+		tidbase = uctxt->expected_base;
+		for (i = 0; i < uctxt->expected_count /
+			     dd->rcv_entries.group_size; i++) {
+			struct tid_group *grp;
+
+			grp = kzalloc(sizeof(*grp), GFP_KERNEL);
+			if (!grp) {
+				/*
+				 * If we fail here, the groups already
+				 * allocated will be freed by the close
+				 * call.
+				 */
+				ret = -ENOMEM;
+				goto done;
+			}
+			grp->size = dd->rcv_entries.group_size;
+			grp->base = tidbase;
+			tid_group_add_tail(grp, &uctxt->tid_group_list);
+			tidbase += dd->rcv_entries.group_size;
+		}
+	}
+
+	if (!HFI1_CAP_IS_USET(TID_UNMAP)) {
+		fd->invalid_tid_idx = 0;
+		fd->invalid_tids = kzalloc(uctxt->expected_count *
+
[PATCH 12/14] staging/rdma/hfi1: Add TID free/clear function bodies
From: Mitko Haralanov

Up to now, the functions which cleared the programmed TID entries and
gave PSM the list of invalidated TID entries were just stubs. With
this commit, the bodies of these functions are added.

This commit is a bit asymmetric as it only contains the free code
path. This is done on purpose to help with patch reviews as the
programming code path is much longer.

Reviewed-by: Ira Weiny
Signed-off-by: Mitko Haralanov
---
 drivers/staging/rdma/hfi1/user_exp_rcv.c | 91 +---
 1 file changed, 85 insertions(+), 6 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/user_exp_rcv.c b/drivers/staging/rdma/hfi1/user_exp_rcv.c
index 7996ce763adf..91950d225da5 100644
--- a/drivers/staging/rdma/hfi1/user_exp_rcv.c
+++ b/drivers/staging/rdma/hfi1/user_exp_rcv.c
@@ -120,10 +120,8 @@ static inline void mmu_notifier_range_start(struct mmu_notifier *,
 static int program_rcvarray(struct file *, unsigned long, struct tid_group *,
 			    struct tid_pageset *, unsigned, u16, struct page **,
 			    u32 *, unsigned *, unsigned *) __maybe_unused;
-static int unprogram_rcvarray(struct file *, u32,
-			      struct tid_group **) __maybe_unused;
-static void clear_tid_node(struct hfi1_filedata *, u16,
-			   struct mmu_rb_node *) __maybe_unused;
+static int unprogram_rcvarray(struct file *, u32, struct tid_group **);
+static void clear_tid_node(struct hfi1_filedata *, u16, struct mmu_rb_node *);
 
 static inline u32 rcventry2tidinfo(u32 rcventry)
 {
@@ -264,6 +262,7 @@ int hfi1_user_exp_rcv_init(struct file *fp)
 	 * Make sure that we set the tid counts only after successful
 	 * init.
 	 */
+	spin_lock(&fd->tid_lock);
 	if (uctxt->subctxt_cnt && !HFI1_CAP_IS_USET(TID_UNMAP)) {
 		u16 remainder;
 
@@ -274,6 +273,7 @@ int hfi1_user_exp_rcv_init(struct file *fp)
 	} else {
 		fd->tid_limit = uctxt->expected_count;
 	}
+	spin_unlock(&fd->tid_lock);
 done:
 	return ret;
 }
@@ -346,12 +346,91 @@ int hfi1_user_exp_rcv_setup(struct file *fp, struct hfi1_tid_info *tinfo)
 
 int hfi1_user_exp_rcv_clear(struct file *fp, struct hfi1_tid_info *tinfo)
 {
-	return -EINVAL;
+	int ret = 0;
+	struct hfi1_filedata *fd = fp->private_data;
+	struct hfi1_ctxtdata *uctxt = fd->uctxt;
+	u32 *tidinfo;
+	unsigned tididx;
+
+	tidinfo = kcalloc(tinfo->tidcnt, sizeof(*tidinfo), GFP_KERNEL);
+	if (!tidinfo)
+		return -ENOMEM;
+
+	if (copy_from_user(tidinfo, (void __user *)(unsigned long)
+			   tinfo->tidlist, sizeof(tidinfo[0]) *
+			   tinfo->tidcnt)) {
+		ret = -EFAULT;
+		goto done;
+	}
+
+	mutex_lock(&uctxt->exp_lock);
+	for (tididx = 0; tididx < tinfo->tidcnt; tididx++) {
+		ret = unprogram_rcvarray(fp, tidinfo[tididx], NULL);
+		if (ret) {
+			hfi1_cdbg(TID, "Failed to unprogram rcv array %d",
+				  ret);
+			break;
+		}
+	}
+	spin_lock(&fd->tid_lock);
+	fd->tid_used -= tididx;
+	spin_unlock(&fd->tid_lock);
+	tinfo->tidcnt = tididx;
+	mutex_unlock(&uctxt->exp_lock);
+done:
+	kfree(tidinfo);
+	return ret;
 }
 
 int hfi1_user_exp_rcv_invalid(struct file *fp, struct hfi1_tid_info *tinfo)
 {
-	return -EINVAL;
+	struct hfi1_filedata *fd = fp->private_data;
+	struct hfi1_ctxtdata *uctxt = fd->uctxt;
+	unsigned long *ev = uctxt->dd->events +
+		(((uctxt->ctxt - uctxt->dd->first_user_ctxt) *
+		  HFI1_MAX_SHARED_CTXTS) + fd->subctxt);
+	u32 *array;
+	int ret = 0;
+
+	if (!fd->invalid_tids)
+		return -EINVAL;
+
+	/*
+	 * copy_to_user() can sleep, which will leave the invalid_lock
+	 * locked and cause the MMU notifier to be blocked on the lock
+	 * for a long time.
+	 * Copy the data to a local buffer so we can release the lock.
+	 */
+	array = kcalloc(uctxt->expected_count, sizeof(*array), GFP_KERNEL);
+	if (!array)
+		return -EFAULT;
+
+	spin_lock(&fd->invalid_lock);
+	if (fd->invalid_tid_idx) {
+		memcpy(array, fd->invalid_tids, sizeof(*array) *
+		       fd->invalid_tid_idx);
+		memset(fd->invalid_tids, 0, sizeof(*fd->invalid_tids) *
+		       fd->invalid_tid_idx);
+		tinfo->tidcnt = fd->invalid_tid_idx;
+		fd->invalid_tid_idx = 0;
+		/*
+		 * Reset the user flag while still holding the lock.
+		 * Otherwise, PSM can miss events.
+		 */
+		clear_bit(_HFI1_EVENT_TID_MMU_NOTIFY_BIT, ev);
+	} else {
+		tinfo->tidcnt = 0;
+	}
+	spi
[PATCH 02/14] uapi/rdma/hfi/hfi1_user.h: Correct comment for capability bit
From: Mitko Haralanov

The HFI1_CAP_TID_UNMAP comment was incorrectly implying the opposite
of what the capability actually does. Correct this error.

Reviewed-by: Ira Weiny
Signed-off-by: Mitko Haralanov
---
 include/uapi/rdma/hfi/hfi1_user.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/rdma/hfi/hfi1_user.h b/include/uapi/rdma/hfi/hfi1_user.h
index 288694e422fb..cf172718e3d5 100644
--- a/include/uapi/rdma/hfi/hfi1_user.h
+++ b/include/uapi/rdma/hfi/hfi1_user.h
@@ -93,7 +93,7 @@
 #define HFI1_CAP_MULTI_PKT_EGR    (1UL << 7)  /* Enable multi-packet Egr buffs*/
 #define HFI1_CAP_NODROP_RHQ_FULL  (1UL << 8)  /* Don't drop on Hdr Q full */
 #define HFI1_CAP_NODROP_EGR_FULL  (1UL << 9)  /* Don't drop on EGR buffs full */
-#define HFI1_CAP_TID_UNMAP        (1UL << 10) /* Enable Expected TID caching */
+#define HFI1_CAP_TID_UNMAP        (1UL << 10) /* Disable Expected TID caching */
 #define HFI1_CAP_PRINT_UNIMPL     (1UL << 11) /* Show for unimplemented feats */
 #define HFI1_CAP_ALLOW_PERM_JKEY  (1UL << 12) /* Allow use of permissive JKEY */
 #define HFI1_CAP_NO_INTEGRITY     (1UL << 13) /* Enable ctxt integrity checks */
--
1.8.2
[PATCH 06/14] staging/rdma/hfi1: Remove unneeded variable
From: Mitko Haralanov

There is no need to use a separate variable for a return value and a
label when returning right away would do just as well.

Reviewed-by: Ira Weiny
Signed-off-by: Mitko Haralanov
---
 drivers/staging/rdma/hfi1/file_ops.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/file_ops.c b/drivers/staging/rdma/hfi1/file_ops.c
index c66693532be0..76fe60315bb4 100644
--- a/drivers/staging/rdma/hfi1/file_ops.c
+++ b/drivers/staging/rdma/hfi1/file_ops.c
@@ -1037,22 +1037,19 @@ static int allocate_ctxt(struct file *fp, struct hfi1_devdata *dd,
 static int init_subctxts(struct hfi1_ctxtdata *uctxt,
 			 const struct hfi1_user_info *uinfo)
 {
-	int ret = 0;
 	unsigned num_subctxts;
 
 	num_subctxts = uinfo->subctxt_cnt;
-	if (num_subctxts > HFI1_MAX_SHARED_CTXTS) {
-		ret = -EINVAL;
-		goto bail;
-	}
+	if (num_subctxts > HFI1_MAX_SHARED_CTXTS)
+		return -EINVAL;
 
 	uctxt->subctxt_cnt = uinfo->subctxt_cnt;
 	uctxt->subctxt_id = uinfo->subctxt_id;
 	uctxt->active_slaves = 1;
 	uctxt->redirect_seq_cnt = 1;
 	set_bit(HFI1_CTXT_MASTER_UNINIT, &uctxt->event_flags);
-bail:
-	return ret;
+
+	return 0;
 }
 
 static int setup_subctxt(struct hfi1_ctxtdata *uctxt)
--
1.8.2
[PATCH 05/14] staging/rdma/hfi1: Add definitions needed for TID caching support
From: Mitko Haralanov

In preparation for adding the TID caching support, there is a set of
headers, structures, and variables which will be needed. This commit
adds them to the hfi.h header file.

Reviewed-by: Ira Weiny
Signed-off-by: Mitko Haralanov
---
 drivers/staging/rdma/hfi1/hfi.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/staging/rdma/hfi1/hfi.h b/drivers/staging/rdma/hfi1/hfi.h
index 057a41cee734..996dd520cf41 100644
--- a/drivers/staging/rdma/hfi1/hfi.h
+++ b/drivers/staging/rdma/hfi1/hfi.h
@@ -179,6 +179,11 @@ struct ctxt_eager_bufs {
 	} *rcvtids;
 };
 
+struct exp_tid_set {
+	struct list_head list;
+	u32 count;
+};
+
 struct hfi1_ctxtdata {
 	/* shadow the ctxt's RcvCtrl register */
 	u64 rcvctrl;
@@ -247,6 +252,11 @@ struct hfi1_ctxtdata {
 	struct page **tid_pg_list;
 	/* dma handles for exp tid pages */
 	dma_addr_t *physshadow;
+
+	struct exp_tid_set tid_group_list;
+	struct exp_tid_set tid_used_list;
+	struct exp_tid_set tid_full_list;
+
 	/* lock protecting all Expected TID data */
 	spinlock_t exp_lock;
 	/* number of pio bufs for this ctxt (all procs, if shared) */
@@ -1138,6 +1148,16 @@ struct hfi1_filedata {
 	struct hfi1_user_sdma_pkt_q *pq;
 	/* for cpu affinity; -1 if none */
 	int rec_cpu_num;
+	struct mmu_notifier mn;
+	struct rb_root tid_rb_root;
+	spinlock_t tid_lock; /* protect tid_[limit,used] counters */
+	u32 tid_limit;
+	u32 tid_used;
+	spinlock_t rb_lock; /* protect tid_rb_root RB tree */
+	u32 *invalid_tids;
+	u32 invalid_tid_idx;
+	spinlock_t invalid_lock; /* protect the invalid_tids array */
+	int (*mmu_rb_insert)(struct rb_root *, struct mmu_rb_node *);
 };
 
 extern struct list_head hfi1_dev_list;
--
1.8.2
[PATCH 07/14] staging/rdma/hfi1: Add definitions and support functions for TID groups
From: Mitko Haralanov

Definitions and functions used to manage sets of TID/RcvArray groups.
These will be used by the TID caching functionality coming with later
patches.

TID groups (or RcvArray groups) are groups of TID/RcvArray entries
organized in sets of 8 and aligned on cacheline boundaries. The
TID/RcvArray entries are managed in this way to make taking advantage
of write-combining easier - each group is an entire cacheline.

rcv_array_wc_fill() is provided to allow generating writes to TIDs
which are not currently being used in order to cause the flush of the
write-combining buffer.

Reviewed-by: Ira Weiny
Signed-off-by: Mitko Haralanov
---
 drivers/staging/rdma/hfi1/user_exp_rcv.c | 64 ++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

diff --git a/drivers/staging/rdma/hfi1/user_exp_rcv.c b/drivers/staging/rdma/hfi1/user_exp_rcv.c
index bafeddf67c8f..7f15024daab9 100644
--- a/drivers/staging/rdma/hfi1/user_exp_rcv.c
+++ b/drivers/staging/rdma/hfi1/user_exp_rcv.c
@@ -52,6 +52,14 @@
 #include "user_exp_rcv.h"
 #include "trace.h"
 
+struct tid_group {
+	struct list_head list;
+	unsigned base;
+	u8 size;
+	u8 used;
+	u8 map;
+};
+
 struct mmu_rb_node {
 	struct rb_node rbnode;
 	unsigned long virt;
@@ -75,6 +83,8 @@ static const char * const mmu_types[] = {
 	"RANGE"
 };
 
+#define EXP_TID_SET_EMPTY(set) (set.count == 0 && list_empty(&set.list))
+
 static inline int mmu_addr_cmp(struct mmu_rb_node *, unsigned long,
 			       unsigned long);
 static struct mmu_rb_node *mmu_rb_search_by_addr(struct rb_root *,
@@ -94,6 +104,43 @@ static inline void mmu_notifier_range_start(struct mmu_notifier *,
 					    struct mm_struct *,
 					    unsigned long, unsigned long);
 
+static inline void exp_tid_group_init(struct exp_tid_set *set)
+{
+	INIT_LIST_HEAD(&set->list);
+	set->count = 0;
+}
+
+static inline void tid_group_remove(struct tid_group *grp,
+				    struct exp_tid_set *set)
+{
+	list_del_init(&grp->list);
+	set->count--;
+}
+
+static inline void tid_group_add_tail(struct tid_group *grp,
+				      struct exp_tid_set *set)
+{
+	list_add_tail(&grp->list, &set->list);
+	set->count++;
+}
+
+static inline struct tid_group *tid_group_pop(struct exp_tid_set *set)
+{
+	struct tid_group *grp =
+		list_first_entry(&set->list, struct tid_group, list);
+	list_del_init(&grp->list);
+	set->count--;
+	return grp;
+}
+
+static inline void tid_group_move(struct tid_group *group,
+				  struct exp_tid_set *s1,
+				  struct exp_tid_set *s2)
+{
+	tid_group_remove(group, s1);
+	tid_group_add_tail(group, s2);
+}
+
 static struct mmu_notifier_ops __maybe_unused mn_opts = {
 	.invalidate_page = mmu_notifier_page,
 	.invalidate_range_start = mmu_notifier_range_start,
@@ -114,6 +161,23 @@ int hfi1_user_exp_rcv_free(struct hfi1_filedata *fd)
 	return -EINVAL;
 }
 
+/*
+ * Write an "empty" RcvArray entry.
+ * This function exists so the TID registration code can use it
+ * to write to unused/unneeded entries and still take advantage
+ * of the WC performance improvements. The HFI will ignore this
+ * write to the RcvArray entry.
+ */
+static inline void rcv_array_wc_fill(struct hfi1_devdata *dd, u32 index)
+{
+	/*
+	 * Doing the WC fill writes only makes sense if the device is
+	 * present and the RcvArray has been mapped as WC memory.
+	 */
+	if ((dd->flags & HFI1_PRESENT) && dd->rcvarray_wc)
+		writeq(0, dd->rcvarray_wc + (index * 8));
+}
+
 int hfi1_user_exp_rcv_setup(struct file *fp, struct hfi1_tid_info *tinfo)
 {
 	return -EINVAL;
--
1.8.2
[PATCH 09/14] staging/rdma/hfi1: Convert lock to mutex
From: Mitko Haralanov The exp_lock lock does not need to be a spinlock as all its uses are in process context and allowing the process to sleep when the mutex is contended might be beneficial. Reviewed-by: Ira Weiny Signed-off-by: Mitko Haralanov --- drivers/staging/rdma/hfi1/file_ops.c | 12 ++-- drivers/staging/rdma/hfi1/hfi.h | 2 +- drivers/staging/rdma/hfi1/init.c | 2 +- 3 files changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/staging/rdma/hfi1/file_ops.c b/drivers/staging/rdma/hfi1/file_ops.c index 76fe60315bb4..b0348263b901 100644 --- a/drivers/staging/rdma/hfi1/file_ops.c +++ b/drivers/staging/rdma/hfi1/file_ops.c @@ -1611,14 +1611,14 @@ static int exp_tid_setup(struct file *fp, struct hfi1_tid_info *tinfo) * reserved, we don't need the lock anymore since we * are guaranteed the groups. */ - spin_lock(&uctxt->exp_lock); + mutex_lock(&uctxt->exp_lock); if (uctxt->tidusemap[useidx] == -1ULL || bitidx >= BITS_PER_LONG) { /* no free groups in the set, use the next */ useidx = (useidx + 1) % uctxt->tidmapcnt; idx++; bitidx = 0; - spin_unlock(&uctxt->exp_lock); + mutex_unlock(&uctxt->exp_lock); continue; } ngroups = ((npages - mapped) / dd->rcv_entries.group_size) + @@ -1635,13 +1635,13 @@ static int exp_tid_setup(struct file *fp, struct hfi1_tid_info *tinfo) * as 0 because we don't check the entire bitmap but * we start from bitidx. */ - spin_unlock(&uctxt->exp_lock); + mutex_unlock(&uctxt->exp_lock); continue; } bits_used = min(free, ngroups); tidmap[useidx] |= ((1ULL << bits_used) - 1) << bitidx; uctxt->tidusemap[useidx] |= tidmap[useidx]; - spin_unlock(&uctxt->exp_lock); + mutex_unlock(&uctxt->exp_lock); /* * At this point, we know where in the map we have free bits. @@ -1677,10 +1677,10 @@ static int exp_tid_setup(struct file *fp, struct hfi1_tid_info *tinfo) * Let go of the bits that we reserved since we are not * going to use them. 
*/ - spin_lock(&uctxt->exp_lock); + mutex_lock(&uctxt->exp_lock); uctxt->tidusemap[useidx] &= ~(((1ULL << bits_used) - 1) << bitidx); - spin_unlock(&uctxt->exp_lock); + mutex_unlock(&uctxt->exp_lock); goto done; } /* diff --git a/drivers/staging/rdma/hfi1/hfi.h b/drivers/staging/rdma/hfi1/hfi.h index 996dd520cf41..8ae914aab9bf 100644 --- a/drivers/staging/rdma/hfi1/hfi.h +++ b/drivers/staging/rdma/hfi1/hfi.h @@ -258,7 +258,7 @@ struct hfi1_ctxtdata { struct exp_tid_set tid_full_list; /* lock protecting all Expected TID data */ - spinlock_t exp_lock; + struct mutex exp_lock; /* number of pio bufs for this ctxt (all procs, if shared) */ u32 piocnt; /* first pio buffer for this ctxt */ diff --git a/drivers/staging/rdma/hfi1/init.c b/drivers/staging/rdma/hfi1/init.c index 98aaa0ebff51..503dc7a397a5 100644 --- a/drivers/staging/rdma/hfi1/init.c +++ b/drivers/staging/rdma/hfi1/init.c @@ -227,7 +227,7 @@ struct hfi1_ctxtdata *hfi1_create_ctxtdata(struct hfi1_pportdata *ppd, u32 ctxt) rcd->numa_id = numa_node_id(); rcd->rcv_array_groups = dd->rcv_entries.ngroups; - spin_lock_init(&rcd->exp_lock); + mutex_init(&rcd->exp_lock); /* * Calculate the context's RcvArray entry starting point. -- 1.8.2 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 08/14] staging/rdma/hfi1: Start adding building blocks for TID caching
From: Mitko Haralanov Functions added by this patch are building blocks for the upcoming TID caching functionality. The functions added are currently unused (and marked as such). The functions' purposes are to find physically contiguous pages in the user's virtual buffer, program the RcvArray group entries with these physical chunks, and unprogram the RcvArray groups. Reviewed-by: Ira Weiny Signed-off-by: Mitko Haralanov --- drivers/staging/rdma/hfi1/user_exp_rcv.c | 310 +++ 1 file changed, 310 insertions(+) diff --git a/drivers/staging/rdma/hfi1/user_exp_rcv.c b/drivers/staging/rdma/hfi1/user_exp_rcv.c index 7f15024daab9..0eb888fcaf70 100644 --- a/drivers/staging/rdma/hfi1/user_exp_rcv.c +++ b/drivers/staging/rdma/hfi1/user_exp_rcv.c @@ -83,8 +83,20 @@ static const char * const mmu_types[] = { "RANGE" }; +struct tid_pageset { + u16 idx; + u16 count; +}; + #define EXP_TID_SET_EMPTY(set) (set.count == 0 && list_empty(&set.list)) +static void unlock_exp_tids(struct hfi1_ctxtdata *, struct exp_tid_set *, + struct rb_root *) __maybe_unused; +static u32 find_phys_blocks(struct page **, unsigned, + struct tid_pageset *) __maybe_unused; +static int set_rcvarray_entry(struct file *, unsigned long, u32, + struct tid_group *, struct page **, + unsigned) __maybe_unused; static inline int mmu_addr_cmp(struct mmu_rb_node *, unsigned long, unsigned long); static struct mmu_rb_node *mmu_rb_search_by_addr(struct rb_root *, @@ -103,6 +115,21 @@ static inline void mmu_notifier_page(struct mmu_notifier *, struct mm_struct *, static inline void mmu_notifier_range_start(struct mmu_notifier *, struct mm_struct *, unsigned long, unsigned long); +static int program_rcvarray(struct file *, unsigned long, struct tid_group *, + struct tid_pageset *, unsigned, u16, struct page **, + u32 *, unsigned *, unsigned *) __maybe_unused; +static int unprogram_rcvarray(struct file *, u32, + struct tid_group **) __maybe_unused; +static void clear_tid_node(struct hfi1_filedata *, u16, + struct mmu_rb_node 
*) __maybe_unused; + +static inline u32 rcventry2tidinfo(u32 rcventry) +{ + u32 pair = rcventry & ~0x1; + + return EXP_TID_SET(IDX, pair >> 1) | + EXP_TID_SET(CTRL, 1 << (rcventry - pair)); +} static inline void exp_tid_group_init(struct exp_tid_set *set) { @@ -193,6 +220,289 @@ int hfi1_user_exp_rcv_invalid(struct file *fp, struct hfi1_tid_info *tinfo) return -EINVAL; } +static u32 find_phys_blocks(struct page **pages, unsigned npages, + struct tid_pageset *list) +{ + unsigned pagecount, pageidx, setcount = 0, i; + unsigned long pfn, this_pfn; + + if (!npages) + return 0; + + /* +* Look for sets of physically contiguous pages in the user buffer. +* This will allow us to optimize Expected RcvArray entry usage by +* using the bigger supported sizes. +*/ + pfn = page_to_pfn(pages[0]); + for (pageidx = 0, pagecount = 1, i = 1; i <= npages; i++) { + this_pfn = i < npages ? page_to_pfn(pages[i]) : 0; + + /* +* If the pfn's are not sequential, pages are not physically +* contiguous. +*/ + if (this_pfn != ++pfn) { + /* +* At this point we have to loop over the set of +* physically contiguous pages and break them down into +* sizes supported by the HW. +* There are two main constraints: +* 1. The max buffer size is MAX_EXPECTED_BUFFER. +*If the total set size is bigger than that +*program only a MAX_EXPECTED_BUFFER chunk. +* 2. The buffer size has to be a power of two. If +*it is not, round down to the closest power of +*2 and program that size. +*/ + while (pagecount) { + int maxpages = pagecount; + u32 bufsize = pagecount * PAGE_SIZE; + + if (bufsize > MAX_EXPECTED_BUFFER) + maxpages = + MAX_EXPECTED_BUFFER >> + PAGE_SHIFT; + else if (!is_power_of_2(bufsize)) + maxpages = +
[PATCH 01/14] staging/rdma/hfi1: Add function stubs for TID caching
From: Mitko Haralanov Add mmu notify helper functions and TID caching function stubs in preparation for the TID caching implementation. TID caching makes use of the MMU notifier to allow the driver to respond to the user freeing memory which is allocated to the HFI. This patch implements the basic MMU notifier functions to insert, find and remove buffer pages from memory based on the mmu_notifier being invoked. In addition, it puts stubs in place for the main entry points needed by follow-on code. Follow up patches will complete the implementation of the interaction with user space and make use of these functions. Reviewed-by: Ira Weiny Signed-off-by: Mitko Haralanov --- drivers/staging/rdma/hfi1/Kconfig| 1 + drivers/staging/rdma/hfi1/Makefile | 2 +- drivers/staging/rdma/hfi1/hfi.h | 4 + drivers/staging/rdma/hfi1/user_exp_rcv.c | 264 +++ drivers/staging/rdma/hfi1/user_exp_rcv.h | 8 + 5 files changed, 278 insertions(+), 1 deletion(-) create mode 100644 drivers/staging/rdma/hfi1/user_exp_rcv.c diff --git a/drivers/staging/rdma/hfi1/Kconfig b/drivers/staging/rdma/hfi1/Kconfig index fd25078ee923..bd0249bcf199 100644 --- a/drivers/staging/rdma/hfi1/Kconfig +++ b/drivers/staging/rdma/hfi1/Kconfig @@ -1,6 +1,7 @@ config INFINIBAND_HFI1 tristate "Intel OPA Gen1 support" depends on X86_64 + select MMU_NOTIFIER default m ---help--- This is a low-level driver for Intel OPA Gen1 adapter. 
diff --git a/drivers/staging/rdma/hfi1/Makefile b/drivers/staging/rdma/hfi1/Makefile index 68c5a315e557..e63251b9c56b 100644 --- a/drivers/staging/rdma/hfi1/Makefile +++ b/drivers/staging/rdma/hfi1/Makefile @@ -10,7 +10,7 @@ obj-$(CONFIG_INFINIBAND_HFI1) += hfi1.o hfi1-y := chip.o cq.o device.o diag.o dma.o driver.o efivar.o eprom.o file_ops.o firmware.o \ init.o intr.o keys.o mad.o mmap.o mr.o pcie.o pio.o pio_copy.o \ qp.o qsfp.o rc.o ruc.o sdma.o srq.o sysfs.o trace.o twsi.o \ - uc.o ud.o user_pages.o user_sdma.o verbs_mcast.o verbs.o + uc.o ud.o user_exp_rcv.o user_pages.o user_sdma.o verbs_mcast.o verbs.o hfi1-$(CONFIG_DEBUG_FS) += debugfs.o CFLAGS_trace.o = -I$(src) diff --git a/drivers/staging/rdma/hfi1/hfi.h b/drivers/staging/rdma/hfi1/hfi.h index a4a294558c03..057a41cee734 100644 --- a/drivers/staging/rdma/hfi1/hfi.h +++ b/drivers/staging/rdma/hfi1/hfi.h @@ -65,6 +65,8 @@ #include #include #include +#include +#include #include "chip_registers.h" #include "common.h" @@ -1126,6 +1128,8 @@ struct hfi1_devdata { #define PT_EAGER1 #define PT_INVALID 2 +struct mmu_rb_node; + /* Private data for file operations */ struct hfi1_filedata { struct hfi1_ctxtdata *uctxt; diff --git a/drivers/staging/rdma/hfi1/user_exp_rcv.c b/drivers/staging/rdma/hfi1/user_exp_rcv.c new file mode 100644 index ..bafeddf67c8f --- /dev/null +++ b/drivers/staging/rdma/hfi1/user_exp_rcv.c @@ -0,0 +1,264 @@ +/* + * + * This file is provided under a dual BSD/GPLv2 license. When using or + * redistributing this file, you may do so under either license. + * + * GPL LICENSE SUMMARY + * + * Copyright(c) 2015 Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of version 2 of the GNU General Public License as + * published by the Free Software Foundation. 
+ * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * BSD LICENSE + * + * Copyright(c) 2015 Intel Corporation. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * - Redistributions of source code must retain the above copyright + *notice, this list of conditions and the following disclaimer. + * - Redistributions in binary form must reproduce the above copyright + *notice, this list of conditions and the following disclaimer in + *the documentation and/or other materials provided with the + *distribution. + * - Neither the name of Intel Corporation nor the names of its + *contributors may be used to endorse or promote products derived + *from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR
[PATCH 03/14] uapi/rdma/hfi/hfi1_user.h: Convert definitions to use BIT() macro
From: Mitko Haralanov Convert bit definitions to use BIT() macro as per checkpatch.pl requirements. Reviewed-by: Ira Weiny Signed-off-by: Mitko Haralanov --- include/uapi/rdma/hfi/hfi1_user.h | 56 +++ 1 file changed, 28 insertions(+), 28 deletions(-) diff --git a/include/uapi/rdma/hfi/hfi1_user.h b/include/uapi/rdma/hfi/hfi1_user.h index cf172718e3d5..a65f2fe17660 100644 --- a/include/uapi/rdma/hfi/hfi1_user.h +++ b/include/uapi/rdma/hfi/hfi1_user.h @@ -83,29 +83,29 @@ * driver features. The same set of bits are communicated to user * space. */ -#define HFI1_CAP_DMA_RTAIL(1UL << 0) /* Use DMA'ed RTail value */ -#define HFI1_CAP_SDMA (1UL << 1) /* Enable SDMA support */ -#define HFI1_CAP_SDMA_AHG (1UL << 2) /* Enable SDMA AHG support */ -#define HFI1_CAP_EXTENDED_PSN (1UL << 3) /* Enable Extended PSN support */ -#define HFI1_CAP_HDRSUPP (1UL << 4) /* Enable Header Suppression */ -/* 1UL << 5 unused */ -#define HFI1_CAP_USE_SDMA_HEAD(1UL << 6) /* DMA Hdr Q tail vs. use CSR */ -#define HFI1_CAP_MULTI_PKT_EGR(1UL << 7) /* Enable multi-packet Egr buffs*/ -#define HFI1_CAP_NODROP_RHQ_FULL (1UL << 8) /* Don't drop on Hdr Q full */ -#define HFI1_CAP_NODROP_EGR_FULL (1UL << 9) /* Don't drop on EGR buffs full */ -#define HFI1_CAP_TID_UNMAP(1UL << 10) /* Disable Expected TID caching */ -#define HFI1_CAP_PRINT_UNIMPL (1UL << 11) /* Show for unimplemented feats */ -#define HFI1_CAP_ALLOW_PERM_JKEY (1UL << 12) /* Allow use of permissive JKEY */ -#define HFI1_CAP_NO_INTEGRITY (1UL << 13) /* Enable ctxt integrity checks */ -#define HFI1_CAP_PKEY_CHECK (1UL << 14) /* Enable ctxt PKey checking */ -#define HFI1_CAP_STATIC_RATE_CTRL (1UL << 15) /* Allow PBC.StaticRateControl */ -/* 1UL << 16 unused */ -#define HFI1_CAP_SDMA_HEAD_CHECK (1UL << 17) /* SDMA head checking */ -#define HFI1_CAP_EARLY_CREDIT_RETURN (1UL << 18) /* early credit return */ - -#define HFI1_RCVHDR_ENTSIZE_2(1UL << 0) -#define HFI1_RCVHDR_ENTSIZE_16 (1UL << 1) -#define HFI1_RCVDHR_ENTSIZE_32 (1UL << 2) +#define 
HFI1_CAP_DMA_RTAIL BIT(0) /* Use DMA'ed RTail value */ +#define HFI1_CAP_SDMA BIT(1) /* Enable SDMA support */ +#define HFI1_CAP_SDMA_AHG BIT(2) /* Enable SDMA AHG support */ +#define HFI1_CAP_EXTENDED_PSN BIT(3) /* Enable Extended PSN support */ +#define HFI1_CAP_HDRSUPP BIT(4) /* Enable Header Suppression */ +/* BIT(5) unused */ +#define HFI1_CAP_USE_SDMA_HEAD BIT(6) /* DMA Hdr Q tail vs. use CSR */ +#define HFI1_CAP_MULTI_PKT_EGR BIT(7) /* Enable multi-packet Egr buffs*/ +#define HFI1_CAP_NODROP_RHQ_FULL BIT(8) /* Don't drop on Hdr Q full */ +#define HFI1_CAP_NODROP_EGR_FULL BIT(9) /* Don't drop on EGR buffs full */ +#define HFI1_CAP_TID_UNMAP BIT(10) /* Disable Expected TID caching */ +#define HFI1_CAP_PRINT_UNIMPL BIT(11) /* Show for unimplemented feats */ +#define HFI1_CAP_ALLOW_PERM_JKEY BIT(12) /* Allow use of permissive JKEY */ +#define HFI1_CAP_NO_INTEGRITY BIT(13) /* Enable ctxt integrity checks */ +#define HFI1_CAP_PKEY_CHECK BIT(14) /* Enable ctxt PKey checking */ +#define HFI1_CAP_STATIC_RATE_CTRL BIT(15) /* Allow PBC.StaticRateControl */ +/* BIT(16) unused */ +#define HFI1_CAP_SDMA_HEAD_CHECK BIT(17) /* SDMA head checking */ +#define HFI1_CAP_EARLY_CREDIT_RETURN BIT(18) /* early credit return */ + +#define HFI1_RCVHDR_ENTSIZE_2 BIT(0) +#define HFI1_RCVHDR_ENTSIZE_16 BIT(1) +#define HFI1_RCVDHR_ENTSIZE_32 BIT(2) /* * If the unit is specified via open, HFI choice is fixed. 
If port is @@ -149,11 +149,11 @@ #define _HFI1_EVENT_SL2VL_CHANGE_BIT 4 #define _HFI1_MAX_EVENT_BIT _HFI1_EVENT_SL2VL_CHANGE_BIT -#define HFI1_EVENT_FROZEN (1UL << _HFI1_EVENT_FROZEN_BIT) -#define HFI1_EVENT_LINKDOWN (1UL << _HFI1_EVENT_LINKDOWN_BIT) -#define HFI1_EVENT_LID_CHANGE (1UL << _HFI1_EVENT_LID_CHANGE_BIT) -#define HFI1_EVENT_LMC_CHANGE (1UL << _HFI1_EVENT_LMC_CHANGE_BIT) -#define HFI1_EVENT_SL2VL_CHANGE (1UL << _HFI1_EVENT_SL2VL_CHANGE_BIT) +#define HFI1_EVENT_FROZEN BIT(_HFI1_EVENT_FROZEN_BIT) +#define HFI1_EVENT_LINKDOWN BIT(_HFI1_EVENT_LINKDOWN_BIT) +#define HFI1_EVENT_LID_CHANGE BIT(_HFI1_EVENT_LID_CHANGE_BIT) +#define HFI1_EVENT_LMC_CHANGE BIT(_HFI1_EVENT_LMC_CHANGE_BIT) +#define HFI1_EVENT_SL2VL_CHANGE BIT(_HFI1_EVENT_SL2VL_CHANGE_BIT) /* * These are the status bits readable (in ASCII form, 64bit value) -- 1.8.2 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 00/14] Implement Expected Receive TID Caching
From: Ira Weiny Expected receives work by user-space libraries (PSM) calling into the driver with information about the user's receive buffer and have the driver DMA-map that buffer and program the HFI to receive data directly into it. This is an expensive operation as it requires the driver to pin the pages which the user's buffer maps to, DMA-map them, and then program the HFI. When the receive is complete, user-space libraries have to call into the driver again so the buffer is removed from the HFI, un-mapped, and the pages unpinned. All of these operations are expensive, considering that a lot of applications (especially micro-benchmarks) use the same buffer over and over. In order to get better performance for user-space applications, it is highly beneficial that they don't continuously call into the driver to register and unregister the same buffer. Rather, they can register the buffer and cache it for future work. The buffer can be unregistered when it is freed by the user. This change implements such buffer caching by making use of the kernel's MMU notifier API. User-space libraries call into the driver only when they need to register a new buffer. Once a buffer is registered, it stays programmed into the HFI until the kernel notifies the driver that the buffer has been freed by the user. At that time, the user-space library is notified and it can do the necessary work to remove the buffer from its cache. Buffers which have been invalidated by the kernel are not automatically removed from the HFI and do not have their pages unpinned. Buffers are only completely removed when the user-space libraries call into the driver to free them. This is done to ensure that any ongoing transfers into that buffer are complete. This is important when a buffer is not completely freed but rather it is shrunk. The user-space library could still have uncompleted transfers into the remaining buffer. 
With this feature, it is important that systems are setup with reasonable limits for the amount of lockable memory. Keeping the limit at "unlimited" (as we've done up to this point), may result in jobs being killed by the kernel's OOM due to them taking up excessive amounts of memory. TID caching started as a single patch which we have broken up. Original patch here. http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2015-November/080855.html This directly depends on the initial break up work which was submitted before: http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2015-December/082339.html Mitko Haralanov (14): staging/rdma/hfi1: Add function stubs for TID caching uapi/rdma/hfi/hfi1_user.h: Correct comment for capability bit uapi/rdma/hfi/hfi1_user.h: Convert definitions to use BIT() macro uapi/rdma/hfi/hfi1_user.h: Add command and event for TID caching staging/rdma/hfi1: Add definitions needed for TID caching support staging/rdma/hfi1: Remove un-needed variable staging/rdma/hfi1: Add definitions and support functions for TID groups staging/rdma/hfi1: Start adding building blocks for TID caching staging/rdma/hfi1: Convert lock to mutex staging/rdma/hfi1: Add Expected receive init and free functions staging/rdma/hfi1: Add MMU notifier callback function staging/rdma/hfi1: Add TID free/clear function bodies staging/rdma/hfi1: Add TID entry program function body staging/rdma/hfi1: Enable TID caching feature drivers/staging/rdma/hfi1/Kconfig|1 + drivers/staging/rdma/hfi1/Makefile |2 +- drivers/staging/rdma/hfi1/file_ops.c | 458 +--- drivers/staging/rdma/hfi1/hfi.h | 40 +- drivers/staging/rdma/hfi1/init.c |5 +- drivers/staging/rdma/hfi1/trace.h| 132 ++-- drivers/staging/rdma/hfi1/user_exp_rcv.c | 1181 ++ drivers/staging/rdma/hfi1/user_exp_rcv.h |8 + drivers/staging/rdma/hfi1/user_pages.c | 14 - include/uapi/rdma/hfi/hfi1_user.h| 68 +- 10 files changed, 1373 insertions(+), 536 deletions(-) create mode 100644 
drivers/staging/rdma/hfi1/user_exp_rcv.c -- 1.8.2 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/14] uapi/rdma/hfi/hfi1_user.h: Add command and event for TID caching
From: Mitko Haralanov TID caching will use a new event to signal userland that cache invalidation has occurred and needs a matching command code that will be used to read the invalidated TIDs. Add the event bit and the new command to the exported header file. The command is also added to the switch() statement in file_ops.c for completeness and in preparation for its usage later. Reviewed-by: Ira Weiny Signed-off-by: Mitko Haralanov --- drivers/staging/rdma/hfi1/file_ops.c | 1 + include/uapi/rdma/hfi/hfi1_user.h| 5 - 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/staging/rdma/hfi1/file_ops.c b/drivers/staging/rdma/hfi1/file_ops.c index d57d549052c8..c66693532be0 100644 --- a/drivers/staging/rdma/hfi1/file_ops.c +++ b/drivers/staging/rdma/hfi1/file_ops.c @@ -241,6 +241,7 @@ static ssize_t hfi1_file_write(struct file *fp, const char __user *data, must_be_root = 1; /* validate user */ copy = 0; break; + case HFI1_CMD_TID_INVAL_READ: default: ret = -EINVAL; goto bail; diff --git a/include/uapi/rdma/hfi/hfi1_user.h b/include/uapi/rdma/hfi/hfi1_user.h index a65f2fe17660..959204df5318 100644 --- a/include/uapi/rdma/hfi/hfi1_user.h +++ b/include/uapi/rdma/hfi/hfi1_user.h @@ -134,6 +134,7 @@ #define HFI1_CMD_ACK_EVENT 10/* ack & clear user status bits */ #define HFI1_CMD_SET_PKEY11 /* set context's pkey */ #define HFI1_CMD_CTXT_RESET 12 /* reset context's HW send context */ +#define HFI1_CMD_TID_INVAL_READ 13 /* read TID cache invalidations */ /* separate EPROM commands from normal PSM commands */ #define HFI1_CMD_EP_INFO 64 /* read EPROM device ID */ #define HFI1_CMD_EP_ERASE_CHIP 65 /* erase whole EPROM */ @@ -147,13 +148,15 @@ #define _HFI1_EVENT_LID_CHANGE_BIT 2 #define _HFI1_EVENT_LMC_CHANGE_BIT 3 #define _HFI1_EVENT_SL2VL_CHANGE_BIT 4 -#define _HFI1_MAX_EVENT_BIT _HFI1_EVENT_SL2VL_CHANGE_BIT +#define _HFI1_EVENT_TID_MMU_NOTIFY_BIT 5 +#define _HFI1_MAX_EVENT_BIT _HFI1_EVENT_TID_MMU_NOTIFY_BIT #define HFI1_EVENT_FROZENBIT(_HFI1_EVENT_FROZEN_BIT) 
#define HFI1_EVENT_LINKDOWN BIT(_HFI1_EVENT_LINKDOWN_BIT) #define HFI1_EVENT_LID_CHANGEBIT(_HFI1_EVENT_LID_CHANGE_BIT) #define HFI1_EVENT_LMC_CHANGEBIT(_HFI1_EVENT_LMC_CHANGE_BIT) #define HFI1_EVENT_SL2VL_CHANGE BIT(_HFI1_EVENT_SL2VL_CHANGE_BIT) +#define HFI1_EVENT_TID_MMU_NOTIFYBIT(_HFI1_EVENT_TID_MMU_NOTIFY_BIT) /* * These are the status bits readable (in ASCII form, 64bit value) -- 1.8.2 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 15/15] i40iw: changes for build of i40iw module
Hi Faisal, [auto build test WARNING on net/master] [also build test WARNING on v4.4-rc5 next-20151216] [cannot apply to net-next/master] url: https://github.com/0day-ci/linux/commits/Faisal-Latif/add-Intel-R-X722-iWARP-driver/20151217-040340 config: sparc-allyesconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=sparc All warnings (new ones prefixed by >>): drivers/infiniband/hw/i40iw/i40iw_verbs.c: In function 'i40iw_setup_kmode_qp': >> drivers/infiniband/hw/i40iw/i40iw_verbs.c:571:28: warning: cast to pointer >> from integer of different size [-Wint-to-pointer-cast] info->rq_pa = (uintptr_t)((u8 *)mem->pa + (sqdepth * I40IW_QP_WQE_MIN_SIZE)); ^ vim +571 drivers/infiniband/hw/i40iw/i40iw_verbs.c e4d636f5 Faisal Latif 2015-12-16 555 ukinfo->rq_wrid_array = (u64 *)&ukinfo->sq_wrtrk_array[sqdepth]; e4d636f5 Faisal Latif 2015-12-16 556 e4d636f5 Faisal Latif 2015-12-16 557 size = (sqdepth + rqdepth) * I40IW_QP_WQE_MIN_SIZE; e4d636f5 Faisal Latif 2015-12-16 558 size += (I40IW_SHADOW_AREA_SIZE << 3); e4d636f5 Faisal Latif 2015-12-16 559 e4d636f5 Faisal Latif 2015-12-16 560 status = i40iw_allocate_dma_mem(iwdev->sc_dev.hw, mem, size, 256); e4d636f5 Faisal Latif 2015-12-16 561 if (status) { e4d636f5 Faisal Latif 2015-12-16 562 kfree(ukinfo->sq_wrtrk_array); e4d636f5 Faisal Latif 2015-12-16 563 ukinfo->sq_wrtrk_array = NULL; e4d636f5 Faisal Latif 2015-12-16 564 return -ENOMEM; e4d636f5 Faisal Latif 2015-12-16 565 } e4d636f5 Faisal Latif 2015-12-16 566 e4d636f5 Faisal Latif 2015-12-16 567 ukinfo->sq = mem->va; e4d636f5 Faisal Latif 2015-12-16 568 info->sq_pa = mem->pa; e4d636f5 Faisal Latif 2015-12-16 569 e4d636f5 Faisal Latif 2015-12-16 570 ukinfo->rq = (u64 *)((u8 *)mem->va + (sqdepth * I40IW_QP_WQE_MIN_SIZE)); e4d636f5 Faisal Latif 2015-12-16 @571 info->rq_pa = (uintptr_t)((u8 *)mem->pa 
+ (sqdepth * I40IW_QP_WQE_MIN_SIZE)); e4d636f5 Faisal Latif 2015-12-16 572 e4d636f5 Faisal Latif 2015-12-16 573 ukinfo->shadow_area = (u64 *)((u8 *)ukinfo->rq + e4d636f5 Faisal Latif 2015-12-16 574 (rqdepth * I40IW_QP_WQE_MIN_SIZE)); e4d636f5 Faisal Latif 2015-12-16 575 info->shadow_area_pa = info->rq_pa + (rqdepth * I40IW_QP_WQE_MIN_SIZE); e4d636f5 Faisal Latif 2015-12-16 576 e4d636f5 Faisal Latif 2015-12-16 577 ukinfo->sq_size = sq_size; e4d636f5 Faisal Latif 2015-12-16 578 ukinfo->rq_size = rq_size; e4d636f5 Faisal Latif 2015-12-16 579 ukinfo->qp_id = iwqp->ibqp.qp_num; :: The code at line 571 was first introduced by commit :: e4d636f5c9dea5d2dd1f5c74e3a2235218a537a8 i40iw: add files for iwarp interface :: TO: Faisal Latif :: CC: 0day robot --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
[PATCH] IB/mlx4: Replace kfree with kvfree in mlx4_ib_destroy_srq
Commit 0ef2f05c7e02ff99c0b5b583d7dee2cd12b053f2 uses vmalloc for WR buffers when needed and uses kvfree to free the buffers. It missed changing kfree to kvfree in mlx4_ib_destroy_srq(). Reported-by: Matthew Finaly Signed-off-by: Wengang Wang --- drivers/infiniband/hw/mlx4/srq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c index 8d133c4..c394376 100644 --- a/drivers/infiniband/hw/mlx4/srq.c +++ b/drivers/infiniband/hw/mlx4/srq.c @@ -286,7 +286,7 @@ int mlx4_ib_destroy_srq(struct ib_srq *srq) mlx4_ib_db_unmap_user(to_mucontext(srq->uobject->context), &msrq->db); ib_umem_release(msrq->umem); } else { - kfree(msrq->wrid); + kvfree(msrq->wrid); mlx4_buf_free(dev->dev, msrq->msrq.max << msrq->msrq.wqe_shift, &msrq->buf); mlx4_db_free(dev->dev, &msrq->db); -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3] IB/mlx4: Use vmalloc for WR buffers when needed
Hi Matt, Yes, you are right. Since the patch is already merged in, I am going to make a separate patch for that. thanks, wengang On 2015-12-12 04:28, Matthew Finlay wrote: Hi Wengang, I was going through your patch set here, and it seems that you missed changing kfree to kvfree in mlx4_ib_destroy_srq(). In the current code if the srq wrid is allocated using vmalloc, then on cleanup we will use kfree, which is a bug. Thanks, -matt On 10/7/15, 10:27 PM, "linux-rdma-ow...@vger.kernel.org on behalf of Wengang Wang" wrote: There are several hits that WR buffer allocation(kmalloc) failed. It failed at order 3 and/or 4 contiguous pages allocation. At the same time there are actually 100MB+ free memory but well fragmented. So try vmalloc when kmalloc failed. Signed-off-by: Wengang Wang Acked-by: Or Gerlitz --- drivers/infiniband/hw/mlx4/qp.c | 19 +-- drivers/infiniband/hw/mlx4/srq.c | 11 --- 2 files changed, 21 insertions(+), 9 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c index 4ad9be3..3ccbd3a 100644 --- a/drivers/infiniband/hw/mlx4/qp.c +++ b/drivers/infiniband/hw/mlx4/qp.c @@ -34,6 +34,7 @@ #include #include #include +#include #include #include @@ -786,8 +787,14 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd, if (err) goto err_mtt; - qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof (u64), gfp); - qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof (u64), gfp); + qp->sq.wrid = kmalloc(qp->sq.wqe_cnt * sizeof(u64), gfp); + if (!qp->sq.wrid) + qp->sq.wrid = __vmalloc(qp->sq.wqe_cnt * sizeof(u64), + gfp, PAGE_KERNEL); + qp->rq.wrid = kmalloc(qp->rq.wqe_cnt * sizeof(u64), gfp); + if (!qp->rq.wrid) + qp->rq.wrid = __vmalloc(qp->rq.wqe_cnt * sizeof(u64), + gfp, PAGE_KERNEL); if (!qp->sq.wrid || !qp->rq.wrid) { err = -ENOMEM; goto err_wrid; @@ -874,8 +881,8 @@ err_wrid: if (qp_has_rq(init_attr)) mlx4_ib_db_unmap_user(to_mucontext(pd->uobject->context), &qp->db); } else { - kfree(qp->sq.wrid); - kfree(qp->rq.wrid); + 
kvfree(qp->sq.wrid); + kvfree(qp->rq.wrid); } err_mtt: @@ -1050,8 +1057,8 @@ static void destroy_qp_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp, &qp->db); ib_umem_release(qp->umem); } else { - kfree(qp->sq.wrid); - kfree(qp->rq.wrid); + kvfree(qp->sq.wrid); + kvfree(qp->rq.wrid); if (qp->mlx4_ib_qp_type & (MLX4_IB_QPT_PROXY_SMI_OWNER | MLX4_IB_QPT_PROXY_SMI | MLX4_IB_QPT_PROXY_GSI)) free_proxy_bufs(&dev->ib_dev, qp); diff --git a/drivers/infiniband/hw/mlx4/srq.c b/drivers/infiniband/hw/mlx4/srq.c index dce5dfe..8d133c4 100644 --- a/drivers/infiniband/hw/mlx4/srq.c +++ b/drivers/infiniband/hw/mlx4/srq.c @@ -34,6 +34,7 @@ #include #include #include +#include #include "mlx4_ib.h" #include "user.h" @@ -172,8 +173,12 @@ struct ib_srq *mlx4_ib_create_srq(struct ib_pd *pd, srq->wrid = kmalloc(srq->msrq.max * sizeof (u64), GFP_KERNEL); if (!srq->wrid) { - err = -ENOMEM; - goto err_mtt; + srq->wrid = __vmalloc(srq->msrq.max * sizeof(u64), + GFP_KERNEL, PAGE_KERNEL); + if (!srq->wrid) { + err = -ENOMEM; + goto err_mtt; + } } } @@ -204,7 +209,7 @@ err_wrid: if (pd->uobject) mlx4_ib_db_unmap_user(to_mucontext(pd->uobject->context), &srq->db); else - kfree(srq->wrid); + kvfree(srq->wrid); err_mtt: mlx4_mtt_cleanup(dev->dev, &srq->mtt); -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 08/10] xprtrdma: Add ro_unmap_sync method for all-physical registration
physical's ro_unmap is synchronous already. The new ro_unmap_sync method just has to DMA unmap all MRs associated with the RPC request. Signed-off-by: Chuck Lever Tested-by: Devesh Sharma --- net/sunrpc/xprtrdma/physical_ops.c | 13 + 1 file changed, 13 insertions(+) diff --git a/net/sunrpc/xprtrdma/physical_ops.c b/net/sunrpc/xprtrdma/physical_ops.c index 617b76f..dbb302e 100644 --- a/net/sunrpc/xprtrdma/physical_ops.c +++ b/net/sunrpc/xprtrdma/physical_ops.c @@ -83,6 +83,18 @@ physical_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg) return 1; } +/* DMA unmap all memory regions that were mapped for "req". + */ +static void +physical_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req) +{ + struct ib_device *device = r_xprt->rx_ia.ri_device; + unsigned int i; + + for (i = 0; req->rl_nchunks; --req->rl_nchunks) + rpcrdma_unmap_one(device, &req->rl_segments[i++]); +} + static void physical_op_destroy(struct rpcrdma_buffer *buf) { @@ -90,6 +102,7 @@ physical_op_destroy(struct rpcrdma_buffer *buf) const struct rpcrdma_memreg_ops rpcrdma_physical_memreg_ops = { .ro_map = physical_op_map, + .ro_unmap_sync = physical_op_unmap_sync, .ro_unmap = physical_op_unmap, .ro_open= physical_op_open, .ro_maxpages= physical_op_maxpages, -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 09/10] xprtrdma: Invalidate in the RPC reply handler
There is a window between the time the RPC reply handler wakes the waiting RPC task and when xprt_release() invokes ops->buf_free. During this time, memory regions containing the data payload may still be accessed by a broken or malicious server, but the RPC application has already been allowed access to the memory containing the RPC request's data payloads. The server should be fenced from client memory containing RPC data payloads _before_ the RPC application is allowed to continue. This change also more strongly enforces send queue accounting. There is a maximum number of RPC calls allowed to be outstanding. When an RPC/RDMA transport is set up, just enough send queue resources are allocated to handle registration, Send, and invalidation WRs for each of those RPCs at the same time. Before, additional RPC calls could be dispatched while invalidation WRs were still consuming send WQEs. When invalidation WRs backed up, dispatching additional RPCs resulted in a send queue overrun. Now, the reply handler prevents RPC dispatch until invalidation is complete. This prevents RPC call dispatch until there are enough send queue resources to proceed. Still to do: If an RPC exits early (say, ^C), the reply handler has no opportunity to perform invalidation. Currently, xprt_rdma_free() still frees remaining RDMA resources, which could deadlock. Additional changes are needed to handle invalidation properly in this case. Reported-by: Jason Gunthorpe Signed-off-by: Chuck Lever Tested-by: Devesh Sharma --- net/sunrpc/xprtrdma/rpc_rdma.c | 16 1 file changed, 16 insertions(+) diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c index c10d969..0f28f2d 100644 --- a/net/sunrpc/xprtrdma/rpc_rdma.c +++ b/net/sunrpc/xprtrdma/rpc_rdma.c @@ -804,6 +804,11 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep) if (req->rl_reply) goto out_duplicate; + /* Sanity checking has passed. We are now committed +* to complete this transaction.
+*/ + list_del_init(&rqst->rq_list); + spin_unlock_bh(&xprt->transport_lock); dprintk("RPC: %s: reply 0x%p completes request 0x%p\n" " RPC request 0x%p xid 0x%08x\n", __func__, rep, req, rqst, @@ -888,12 +893,23 @@ badheader: break; } + /* Invalidate and flush the data payloads before waking the +* waiting application. This guarantees the memory region is +* properly fenced from the server before the application +* accesses the data. It also ensures proper send flow +* control: waking the next RPC waits until this RPC has +* relinquished all its Send Queue entries. +*/ + if (req->rl_nchunks) + r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req); + credits = be32_to_cpu(headerp->rm_credit); if (credits == 0) credits = 1;/* don't deadlock */ else if (credits > r_xprt->rx_buf.rb_max_requests) credits = r_xprt->rx_buf.rb_max_requests; + spin_lock_bh(&xprt->transport_lock); cwnd = xprt->cwnd; xprt->cwnd = credits << RPC_CWNDSHIFT; if (xprt->cwnd > cwnd) -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
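The ordering this patch enforces in the reply handler — fence the server from the RPC's memory before the consumer is woken — can be sketched as a tiny standalone C model. All names here are illustrative stand-ins, not the kernel code; the model just records events so the required order can be checked.

```c
#include <assert.h>

/* Toy model of rpcrdma_reply_handler() after this patch: events are
 * recorded so the required order (fence first, wake second) is testable. */
enum reply_event { EV_UNMAP_SYNC = 1, EV_WAKE_CONSUMER = 2 };

static enum reply_event events[4];
static int n_events;

static void record(enum reply_event ev)
{
    events[n_events++] = ev;
}

/* When chunks were registered, ro_unmap_sync must complete before the
 * RPC consumer is allowed to continue. */
static void reply_handler(int rl_nchunks)
{
    if (rl_nchunks)
        record(EV_UNMAP_SYNC);   /* server fenced from RPC memory */
    record(EV_WAKE_CONSUMER);    /* only now may the app touch the data */
}
```

The same shape also gives the send-queue accounting described above: because the wake happens only after invalidation, the next RPC cannot be dispatched while LOCAL_INV WRs are still consuming send WQEs.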
[PATCH v4 10/10] xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit').
The root of the problem was that sends (especially unsignalled FASTREG and LOCAL_INV Work Requests) were not properly flow- controlled, which allowed a send queue overrun. Now that the RPC/RDMA reply handler waits for invalidation to complete, the send queue is properly flow-controlled. Thus this limit is no longer necessary. Signed-off-by: Chuck Lever Tested-by: Devesh Sharma --- net/sunrpc/xprtrdma/verbs.c |6 ++ net/sunrpc/xprtrdma/xprt_rdma.h |6 -- 2 files changed, 2 insertions(+), 10 deletions(-) diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index f23f3d6..1867e3a 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++ b/net/sunrpc/xprtrdma/verbs.c @@ -608,10 +608,8 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia, /* set trigger for requesting send completion */ ep->rep_cqinit = ep->rep_attr.cap.max_send_wr/2 - 1; - if (ep->rep_cqinit > RPCRDMA_MAX_UNSIGNALED_SENDS) - ep->rep_cqinit = RPCRDMA_MAX_UNSIGNALED_SENDS; - else if (ep->rep_cqinit <= 2) - ep->rep_cqinit = 0; + if (ep->rep_cqinit <= 2) + ep->rep_cqinit = 0; /* always signal? */ INIT_CQCOUNT(ep); init_waitqueue_head(&ep->rep_connect_wait); INIT_DELAYED_WORK(&ep->rep_connect_worker, rpcrdma_connect_worker); diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h index b8bac41..a563ffc 100644 --- a/net/sunrpc/xprtrdma/xprt_rdma.h +++ b/net/sunrpc/xprtrdma/xprt_rdma.h @@ -87,12 +87,6 @@ struct rpcrdma_ep { struct delayed_work rep_connect_worker; }; -/* - * Force a signaled SEND Work Request every so often, - * in case the provider needs to do some housekeeping. - */ -#define RPCRDMA_MAX_UNSIGNALED_SENDS (32) - #define INIT_CQCOUNT(ep) atomic_set(&(ep)->rep_cqcount, (ep)->rep_cqinit) #define DECR_CQCOUNT(ep) atomic_sub_return(1, &(ep)->rep_cqcount) -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
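The completion-signaling trigger this revert simplifies (rep_cqinit with INIT_CQCOUNT/DECR_CQCOUNT) can be modeled in a few lines of plain C. This is a rough userspace sketch with made-up structure names, not the kernel implementation: a counter of permitted unsignaled sends is decremented per post, and when it runs out the send is marked signaled and the counter re-armed.

```c
#include <assert.h>
#include <stdbool.h>

/* Rough model of the xprtrdma send-completion trigger. After the revert,
 * cqinit is simply max_send_wr/2 - 1 with no RPCRDMA_MAX_UNSIGNALED_SENDS
 * cap; roughly every (cqinit + 1)-th send requests a completion. */
struct ep_model {
    int cqinit;   /* sends allowed between signaled completions */
    int cqcount;  /* remaining unsignaled sends */
};

static void ep_init(struct ep_model *ep, int max_send_wr)
{
    ep->cqinit = max_send_wr / 2 - 1;
    if (ep->cqinit <= 2)
        ep->cqinit = 0;          /* tiny queue: always signal */
    ep->cqcount = ep->cqinit;
}

/* Returns true when this send must be posted IB_SEND_SIGNALED. */
static bool post_send(struct ep_model *ep)
{
    if (ep->cqcount-- > 0)
        return false;            /* unsignaled send */
    ep->cqcount = ep->cqinit;    /* re-arm the trigger */
    return true;
}
```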
[PATCH v4 06/10] xprtrdma: Add ro_unmap_sync method for FRWR
FRWR's ro_unmap is asynchronous. The new ro_unmap_sync posts LOCAL_INV Work Requests and waits for them to complete before returning. Note also, DMA unmapping is now done _after_ invalidation. Signed-off-by: Chuck Lever Tested-by: Devesh Sharma --- net/sunrpc/xprtrdma/frwr_ops.c | 136 ++- net/sunrpc/xprtrdma/xprt_rdma.h |2 + 2 files changed, 134 insertions(+), 4 deletions(-) diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c index 660d0b6..aa078a0 100644 --- a/net/sunrpc/xprtrdma/frwr_ops.c +++ b/net/sunrpc/xprtrdma/frwr_ops.c @@ -244,12 +244,14 @@ frwr_op_maxpages(struct rpcrdma_xprt *r_xprt) rpcrdma_max_segments(r_xprt) * ia->ri_max_frmr_depth); } -/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs to be reset. */ +/* If FAST_REG or LOCAL_INV failed, indicate the frmr needs + * to be reset. + * + * WARNING: Only wr_id and status are reliable at this point + */ static void -frwr_sendcompletion(struct ib_wc *wc) +__frwr_sendcompletion_flush(struct ib_wc *wc, struct rpcrdma_mw *r) { - struct rpcrdma_mw *r; - if (likely(wc->status == IB_WC_SUCCESS)) return; @@ -260,9 +262,23 @@ frwr_sendcompletion(struct ib_wc *wc) else pr_warn("RPC: %s: frmr %p error, status %s (%d)\n", __func__, r, ib_wc_status_msg(wc->status), wc->status); + r->r.frmr.fr_state = FRMR_IS_STALE; } +static void +frwr_sendcompletion(struct ib_wc *wc) +{ + struct rpcrdma_mw *r = (struct rpcrdma_mw *)(unsigned long)wc->wr_id; + struct rpcrdma_frmr *f = &r->r.frmr; + + if (unlikely(wc->status != IB_WC_SUCCESS)) + __frwr_sendcompletion_flush(wc, r); + + if (f->fr_waiter) + complete(&f->fr_linv_done); +} + static int frwr_op_init(struct rpcrdma_xprt *r_xprt) { @@ -334,6 +350,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg, } while (mw->r.frmr.fr_state != FRMR_IS_INVALID); frmr = &mw->r.frmr; frmr->fr_state = FRMR_IS_VALID; + frmr->fr_waiter = false; mr = frmr->fr_mr; reg_wr = &frmr->fr_regwr; @@ -413,6 +430,116 @@ out_senderr: return rc; } +static 
struct ib_send_wr * +__frwr_prepare_linv_wr(struct rpcrdma_mr_seg *seg) +{ + struct rpcrdma_mw *mw = seg->rl_mw; + struct rpcrdma_frmr *f = &mw->r.frmr; + struct ib_send_wr *invalidate_wr; + + f->fr_waiter = false; + f->fr_state = FRMR_IS_INVALID; + invalidate_wr = &f->fr_invwr; + + memset(invalidate_wr, 0, sizeof(*invalidate_wr)); + invalidate_wr->wr_id = (unsigned long)(void *)mw; + invalidate_wr->opcode = IB_WR_LOCAL_INV; + invalidate_wr->ex.invalidate_rkey = f->fr_mr->rkey; + + return invalidate_wr; +} + +static void +__frwr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg, +int rc) +{ + struct ib_device *device = r_xprt->rx_ia.ri_device; + struct rpcrdma_mw *mw = seg->rl_mw; + struct rpcrdma_frmr *f = &mw->r.frmr; + + seg->rl_mw = NULL; + + ib_dma_unmap_sg(device, f->sg, f->sg_nents, seg->mr_dir); + + if (!rc) + rpcrdma_put_mw(r_xprt, mw); + else + __frwr_queue_recovery(mw); +} + +/* Invalidate all memory regions that were registered for "req". + * + * Sleeps until it is safe for the host CPU to access the + * previously mapped memory regions. + */ +static void +frwr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req) +{ + struct ib_send_wr *invalidate_wrs, *pos, *prev, *bad_wr; + struct rpcrdma_ia *ia = &r_xprt->rx_ia; + struct rpcrdma_mr_seg *seg; + unsigned int i, nchunks; + struct rpcrdma_frmr *f; + int rc; + + dprintk("RPC: %s: req %p\n", __func__, req); + + /* ORDER: Invalidate all of the req's MRs first +* +* Chain the LOCAL_INV Work Requests and post them with +* a single ib_post_send() call. 
+*/ + invalidate_wrs = pos = prev = NULL; + seg = NULL; + for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) { + seg = &req->rl_segments[i]; + + pos = __frwr_prepare_linv_wr(seg); + + if (!invalidate_wrs) + invalidate_wrs = pos; + else + prev->next = pos; + prev = pos; + + i += seg->mr_nsegs; + } + f = &seg->rl_mw->r.frmr; + + /* Strong send queue ordering guarantees that when the +* last WR in the chain completes, all WRs in the chain +* are complete. +*/ + f->fr_invwr.send_flags = IB_SEND_SIGNALED; + f->fr_waiter = true; + init_completion(&f->fr_linv_done); + INIT_CQCOUNT(&r_xprt->rx_ep); + + /* Transport disconnect drains the receive CQ before it +* replaces the Q
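The chaining pattern used by frwr_op_unmap_sync() above — link the LOCAL_INV work requests into one chain, signal only the last, and rely on strong send queue ordering so a single completion covers the whole chain — can be sketched in standalone C. The struct below is a stand-in, not the verbs API's struct ib_send_wr.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a send work request: a next pointer and a signaled flag. */
struct send_wr {
    struct send_wr *next;
    int signaled;
};

/* Chain n WRs head-to-tail and mark only the final one signaled,
 * mirroring how frwr_op_unmap_sync() builds invalidate_wrs. */
static struct send_wr *chain(struct send_wr *wrs, int n)
{
    struct send_wr *head = NULL, *prev = NULL;
    int i;

    for (i = 0; i < n; i++) {
        wrs[i].next = NULL;
        wrs[i].signaled = 0;
        if (!head)
            head = &wrs[i];
        else
            prev->next = &wrs[i];
        prev = &wrs[i];
    }
    if (prev)
        prev->signaled = 1;      /* IB_SEND_SIGNALED on the last WR only */
    return head;
}
```

Because send queue ordering guarantees that when the last WR completes all earlier WRs have completed, waiting on one completion (fr_linv_done in the patch) is enough to know every MR in the chain is invalid.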
[PATCH v4 07/10] xprtrdma: Add ro_unmap_sync method for FMR
FMR's ro_unmap method is already synchronous because ib_unmap_fmr() is a synchronous verb. However, some improvements can be made here. 1. Gather all the MRs for the RPC request onto a list, and invoke ib_unmap_fmr() once with that list. This reduces the number of doorbells when there is more than one MR to invalidate. 2. Perform the DMA unmap _after_ the MRs are unmapped, not before. This is critical after invalidating a Write chunk. Signed-off-by: Chuck Lever Tested-by: Devesh Sharma --- net/sunrpc/xprtrdma/fmr_ops.c | 64 + 1 file changed, 64 insertions(+) diff --git a/net/sunrpc/xprtrdma/fmr_ops.c b/net/sunrpc/xprtrdma/fmr_ops.c index f1e8daf..c14f3a4 100644 --- a/net/sunrpc/xprtrdma/fmr_ops.c +++ b/net/sunrpc/xprtrdma/fmr_ops.c @@ -179,6 +179,69 @@ out_maperr: return rc; } +static void +__fmr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg) +{ + struct ib_device *device = r_xprt->rx_ia.ri_device; + struct rpcrdma_mw *mw = seg->rl_mw; + int nsegs = seg->mr_nsegs; + + seg->rl_mw = NULL; + + while (nsegs--) + rpcrdma_unmap_one(device, seg++); + + rpcrdma_put_mw(r_xprt, mw); +} + +/* Invalidate all memory regions that were registered for "req". + * + * Sleeps until it is safe for the host CPU to access the + * previously mapped memory regions. + */ +static void +fmr_op_unmap_sync(struct rpcrdma_xprt *r_xprt, struct rpcrdma_req *req) +{ + struct rpcrdma_mr_seg *seg; + unsigned int i, nchunks; + struct rpcrdma_mw *mw; + LIST_HEAD(unmap_list); + int rc; + + dprintk("RPC: %s: req %p\n", __func__, req); + + /* ORDER: Invalidate all of the req's MRs first +* +* ib_unmap_fmr() is slow, so use a single call instead +* of one call per mapped MR.
+*/ + for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) { + seg = &req->rl_segments[i]; + mw = seg->rl_mw; + + list_add(&mw->r.fmr.fmr->list, &unmap_list); + + i += seg->mr_nsegs; + } + rc = ib_unmap_fmr(&unmap_list); + if (rc) + pr_warn("%s: ib_unmap_fmr failed (%i)\n", __func__, rc); + + /* ORDER: Now DMA unmap all of the req's MRs, and return +* them to the free MW list. +*/ + for (i = 0, nchunks = req->rl_nchunks; nchunks; nchunks--) { + seg = &req->rl_segments[i]; + + __fmr_dma_unmap(r_xprt, seg); + + i += seg->mr_nsegs; + seg->mr_nsegs = 0; + } + + req->rl_nchunks = 0; +} + /* Use the ib_unmap_fmr() verb to prevent further remote * access via RDMA READ or RDMA WRITE. */ @@ -231,6 +294,7 @@ fmr_op_destroy(struct rpcrdma_buffer *buf) const struct rpcrdma_memreg_ops rpcrdma_fmr_memreg_ops = { .ro_map = fmr_op_map, + .ro_unmap_sync = fmr_op_unmap_sync, .ro_unmap = fmr_op_unmap, .ro_open= fmr_op_open, .ro_maxpages= fmr_op_maxpages, -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
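One subtle point in fmr_op_unmap_sync() above is how the loop walks rl_segments: nchunks counts down over registered chunks while the array index jumps ahead by each chunk's mr_nsegs. A standalone model of that traversal, with illustrative types in place of the rpcrdma structures:

```c
#include <assert.h>

/* Stand-in for rpcrdma_mr_seg: only the fields the walk needs. */
struct seg {
    int mr_nsegs;   /* segments covered by this chunk's MR */
    int id;         /* marker so the test can see which segs were visited */
};

/* Visit the head segment of each chunk, advancing the index by that
 * chunk's segment count — the same shape as the loops in the patch. */
static int walk_chunks(const struct seg *segs, int nchunks, int *first_ids)
{
    int i = 0, n = 0;

    for (; nchunks; nchunks--) {
        first_ids[n++] = segs[i].id;  /* the chunk's head segment */
        i += segs[i].mr_nsegs;        /* jump to the next chunk */
    }
    return n;
}
```

Seeing the traversal in isolation makes it clearer why the patch runs it twice: once to gather every MR onto unmap_list for a single ib_unmap_fmr() call, and again afterwards to DMA-unmap and return the MWs.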
[PATCH v4 04/10] xprtrdma: Move struct ib_send_wr off the stack
For FRWR FASTREG and LOCAL_INV, move the ib_*_wr structure off the stack. This allows frwr_op_map and frwr_op_unmap to chain WRs together without limit to register or invalidate a set of MRs with a single ib_post_send(). (This will be for chaining LOCAL_INV requests). Signed-off-by: Chuck Lever Tested-by: Devesh Sharma --- net/sunrpc/xprtrdma/frwr_ops.c | 38 -- net/sunrpc/xprtrdma/xprt_rdma.h |4 2 files changed, 24 insertions(+), 18 deletions(-) diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c index ae2a241..660d0b6 100644 --- a/net/sunrpc/xprtrdma/frwr_ops.c +++ b/net/sunrpc/xprtrdma/frwr_ops.c @@ -318,7 +318,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg, struct rpcrdma_mw *mw; struct rpcrdma_frmr *frmr; struct ib_mr *mr; - struct ib_reg_wr reg_wr; + struct ib_reg_wr *reg_wr; struct ib_send_wr *bad_wr; int rc, i, n, dma_nents; u8 key; @@ -335,6 +335,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg, frmr = &mw->r.frmr; frmr->fr_state = FRMR_IS_VALID; mr = frmr->fr_mr; + reg_wr = &frmr->fr_regwr; if (nsegs > ia->ri_max_frmr_depth) nsegs = ia->ri_max_frmr_depth; @@ -380,19 +381,19 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg, key = (u8)(mr->rkey & 0x00FF); ib_update_fast_reg_key(mr, ++key); - reg_wr.wr.next = NULL; - reg_wr.wr.opcode = IB_WR_REG_MR; - reg_wr.wr.wr_id = (uintptr_t)mw; - reg_wr.wr.num_sge = 0; - reg_wr.wr.send_flags = 0; - reg_wr.mr = mr; - reg_wr.key = mr->rkey; - reg_wr.access = writing ? - IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE : - IB_ACCESS_REMOTE_READ; + reg_wr->wr.next = NULL; + reg_wr->wr.opcode = IB_WR_REG_MR; + reg_wr->wr.wr_id = (uintptr_t)mw; + reg_wr->wr.num_sge = 0; + reg_wr->wr.send_flags = 0; + reg_wr->mr = mr; + reg_wr->key = mr->rkey; + reg_wr->access = writing ? 
+IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE : +IB_ACCESS_REMOTE_READ; DECR_CQCOUNT(&r_xprt->rx_ep); - rc = ib_post_send(ia->ri_id->qp, ®_wr.wr, &bad_wr); + rc = ib_post_send(ia->ri_id->qp, ®_wr->wr, &bad_wr); if (rc) goto out_senderr; @@ -422,23 +423,24 @@ frwr_op_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg) struct rpcrdma_ia *ia = &r_xprt->rx_ia; struct rpcrdma_mw *mw = seg1->rl_mw; struct rpcrdma_frmr *frmr = &mw->r.frmr; - struct ib_send_wr invalidate_wr, *bad_wr; + struct ib_send_wr *invalidate_wr, *bad_wr; int rc, nsegs = seg->mr_nsegs; dprintk("RPC: %s: FRMR %p\n", __func__, mw); seg1->rl_mw = NULL; frmr->fr_state = FRMR_IS_INVALID; + invalidate_wr = &mw->r.frmr.fr_invwr; - memset(&invalidate_wr, 0, sizeof(invalidate_wr)); - invalidate_wr.wr_id = (unsigned long)(void *)mw; - invalidate_wr.opcode = IB_WR_LOCAL_INV; - invalidate_wr.ex.invalidate_rkey = frmr->fr_mr->rkey; + memset(invalidate_wr, 0, sizeof(*invalidate_wr)); + invalidate_wr->wr_id = (uintptr_t)mw; + invalidate_wr->opcode = IB_WR_LOCAL_INV; + invalidate_wr->ex.invalidate_rkey = frmr->fr_mr->rkey; DECR_CQCOUNT(&r_xprt->rx_ep); ib_dma_unmap_sg(ia->ri_device, frmr->sg, frmr->sg_nents, seg1->mr_dir); read_lock(&ia->ri_qplock); - rc = ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr); + rc = ib_post_send(ia->ri_id->qp, invalidate_wr, &bad_wr); read_unlock(&ia->ri_qplock); if (rc) goto out_err; diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h index 4197191..b1065ca 100644 --- a/net/sunrpc/xprtrdma/xprt_rdma.h +++ b/net/sunrpc/xprtrdma/xprt_rdma.h @@ -206,6 +206,10 @@ struct rpcrdma_frmr { enum rpcrdma_frmr_state fr_state; struct work_struct fr_work; struct rpcrdma_xprt *fr_xprt; + union { + struct ib_reg_wrfr_regwr; + struct ib_send_wr fr_invwr; + }; }; struct rpcrdma_fmr { -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at 
http://vger.kernel.org/majordomo-info.html
[PATCH v4 02/10] xprtrdma: xprt_rdma_free() must not release backchannel reqs
Preserve any rpcrdma_req that is attached to rpc_rqst's allocated for the backchannel. Otherwise, after all the pre-allocated backchannel req's are consumed, incoming backward calls start writing on freed memory. Somehow this hunk got lost. Fixes: f531a5dbc451 ('xprtrdma: Pre-allocate backward rpc_rqst') Signed-off-by: Chuck Lever Tested-by: Devesh Sharma --- net/sunrpc/xprtrdma/transport.c |3 +++ 1 file changed, 3 insertions(+) diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c index 8c545f7..740bddc 100644 --- a/net/sunrpc/xprtrdma/transport.c +++ b/net/sunrpc/xprtrdma/transport.c @@ -576,6 +576,9 @@ xprt_rdma_free(void *buffer) rb = container_of(buffer, struct rpcrdma_regbuf, rg_base[0]); req = rb->rg_owner; + if (req->rl_backchannel) + return; + r_xprt = container_of(req->rl_buffer, struct rpcrdma_xprt, rx_buf); dprintk("RPC: %s: called on 0x%p\n", __func__, req->rl_reply); -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 03/10] xprtrdma: Disable RPC/RDMA backchannel debugging messages
Clean up. Fixes: 63cae47005af ('xprtrdma: Handle incoming backward direction') Signed-off-by: Chuck Lever Tested-by: Devesh Sharma --- net/sunrpc/xprtrdma/backchannel.c | 16 +--- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c index 11d2cfb..cd31181 100644 --- a/net/sunrpc/xprtrdma/backchannel.c +++ b/net/sunrpc/xprtrdma/backchannel.c @@ -15,7 +15,7 @@ # define RPCDBG_FACILITY RPCDBG_TRANS #endif -#define RPCRDMA_BACKCHANNEL_DEBUG +#undef RPCRDMA_BACKCHANNEL_DEBUG static void rpcrdma_bc_free_rqst(struct rpcrdma_xprt *r_xprt, struct rpc_rqst *rqst) @@ -136,6 +136,7 @@ int xprt_rdma_bc_setup(struct rpc_xprt *xprt, unsigned int reqs) __func__); goto out_free; } + dprintk("RPC: %s: new rqst %p\n", __func__, rqst); rqst->rq_xprt = &r_xprt->rx_xprt; INIT_LIST_HEAD(&rqst->rq_list); @@ -216,12 +217,14 @@ int rpcrdma_bc_marshal_reply(struct rpc_rqst *rqst) rpclen = rqst->rq_svec[0].iov_len; +#ifdef RPCRDMA_BACKCHANNEL_DEBUG pr_info("RPC: %s: rpclen %zd headerp 0x%p lkey 0x%x\n", __func__, rpclen, headerp, rdmab_lkey(req->rl_rdmabuf)); pr_info("RPC: %s: RPC/RDMA: %*ph\n", __func__, (int)RPCRDMA_HDRLEN_MIN, headerp); pr_info("RPC: %s: RPC: %*ph\n", __func__, (int)rpclen, rqst->rq_svec[0].iov_base); +#endif req->rl_send_iov[0].addr = rdmab_addr(req->rl_rdmabuf); req->rl_send_iov[0].length = RPCRDMA_HDRLEN_MIN; @@ -265,6 +268,9 @@ void xprt_rdma_bc_free_rqst(struct rpc_rqst *rqst) { struct rpc_xprt *xprt = rqst->rq_xprt; + dprintk("RPC: %s: freeing rqst %p (req %p)\n", + __func__, rqst, rpcr_to_rdmar(rqst)); + smp_mb__before_atomic(); WARN_ON_ONCE(!test_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state)); clear_bit(RPC_BC_PA_IN_USE, &rqst->rq_bc_pa_state); @@ -329,9 +335,7 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt, struct rpc_rqst, rq_bc_pa_list); list_del(&rqst->rq_bc_pa_list); spin_unlock(&xprt->bc_pa_lock); -#ifdef RPCRDMA_BACKCHANNEL_DEBUG - pr_info("RPC: %s: using rqst %p\n", 
__func__, rqst); -#endif + dprintk("RPC: %s: using rqst %p\n", __func__, rqst); /* Prepare rqst */ rqst->rq_reply_bytes_recvd = 0; @@ -351,10 +355,8 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt, * direction reply. */ req = rpcr_to_rdmar(rqst); -#ifdef RPCRDMA_BACKCHANNEL_DEBUG - pr_info("RPC: %s: attaching rep %p to req %p\n", + dprintk("RPC: %s: attaching rep %p to req %p\n", __func__, rep, req); -#endif req->rl_reply = rep; /* Defeat the retransmit detection logic in send_request */ -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 05/10] xprtrdma: Introduce ro_unmap_sync method
In the current xprtrdma implementation, some memreg strategies implement ro_unmap synchronously (the MR is knocked down before the method returns) and some asynchronously (the MR will be knocked down and returned to the pool in the background). To guarantee the MR is truly invalid before the RPC consumer is allowed to resume execution, we need an unmap method that is always synchronous, invoked from the RPC/RDMA reply handler. The new method unmaps all MRs for an RPC. The existing ro_unmap method unmaps only one MR at a time. Signed-off-by: Chuck Lever Tested-by: Devesh Sharma --- net/sunrpc/xprtrdma/xprt_rdma.h |2 ++ 1 file changed, 2 insertions(+) diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h index b1065ca..512184d 100644 --- a/net/sunrpc/xprtrdma/xprt_rdma.h +++ b/net/sunrpc/xprtrdma/xprt_rdma.h @@ -367,6 +367,8 @@ struct rpcrdma_xprt; struct rpcrdma_memreg_ops { int (*ro_map)(struct rpcrdma_xprt *, struct rpcrdma_mr_seg *, int, bool); + void(*ro_unmap_sync)(struct rpcrdma_xprt *, +struct rpcrdma_req *); int (*ro_unmap)(struct rpcrdma_xprt *, struct rpcrdma_mr_seg *); int (*ro_open)(struct rpcrdma_ia *, -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
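The vtable change this patch makes — a new per-request, synchronous method alongside the old per-MR one — can be sketched with stand-in types. This is a toy model only; the real rpcrdma structures and registration strategies are in the patches that follow.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for rpcrdma_req: chunk count plus a counter for the test. */
struct req {
    int nchunks;
    int invalidated;
};

/* Stand-in for rpcrdma_memreg_ops: the old per-MR ->ro_unmap keeps its
 * slot, and the new ->ro_unmap_sync covers the whole request and must
 * not return until every MR is fenced. */
struct memreg_ops {
    int  (*ro_unmap)(struct req *r);       /* old: one MR at a time */
    void (*ro_unmap_sync)(struct req *r);  /* new: all MRs, synchronous */
};

/* A trivial strategy: invalidate every chunk before returning. */
static void demo_unmap_sync(struct req *r)
{
    while (r->nchunks) {
        r->nchunks--;
        r->invalidated++;
    }
}

static const struct memreg_ops demo_ops = {
    .ro_unmap_sync = demo_unmap_sync,
};
```

The reply handler only ever calls through the vtable, so FRWR, FMR, and all-physical registration each supply their own ro_unmap_sync in the following patches.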
[PATCH v4 01/10] xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)
Clean up. rb_lock critical sections added in rpcrdma_ep_post_extra_recv() should have first been converted to use normal spin_lock now that the reply handler is a work queue. The backchannel set up code should use the appropriate helper instead of open-coding a rb_recv_bufs list add. Problem introduced by glib patch re-ordering on my part. Fixes: f531a5dbc451 ('xprtrdma: Pre-allocate backward rpc_rqst') Signed-off-by: Chuck Lever Tested-by: Devesh Sharma --- net/sunrpc/xprtrdma/backchannel.c |6 +- net/sunrpc/xprtrdma/verbs.c |7 +++ 2 files changed, 4 insertions(+), 9 deletions(-) diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c index 2dcb44f..11d2cfb 100644 --- a/net/sunrpc/xprtrdma/backchannel.c +++ b/net/sunrpc/xprtrdma/backchannel.c @@ -84,9 +84,7 @@ out_fail: static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt, unsigned int count) { - struct rpcrdma_buffer *buffers = &r_xprt->rx_buf; struct rpcrdma_rep *rep; - unsigned long flags; int rc = 0; while (count--) { @@ -98,9 +96,7 @@ static int rpcrdma_bc_setup_reps(struct rpcrdma_xprt *r_xprt, break; } - spin_lock_irqsave(&buffers->rb_lock, flags); - list_add(&rep->rr_list, &buffers->rb_recv_bufs); - spin_unlock_irqrestore(&buffers->rb_lock, flags); + rpcrdma_recv_buffer_put(rep); } return rc; diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c index 650034b..f23f3d6 100644 --- a/net/sunrpc/xprtrdma/verbs.c +++ b/net/sunrpc/xprtrdma/verbs.c @@ -1329,15 +1329,14 @@ rpcrdma_ep_post_extra_recv(struct rpcrdma_xprt *r_xprt, unsigned int count) struct rpcrdma_ia *ia = &r_xprt->rx_ia; struct rpcrdma_ep *ep = &r_xprt->rx_ep; struct rpcrdma_rep *rep; - unsigned long flags; int rc; while (count--) { - spin_lock_irqsave(&buffers->rb_lock, flags); + spin_lock(&buffers->rb_lock); if (list_empty(&buffers->rb_recv_bufs)) goto out_reqbuf; rep = rpcrdma_buffer_get_rep_locked(buffers); - spin_unlock_irqrestore(&buffers->rb_lock, flags); + spin_unlock(&buffers->rb_lock); rc 
= rpcrdma_ep_post_recv(ia, ep, rep); if (rc) @@ -1347,7 +1346,7 @@ rpcrdma_ep_post_extra_recv(struct rpcrdma_xprt *r_xprt, unsigned int count) return 0; out_reqbuf: - spin_unlock_irqrestore(&buffers->rb_lock, flags); + spin_unlock(&buffers->rb_lock); pr_warn("%s: no extra receive buffers\n", __func__); return -ENOMEM; -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 00/10] NFS/RDMA client patches for 4.5
For 4.5, I'd like to address the send queue accounting and invalidation/unmap ordering issues Jason brought up a couple of months ago. Also available in the "nfs-rdma-for-4.5" topic branch of this git repo: git://git.linux-nfs.org/projects/cel/cel-2.6.git Or for browsing: http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.5 Changes since v3: - Dropped xprt_commit_rqst() - __frmr_dma_unmap now uses ib_dma_unmap_sg() - Use transparent union in struct rpcrdma_frmr Changes since v2: - Rebased on Christoph's ib_device_attr branch Changes since v1: - Rebased on v4.4-rc3 - Receive buffer safety margin patch dropped - Backchannel pr_err and pr_info converted to dprintk - Backchannel spin locks converted to work queue-safe locks - Fixed premature release of backchannel request buffer - NFSv4.1 callbacks tested with for-4.5 server --- Chuck Lever (10): xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock) xprtrdma: xprt_rdma_free() must not release backchannel reqs xprtrdma: Disable RPC/RDMA backchannel debugging messages xprtrdma: Move struct ib_send_wr off the stack xprtrdma: Introduce ro_unmap_sync method xprtrdma: Add ro_unmap_sync method for FRWR xprtrdma: Add ro_unmap_sync method for FMR xprtrdma: Add ro_unmap_sync method for all-physical registration xprtrdma: Invalidate in the RPC reply handler xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit'). net/sunrpc/xprtrdma/backchannel.c | 22 ++--- net/sunrpc/xprtrdma/fmr_ops.c | 64 + net/sunrpc/xprtrdma/frwr_ops.c | 174 +++- net/sunrpc/xprtrdma/physical_ops.c | 13 +++ net/sunrpc/xprtrdma/rpc_rdma.c | 16 +++ net/sunrpc/xprtrdma/transport.c|3 + net/sunrpc/xprtrdma/verbs.c| 13 +-- net/sunrpc/xprtrdma/xprt_rdma.h| 14 ++- 8 files changed, 271 insertions(+), 48 deletions(-) -- Chuck Lever -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/15] add Intel(R) X722 iWARP driver
On Wed, 2015-12-16 at 13:58 -0600, Faisal Latif wrote: > This series contains the addition of the i40iw.ko driver. This series should probably be respun against -next instead of linus' tree. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next V2 00/11] Add RoCE v2 support
On Wed, Dec 16, 2015 at 03:39:16PM -0500, Doug Ledford wrote: > These patches add the concept of duplicate GIDs that are differentiated > by their RoCE version (also called network type). and by vlan, and smac, and ... Basically everything network unique about a namespace has to be encapsulated in the gid index now. Each namespace thus has a subset of gid indexes that are valid for it to use for outbound and to receive packets on. roce didn't really have a way to work with net namespaces, AFAIK (?) so it gets a pass. But rocev2 very clearly does. It needs to address the issue outlined in commit b8cab5dab15ff5c2acc3faefdde28919b0341c11 (IB/cma: Accept connection without a valid netdev on RoCE) That means cma.c needs to get the gid index for every single CMA packet it processes and confirm that the associated net device is permitted to talk to the matching CM ID. It is no mistake there is a hole in cma.c waiting for this code, when Haggai did that work it was very clear in my mind that rocev2 would need to slot into here as well. > Jason's objections are this: > > 1) The lazy resolution is wrong. Wrong in the sense it doesn't actually exist in a usable form anyplace. cma.c does not do it, and absolutely must as discussed above. init_ah_from_wc needs to do it, and maybe does. It is hard to tell, perhaps a 'rdma_wc_to_dgid_index()' is actually open coded in there now. Just from a code readability perspective that is ugly. Then we get into the missing route handling in all places that construct a rocev2 AH... > Jason's preference would be that the above issues be resolved by > skipping the lazy resolution and instead doing proactive resolution > on I am happy with lazy resolution, that is a fine compromise. I just want to see kapi that makes sense here. It is very clear to me no kernel user can possibly correctly touch a rocev2 UD packet without retrieving the gid index, so we must have a kAPI for this. > namespace.
Or, at a minimum, at least make the information added to the > core API not something vendor specific like network_type, which is a > detail of the Mellanox implementation. I keep suggesting a rdma_wc_to_dgid_index() API call. Perhaps most of the code for this already seems to exist in init_ah_from_wc. > 1 - Actually, for any received packet with associated IP address > information. We've only enabled net namespaces for IP connections > between user space applications, for direct IB connections or for kernel > connections there is not yet any namespace support. IMHO, this is actually a problem for rocev2. IB needs more work to create a rdma namespace, but rocev2 does not. The kernel software side should certainly be completed as a quick follow on to this series, that means the use of gid_indexes at all uAPI access points needs to be checked for rocev2. HW support is needed to complete rocev2 containment, as the hw must check the gid_index on all directly posted WCs and *ALL* rx'd packets for a QP to ensure it is allowed. Some kind of warn on until that support is available would also be great. Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: warning in ext4 with nfs/rdma server
On Tue, Dec 08, 2015 at 07:31:56AM -0600, Steve Wise wrote: > > > > -Original Message- > > From: Chuck Lever [mailto:chuck.le...@oracle.com] > > Sent: Monday, December 07, 2015 9:45 AM > > To: Steve Wise > > Cc: linux-rdma@vger.kernel.org; Veeresh U. Kokatnur; Linux NFS Mailing List > > Subject: Re: warning in ext4 with nfs/rdma server > > > > Hi Steve- > > > > > On Dec 7, 2015, at 10:38 AM, Steve Wise > > > wrote: > > > > > > Hey Chuck/NFS developers, > > > > > > We're hitting this warning in ext4 on the linux-4.3 nfs server running > > > over RDMA/cxgb4. We're still gathering data, like if it > > > happens with NFS/TCP. But has anyone seen this warning on 4.3? Is it > > > likely to indicate some bug in the xprtrdma transport or > > > above it in NFS? > > > > Yes, please confirm with NFS/TCP. Thanks! > > > > The same thing happens with NFS/TCP, so this isn't related to xprtrdma. > > > > > > We can hit this running cthon tests over 2 mount points: > > > > > > - > > > #!/bin/bash > > > rm -rf /root/cthon04/loop_iter.txt > > > while [ 1 ] > > > do > > > { > > > > > > ./server -s -m /mnt/share1 -o rdma,port=20049,vers=4 -p /mnt/share1 -N 100 > > > 102.1.1.162 & > > > ./server -s -m /mnt/share2 -o > > > rdma,port=20049,vers=3,rsize=65535,wsize=65535 -p > > > /mnt/share2 -N 100 102.2.2.162 & > > > wait > > > echo "iteration $i" >>/root/cthon04/loop_iter.txt > > > date >>/root/cthon04/loop_iter.txt > > > } > > > done > > > -- > > > > > > Thanks, > > > > > > Steve. > > > > > > [ cut here ] > > > WARNING: CPU: 14 PID: 6689 at fs/ext4/inode.c:231 > > > ext4_evict_inode+0x41e/0x490 Looks like this is the WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count)); in ext4_evict_inode? Ext4 developers, any idea how that could happen? --b.
> > > [ext4]() > > > Modules linked in: nfsd(E) lockd(E) grace(E) nfs_acl(E) exportfs(E) > > > auth_rpcgss(E) rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_uverbs(E) rdma_cm(E) > > > ib_cm(E) ib_sa(E) ib_mad(E) iw_cxgb4(E) iw_cm(E) ib_core(E) ib_addr(E) > > > cxgb4(E) > > > autofs4(E) target_core_iblock(E) target_core_file(E) target_core_pscsi(E) > > > target_core_mod(E) configfs(E) bnx2fc(E) cnic(E) uio(E) fcoe(E) libfcoe(E) > > > 8021q(E) libfc(E) garp(E) stp(E) llc(E) cpufreq_ondemand(E) cachefiles(E) > > > fscache(E) ipv6(E) dm_mirror(E) dm_region_hash(E) dm_log(E) vhost_net(E) > > > macvtap(E) macvlan(E) vhost(E) tun(E) kvm(E) uinput(E) microcode(E) sg(E) > > > pcspkr(E) serio_raw(E) fam15h_power(E) k10temp(E) amd64_edac_mod(E) > > > edac_core(E) edac_mce_amd(E) i2c_piix4(E) igb(E) dca(E) i2c_algo_bit(E) > > > i2c_core(E) ptp(E) pps_core(E) scsi_transport_fc(E) acpi_cpufreq(E) > > > dm_mod(E) > > > ext4(E) jbd2(E) mbcache(E) sr_mod(E) cdrom(E) sd_mod(E) ahci(E) libahci(E) > > > [last unloaded: cxgb4] > > > CPU: 14 PID: 6689 Comm: nfsd Tainted: GE 4.3.0 #1 > > > Hardware name: Supermicro H8QGL/H8QGL, BIOS 3.512/19/2013 > > > 00e7 88400634fad8 812a4084 a00c96eb > > > 88400634fb18 81059fd5 88400634fbd8 > > > 880fd1a460c8 880fd1a461d8 880fd1a46008 88400634fbd8 > > > Call Trace: > > > [] dump_stack+0x48/0x64 > > > [] warn_slowpath_common+0x95/0xe0 > > > [] warn_slowpath_null+0x1a/0x20 > > > [] ext4_evict_inode+0x41e/0x490 [ext4] > > > [] evict+0xae/0x1a0 > > > [] iput_final+0xe5/0x170 > > > [] iput+0xa3/0xf0 > > > [] ? fsnotify_destroy_marks+0x64/0x80 > > > [] dentry_unlink_inode+0xa9/0xe0 > > > [] d_delete+0xa6/0xb0 > > > [] vfs_unlink+0x138/0x140 > > > [] nfsd_unlink+0x165/0x200 [nfsd] > > > [] ? lru_put_end+0x5c/0x70 [nfsd] > > > [] nfsd3_proc_remove+0x83/0x120 [nfsd] > > > [] nfsd_dispatch+0xdc/0x210 [nfsd] > > > [] svc_process_common+0x311/0x620 [sunrpc] > > > [] ? 
nfsd_set_nrthreads+0x1b0/0x1b0 [nfsd] > > > [] svc_process+0x128/0x1b0 [sunrpc] > > > [] nfsd+0xf3/0x160 [nfsd] > > > [] kthread+0xcc/0xf0 > > > [] ? schedule_tail+0x1e/0xc0 > > > [] ? kthread_freezable_should_stop+0x70/0x70 > > > [] ret_from_fork+0x3f/0x70 > > > [] ? kthread_freezable_should_stop+0x70/0x70 > > > ---[ end trace 39afe9aeef2cfb34 ]--- > > > [ cut here ] > > > > -- > > Chuck Lever > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 15/15] i40iw: changes for build of i40iw module
> --- a/include/uapi/rdma/rdma_netlink.h > +++ b/include/uapi/rdma/rdma_netlink.h > @@ -5,6 +5,7 @@ > > enum { > RDMA_NL_RDMA_CM = 1, > + RDMA_NL_I40IW, > RDMA_NL_NES, > RDMA_NL_C4IW, > RDMA_NL_LS, /* RDMA Local Services */ This changes the values for the existing RDMA_NL_NES, RDMA_NL_C4IW and RDMA_NL_LS symbols. Please add your new value at the end. And it should probably be a separate patch as it's not related to the build system and referenced by the earlier patches.
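The renumbering hazard is easy to see in a standalone sketch. The second enum below carries an `X_` prefix only so both versions can coexist in one file: inserting `RDMA_NL_I40IW` in the middle silently shifts every later client ID, breaking userspace built against the old uapi values.

```c
#include <assert.h>

/* Values as implied by the quoted uapi header before the patch. */
enum rdma_nl_old {
	RDMA_NL_RDMA_CM = 1,
	RDMA_NL_NES,	/* 2 */
	RDMA_NL_C4IW,	/* 3 */
	RDMA_NL_LS,	/* 4 */
};

/* After the proposed mid-enum insertion: every client below the new
 * entry gets a different value than old binaries were compiled with. */
enum rdma_nl_new {
	X_RDMA_NL_RDMA_CM = 1,
	X_RDMA_NL_I40IW,	/* 2 -- the value userspace knows as NES */
	X_RDMA_NL_NES,		/* now 3 */
	X_RDMA_NL_C4IW,		/* now 4 */
	X_RDMA_NL_LS,		/* now 5 */
};
```

Appending the new value at the end, as requested, leaves all existing values untouched.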
Re: [PATCH 01/15] i40e: Add support for client interface for IWARP driver
On Wed, 2015-12-16 at 13:58 -0600, Faisal Latif wrote: > From: Anjali Singhai Jain > > This patch adds a Client interface for i40iw driver > support. Also expands the Virtchannel to support messages > from i40evf driver on behalf of i40iwvf driver. [] > diff --git a/drivers/net/ethernet/intel/i40e/i40e_client.c > b/drivers/net/ethernet/intel/i40e/i40e_client.c [] > + * Contact Information: > + * e1000-devel Mailing List trivia: This should probably be: intel-wired-...@lists.osuosl.org
Re: [PATCH for-next V2 00/11] Add RoCE v2 support
On 12/16/2015 01:56 AM, Moni Shoua wrote: >> The part that bothers me about this is that this statement makes sense >> when just thinking about the spec, as you say. However, once you >> consider namespaces, security implications make this statement spec >> compliant, but still unacceptable. The spec itself is silent on >> namespaces. But, you guys wanted, and you got, namespace support. >> Since that's beyond spec, and carries security requirements, I think >> it's fair to say that from now on, the Linux kernel RDMA stack can no >> longer *just* be spec compliant. There are additional concerns that >> must always be addressed with new changes, and those are the namespace >> constraint preservation concerns. > > I can't object to that but I really would like to get an example of a > security risk. *This* is exactly the conversation to be having right now. The namespace support has been added to the core, and so now we need to define exactly what the impact of that is for new feature submissions like this one. More on that below... > So far, besides hearing that the way we choose to handle completions > is wrong, I didn't get a convincing example of how where it doesn't > work. Work is too fuzzy of a word to use here. It could mean "applications keep running", but that could be contrary to the namespace restrictions as it may be that the application should *not* have continued to run when namespace considerations were taken into account. > Moreover, regarding security, all we wanted is for HW to report > the L3 protocol (IB, IPv4, or IPv6) in the packet. This is data that > with some extra CPU cycles can be obtained from the 40 bytes that are > scattered to the receive bufs anyway. So, if there is a security hole > it exists from day one of the IB stack and this is not the time we > should insist on fixing it. No, not true. You are implementing RoCEv2 support, which is an entirely new feature. 
So this feature can't have had a security hole since forever as it has never been in the kernel before now. The objections are arising because of the ordering of events. Specifically, we added the core namespace support (even though it isn't complete, so far it's the infrastructure ready for various upper portions of the stack to start using, but it isn't a complete stack wide solution yet) first, and so this new feature, which will need to be a part of that namespace infrastructure that other parts of the IB stack can use, should have its namespace support already enabled (ideally, but if it didn't, it should at least have a clear plan for how to enable it in the future). Jason's objection is based upon this premise and the fact that a technical review of the code makes it look like the core namespace infrastructure becomes less complete, not more, with the inclusion of these patches. As I understand it, prior to these patches there would always be a 1:1 mapping of GID to gid_index because you would never have duplicate GIDs in the GID table. That allowed an easy, definitive 1:1 mapping of GID to namespace via the existing infrastructure for any received packet [1]. These patches add the concept of duplicate GIDs that are differentiated by their RoCE version (also called network type). So, now, an incoming packet could match a couple different gid_indexes and we need additional information to get back to the definitive 1:1 mapping. The submitted patches are designed around a lazy resolution of the namespace, preferring to defer the work of mapping the incoming packet to a unique namespace until that information is actually needed. To enable this lazy resolution, it provides the network_type so that the resolution can be done. This is a fair assessment of the current state of things and what these patches do, yes? Jason's objections are this: 1) The lazy resolution is wrong. 
2) The use of network_type as the additional information to get to the unique namespace is vendor specific cruft that shouldn't be part of the core kernel API. Jason's preference would be that the above issues be resolved by skipping the lazy resolution and instead doing proactive resolution on receipt of a packet and then probably just pass the namespace around instead of passing around the information needed to resolve the namespace. Or, at a minimum, at least make the information added to the core API not something vendor specific like network_type, which is a detail of the Mellanox implementation. Jason, is this accurate for your position? If everyone agrees that this is a fair statement of where we stand, then I'll continue my response. If not, please correct anything I have wrong above and I'll take that into my continued response. 1 - Actually, for any received packet with associated IP address information. We've only enabled net namespaces for IP connections between user space applications, for direct IB connections or for kernel connections there is not yet any namespace support. -- Doug Ledford
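To make the two positions concrete, here is a minimal user-space sketch of the eager-resolution approach described above. Every type and name in it is a hypothetical illustration, not the kernel's actual API: duplicate GIDs are disambiguated by network type once, at receive time, so that only the resolved namespace needs to be passed up the stack rather than vendor-specific detail like `network_type`.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical types for illustration only -- not the kernel's real API. */
enum net_type { NT_IB, NT_ROCE_V1, NT_ROCE_V2 };

struct gid_entry {
	const char *gid;    /* stand-in for a 128-bit GID */
	enum net_type type; /* what makes "duplicate" GIDs distinct */
	int netns_id;       /* namespace the entry belongs to */
};

/* Eager resolution: map (GID, network type) to a unique namespace at
 * packet receipt. Returns -1 when no entry matches. */
static int resolve_netns(const struct gid_entry *tbl, int n,
			 const char *gid, enum net_type type)
{
	for (int i = 0; i < n; i++)
		if (!strcmp(tbl[i].gid, gid) && tbl[i].type == type)
			return tbl[i].netns_id;
	return -1;
}
```

With duplicate GIDs in the table, the (GID, type) pair is what restores the 1:1 mapping the pre-patch code got from the GID alone.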
Re: [PATCH 15/15] i40iw: changes for build of i40iw module
Hi Faisal, [auto build test WARNING on net/master] [also build test WARNING on v4.4-rc5 next-20151216] [cannot apply to net-next/master] url: https://github.com/0day-ci/linux/commits/Faisal-Latif/add-Intel-R-X722-iWARP-driver/20151217-040340 config: arm-allyesconfig (attached as .config) reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=arm All warnings (new ones prefixed by >>): In file included from include/linux/byteorder/big_endian.h:4:0, from arch/arm/include/uapi/asm/byteorder.h:19, from include/asm-generic/bitops/le.h:5, from arch/arm/include/asm/bitops.h:340, from include/linux/bitops.h:36, from include/linux/kernel.h:10, from include/linux/skbuff.h:17, from include/linux/ip.h:20, from drivers/infiniband/hw/i40iw/i40iw_cm.c:36: drivers/infiniband/hw/i40iw/i40iw_cm.c: In function 'i40iw_init_tcp_ctx': include/uapi/linux/byteorder/big_endian.h:32:26: warning: large integer implicitly truncated to unsigned type [-Woverflow] #define __cpu_to_le32(x) ((__force __le32)__swab32((x))) ^ include/linux/byteorder/generic.h:87:21: note: in expansion of macro '__cpu_to_le32' #define cpu_to_le32 __cpu_to_le32 ^ >> drivers/infiniband/hw/i40iw/i40iw_cm.c:3513:18: note: in expansion of macro >> 'cpu_to_le32' tcp_info->ttl = cpu_to_le32(I40IW_DEFAULT_TTL); ^ vim +/cpu_to_le32 +3513 drivers/infiniband/hw/i40iw/i40iw_cm.c 2d207efd Faisal Latif 2015-12-16 3497 * i40iw_init_tcp_ctx - setup qp context 2d207efd Faisal Latif 2015-12-16 3498 * @cm_node: connection's node 2d207efd Faisal Latif 2015-12-16 3499 * @tcp_info: offload info for tcp 2d207efd Faisal Latif 2015-12-16 3500 * @iwqp: associate qp for the connection 2d207efd Faisal Latif 2015-12-16 3501 */ 2d207efd Faisal Latif 2015-12-16 3502 static void i40iw_init_tcp_ctx(struct i40iw_cm_node *cm_node, 2d207efd Faisal Latif 2015-12-16 3503 struct i40iw_tcp_offload_info 
*tcp_info, 2d207efd Faisal Latif 2015-12-16 3504 struct i40iw_qp *iwqp) 2d207efd Faisal Latif 2015-12-16 3505 { 2d207efd Faisal Latif 2015-12-16 3506 tcp_info->ipv4 = cm_node->ipv4; 2d207efd Faisal Latif 2015-12-16 3507 tcp_info->drop_ooo_seg = true; 2d207efd Faisal Latif 2015-12-16 3508 tcp_info->wscale = true; 2d207efd Faisal Latif 2015-12-16 3509 tcp_info->ignore_tcp_opt = true; 2d207efd Faisal Latif 2015-12-16 3510 tcp_info->ignore_tcp_uns_opt = true; 2d207efd Faisal Latif 2015-12-16 3511 tcp_info->no_nagle = false; 2d207efd Faisal Latif 2015-12-16 3512 2d207efd Faisal Latif 2015-12-16 @3513 tcp_info->ttl = cpu_to_le32(I40IW_DEFAULT_TTL); 2d207efd Faisal Latif 2015-12-16 3514 tcp_info->rtt_var = cpu_to_le32(I40IW_DEFAULT_RTT_VAR); 2d207efd Faisal Latif 2015-12-16 3515 tcp_info->ss_thresh = cpu_to_le32(I40IW_DEFAULT_SS_THRESH); 2d207efd Faisal Latif 2015-12-16 3516 tcp_info->rexmit_thresh = I40IW_DEFAULT_REXMIT_THRESH; 2d207efd Faisal Latif 2015-12-16 3517 2d207efd Faisal Latif 2015-12-16 3518 tcp_info->tcp_state = I40IW_TCP_STATE_ESTABLISHED; 2d207efd Faisal Latif 2015-12-16 3519 tcp_info->snd_wscale = cm_node->tcp_cntxt.snd_wscale; 2d207efd Faisal Latif 2015-12-16 3520 tcp_info->rcv_wscale = cm_node->tcp_cntxt.rcv_wscale; 2d207efd Faisal Latif 2015-12-16 3521 :: The code at line 3513 was first introduced by commit :: 2d207efd7fd9e5a190b2ebd6f077139412b0343f i40iw: add connection management code :: TO: Faisal Latif :: CC: 0day robot --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation
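The warning is easy to reproduce outside the kernel: `cpu_to_le32()` on a big-endian CPU byte-swaps its argument (modeled here with `__builtin_bswap32()`), and if the destination is narrower than 32 bits -- as the `-Woverflow` warning suggests `tcp_info->ttl` is -- the swap moves the meaningful byte out of the part that survives truncation.

```c
#include <assert.h>
#include <stdint.h>

/* On big-endian, cpu_to_le32() is a byte swap. Storing the swapped
 * value into a narrow (u8-sized) field keeps only the low byte, so a
 * small constant such as a TTL of 64 is silently destroyed -- which is
 * what the kbuild -Woverflow warning above is flagging. */
static uint8_t truncated_le32(uint32_t host_val)
{
	uint32_t swapped = __builtin_bswap32(host_val); /* big-endian cpu_to_le32 */
	return (uint8_t)swapped; /* implicit truncation */
}
```

Without the `cpu_to_le32()`, the same assignment keeps the value intact, which is why the conversion looks spurious for a single-byte field.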
Re: [PATCH 02/15] i40iw: add main, hdr, status
On Wed, 2015-12-16 at 13:58 -0600, Faisal Latif wrote: > i40iw_main.c contains routines for i40e <=> i40iw interface and setup. > i40iw.h is header file for main device data structures. > i40iw_status.h is for return status codes. [] > diff --git a/drivers/infiniband/hw/i40iw/i40iw.h > b/drivers/infiniband/hw/i40iw/i40iw.h [] > +#define i40iw_pr_err(fmt, args ...) pr_err("%s: error " fmt, __func__, ## > args) > + > +#define i40iw_pr_info(fmt, args ...) pr_info("%s: " fmt, __func__, ## args) > + > +#define i40iw_pr_warn(fmt, args ...) pr_warn("%s: " fmt, __func__, ## args) Using "error " in the output doesn't really add much as there's already a KERN_ERR with the output. Using __func__ hardly adds anything. Using netdev_ is generally preferred > + > +struct i40iw_cqp_request { > + struct cqp_commands_info info; > + wait_queue_head_t waitq; > + struct list_head list; > + atomic_t refcount; > + void (*callback_fcn)(struct i40iw_cqp_request*, u32); > + void *param; > + struct i40iw_cqp_compl_info compl_info; > + u8 waiting:1; > + u8 request_done:1; > + u8 dynamic:1; > + u8 polling:1; These bitfields might be better as bool
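The bool suggestion is more than style: a `:1` bitfield silently masks whatever is assigned to it, while a `bool` normalizes any nonzero value to true. A quick sketch (member names mirror the quoted struct; the single-byte bitfield type follows the patch's `u8` usage):

```c
#include <assert.h>
#include <stdbool.h>

/* One-bit flags as posted in the patch. */
struct cqp_flags_bitfield {
	unsigned char waiting:1;
	unsigned char request_done:1;
};

/* The reviewer's suggested form: same information, clearer semantics. */
struct cqp_flags_bool {
	bool waiting;
	bool request_done;
};
```

Assigning a value like 2 to the bitfield keeps only the low bit (so the flag reads back as 0), whereas the `bool` member reads back as true.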
[PATCH 10/15] i40iw: add hardware related header files
header files for hardware accesses Signed-off-by: Faisal Latif --- drivers/infiniband/hw/i40iw/i40iw_d.h| 1713 ++ drivers/infiniband/hw/i40iw/i40iw_p.h| 106 ++ drivers/infiniband/hw/i40iw/i40iw_type.h | 1308 +++ 3 files changed, 3127 insertions(+) create mode 100644 drivers/infiniband/hw/i40iw/i40iw_d.h create mode 100644 drivers/infiniband/hw/i40iw/i40iw_p.h create mode 100644 drivers/infiniband/hw/i40iw/i40iw_type.h diff --git a/drivers/infiniband/hw/i40iw/i40iw_d.h b/drivers/infiniband/hw/i40iw/i40iw_d.h new file mode 100644 index 000..f6668d7 --- /dev/null +++ b/drivers/infiniband/hw/i40iw/i40iw_d.h @@ -0,0 +1,1713 @@ +/*** +* +* Copyright (c) 2015 Intel Corporation. All rights reserved. +* +* This software is available to you under a choice of one of two +* licenses. You may choose to be licensed under the terms of the GNU +* General Public License (GPL) Version 2, available from the file +* COPYING in the main directory of this source tree, or the +* OpenFabrics.org BSD license below: +* +* Redistribution and use in source and binary forms, with or +* without modification, are permitted provided that the following +* conditions are met: +* +*- Redistributions of source code must retain the above +* copyright notice, this list of conditions and the following +* disclaimer. +* +*- Redistributions in binary form must reproduce the above +* copyright notice, this list of conditions and the following +* disclaimer in the documentation and/or other materials +* provided with the distribution. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS +* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +* SOFTWARE. +* +***/ + +#ifndef I40IW_D_H +#define I40IW_D_H + +#define I40IW_DB_ADDR_OFFSET(4 * 1024 * 1024 - 64 * 1024) +#define I40IW_VF_DB_ADDR_OFFSET (64 * 1024) + +#define I40IW_PUSH_OFFSET (4 * 1024 * 1024) +#define I40IW_PF_FIRST_PUSH_PAGE_INDEX 16 +#define I40IW_VF_PUSH_OFFSET((8 + 64) * 1024) +#define I40IW_VF_FIRST_PUSH_PAGE_INDEX 2 + +#define I40IW_PE_DB_SIZE_4M 1 +#define I40IW_PE_DB_SIZE_8M 2 + +#define I40IW_DDP_VER 1 +#define I40IW_RDMAP_VER 1 + +#define I40IW_RDMA_MODE_RDMAC 0 +#define I40IW_RDMA_MODE_IETF 1 + +#define I40IW_QP_STATE_INVALID 0 +#define I40IW_QP_STATE_IDLE 1 +#define I40IW_QP_STATE_RTS 2 +#define I40IW_QP_STATE_CLOSING 3 +#define I40IW_QP_STATE_RESERVED 4 +#define I40IW_QP_STATE_TERMINATE 5 +#define I40IW_QP_STATE_ERROR 6 + +#define I40IW_STAG_STATE_INVALID 0 +#define I40IW_STAG_STATE_VALID 1 + +#define I40IW_STAG_TYPE_SHARED 0 +#define I40IW_STAG_TYPE_NONSHARED 1 + +#define I40IW_MAX_USER_PRIORITY 8 + +#define LS_64_1(val, bits) ((u64)(uintptr_t)val << bits) +#define RS_64_1(val, bits) ((u64)(uintptr_t)val >> bits) +#define LS_32_1(val, bits) (u32)(val << bits) +#define RS_32_1(val, bits) (u32)(val >> bits) +#define I40E_HI_DWORD(x) ((u32)((((x) >> 16) >> 16) & 0xFFFFFFFF)) + +#define LS_64(val, field) (((u64)val << field ## _SHIFT) & (field ## _MASK)) + +#define RS_64(val, field) ((u64)(val & field ## _MASK) >> field ## _SHIFT) +#define LS_32(val, field) ((val << field ## _SHIFT) & (field ## _MASK)) +#define RS_32(val, field) ((val & field ## _MASK) >> field ## _SHIFT) + +#define TERM_DDP_LEN_TAGGED 14 +#define TERM_DDP_LEN_UNTAGGED 18 +#define TERM_RDMA_LEN 28 +#define RDMA_OPCODE_MASK0x0f +#define RDMA_READ_REQ_OPCODE1 +#define Q2_BAD_FRAME_OFFSET 72 +#define CQE_MAJOR_DRV 0x8000 + 
+#define I40IW_TERM_SENT 0x01 +#define I40IW_TERM_RCVD 0x02 +#define I40IW_TERM_DONE 0x04 +#define I40IW_MAC_HLEN 14 + +#define I40IW_INVALID_WQE_INDEX 0x + +#define I40IW_CQP_WAIT_POLL_REGS 1 +#define I40IW_CQP_WAIT_POLL_CQ 2 +#define I40IW_CQP_WAIT_EVENT 3 + +#define I40IW_CQP_INIT_WQE(wqe) memset(wqe, 0, 64) + +#define I40IW_GET_CURRENT_CQ_ELEMENT(_cq) \ + ( \ + &((_cq)->cq_base[I40IW_RING_GETCURRENT_HEAD((_cq)->cq_ring)]) \ + ) +#define I40IW_GET_CURRENT_EXTENDED_CQ_ELEMENT(_cq) \ + ( \ + &(((struct i40iw_extended_cqe *)\ + ((_cq)->cq_base))[I40IW_RING_GETCURRENT_HEAD((_cq)->cq_ring)]) \ + ) + +#define I40IW_GET_CURRENT_AEQ_ELEMENT(_aeq) \ + ( \ + &_aeq->aeqe_base[I40IW_RING_GETCURRENT_TAIL(_aeq->aeq_ring)] \ + ) + +#define I40IW_GET_CURRENT_CEQ_ELEMENT(_ceq) \ +
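The `field ## _SHIFT` / `field ## _MASK` token-pasting macros in this header can be exercised standalone. The `EXAMPLE_FIELD` definitions below are made up for illustration; note also that `val` is unparenthesized in the original macros, so callers must parenthesize expression arguments themselves.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;

/* A field defined in the same SHIFT/MASK style as i40iw_d.h
 * (EXAMPLE_FIELD itself is hypothetical). */
#define EXAMPLE_FIELD_SHIFT 8
#define EXAMPLE_FIELD_MASK  (0xffULL << EXAMPLE_FIELD_SHIFT)

/* Pack a value into its field position / extract it back out,
 * as the LS_64/RS_64 macros in the patch do. */
#define LS_64(val, field) (((u64)val << field ## _SHIFT) & (field ## _MASK))
#define RS_64(val, field) ((u64)(val & field ## _MASK) >> field ## _SHIFT)
```

Round-tripping a value through `LS_64` and `RS_64` with the same field definition returns the original value, provided it fits in the mask.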
[PATCH 06/15] i40iw: add hmc resource files
i40iw_hmc.[ch] are to manage hmc for the device. Acked-by: Anjali Singhai Jain Acked-by: Shannon Nelson Signed-off-by: Faisal Latif --- drivers/infiniband/hw/i40iw/i40iw_hmc.c | 823 drivers/infiniband/hw/i40iw/i40iw_hmc.h | 241 ++ 2 files changed, 1064 insertions(+) create mode 100644 drivers/infiniband/hw/i40iw/i40iw_hmc.c create mode 100644 drivers/infiniband/hw/i40iw/i40iw_hmc.h diff --git a/drivers/infiniband/hw/i40iw/i40iw_hmc.c b/drivers/infiniband/hw/i40iw/i40iw_hmc.c new file mode 100644 index 000..f4f4055 --- /dev/null +++ b/drivers/infiniband/hw/i40iw/i40iw_hmc.c @@ -0,0 +1,823 @@ +/*** +* +* Copyright (c) 2015 Intel Corporation. All rights reserved. +* +* This software is available to you under a choice of one of two +* licenses. You may choose to be licensed under the terms of the GNU +* General Public License (GPL) Version 2, available from the file +* COPYING in the main directory of this source tree, or the +* OpenFabrics.org BSD license below: +* +* Redistribution and use in source and binary forms, with or +* without modification, are permitted provided that the following +* conditions are met: +* +*- Redistributions of source code must retain the above +* copyright notice, this list of conditions and the following +* disclaimer. +* +*- Redistributions in binary form must reproduce the above +* copyright notice, this list of conditions and the following +* disclaimer in the documentation and/or other materials +* provided with the distribution. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS +* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +* SOFTWARE. 
+* +***/ + +#include "i40iw_osdep.h" +#include "i40iw_register.h" +#include "i40iw_status.h" +#include "i40iw_hmc.h" +#include "i40iw_d.h" +#include "i40iw_type.h" +#include "i40iw_p.h" +#include "i40iw_vf.h" +#include "i40iw_virtchnl.h" + +/** + * i40iw_find_sd_index_limit - finds segment descriptor index limit + * @hmc_info: pointer to the HMC configuration information structure + * @type: type of HMC resources we're searching + * @index: starting index for the object + * @cnt: number of objects we're trying to create + * @sd_idx: pointer to return index of the segment descriptor in question + * @sd_limit: pointer to return the maximum number of segment descriptors + * + * This function calculates the segment descriptor index and index limit + * for the resource defined by i40iw_hmc_rsrc_type. + */ + +static inline void i40iw_find_sd_index_limit(struct i40iw_hmc_info *hmc_info, +u32 type, +u32 idx, +u32 cnt, +u32 *sd_idx, +u32 *sd_limit) +{ + u64 fpm_addr, fpm_limit; + + fpm_addr = hmc_info->hmc_obj[(type)].base + + hmc_info->hmc_obj[type].size * idx; + fpm_limit = fpm_addr + hmc_info->hmc_obj[type].size * cnt; + *sd_idx = (u32)(fpm_addr / I40IW_HMC_DIRECT_BP_SIZE); + *sd_limit = (u32)((fpm_limit - 1) / I40IW_HMC_DIRECT_BP_SIZE); + *sd_limit += 1; +} + +/** + * i40iw_find_pd_index_limit - finds page descriptor index limit + * @hmc_info: pointer to the HMC configuration information struct + * @type: HMC resource type we're examining + * @idx: starting index for the object + * @cnt: number of objects we're trying to create + * @pd_index: pointer to return page descriptor index + * @pd_limit: pointer to return page descriptor index limit + * + * Calculates the page descriptor index and index limit for the resource + * defined by i40iw_hmc_rsrc_type. 
+ */ + +static inline void i40iw_find_pd_index_limit(struct i40iw_hmc_info *hmc_info, +u32 type, +u32 idx, +u32 cnt, +u32 *pd_idx, +u32 *pd_limit) +{ + u64 fpm_adr, fpm_limit; + + fpm_adr = hmc_info->hmc_obj[type].base + + hmc_info->hmc_obj[type].size * idx; + fpm_limit = fpm_adr + (hmc_info)->hmc_obj[(type)].size * (cnt); + *(pd_idx) = (u32)(fpm_adr / I40IW_HMC_PAGED_BP_SIZE); + *(pd_limit) = (u32)((fpm_limit - 1) / I40IW_HMC_PAGED_BP_SIZE);
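The descriptor-index arithmetic above amounts to mapping a byte range of fixed-size objects onto fixed-size segments. A standalone sketch of `i40iw_find_sd_index_limit()` (the 2 MB direct backing-page size is an assumption for illustration, not taken from the patch text):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t u64;
typedef uint32_t u32;

#define I40IW_HMC_DIRECT_BP_SIZE (2 * 1024 * 1024ULL) /* assumed 2 MB segments */

/* Same arithmetic as the quoted function: given objects of obj_size
 * bytes starting at 'base', map the range [idx, idx + cnt) onto segment
 * descriptor indexes; sd_limit is one past the last segment touched. */
static void find_sd_index_limit(u64 base, u64 obj_size, u32 idx, u32 cnt,
				u32 *sd_idx, u32 *sd_limit)
{
	u64 fpm_addr = base + obj_size * idx;
	u64 fpm_limit = fpm_addr + obj_size * cnt;

	*sd_idx = (u32)(fpm_addr / I40IW_HMC_DIRECT_BP_SIZE);
	*sd_limit = (u32)((fpm_limit - 1) / I40IW_HMC_DIRECT_BP_SIZE) + 1;
}
```

For example, 1024 objects of 4 KB starting at address 0 span exactly two 2 MB segments.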
[PATCH 11/15] i40iw: add X722 register file
X722 Hardware registers defines for iWARP component. Acked-by: Anjali Singhai Jain Acked-by: Shannon Nelson Signed-off-by: Faisal Latif --- drivers/infiniband/hw/i40iw/i40iw_register.h | 1027 ++ 1 file changed, 1027 insertions(+) create mode 100644 drivers/infiniband/hw/i40iw/i40iw_register.h diff --git a/drivers/infiniband/hw/i40iw/i40iw_register.h b/drivers/infiniband/hw/i40iw/i40iw_register.h new file mode 100644 index 000..01da7c5 --- /dev/null +++ b/drivers/infiniband/hw/i40iw/i40iw_register.h @@ -0,0 +1,1027 @@ +/*** +* +* Copyright (c) 2015 Intel Corporation. All rights reserved. +* +* This software is available to you under a choice of one of two +* licenses. You may choose to be licensed under the terms of the GNU +* General Public License (GPL) Version 2, available from the file +* COPYING in the main directory of this source tree, or the +* OpenFabrics.org BSD license below: +* +* Redistribution and use in source and binary forms, with or +* without modification, are permitted provided that the following +* conditions are met: +* +*- Redistributions of source code must retain the above +* copyright notice, this list of conditions and the following +* disclaimer. +* +*- Redistributions in binary form must reproduce the above +* copyright notice, this list of conditions and the following +* disclaimer in the documentation and/or other materials +* provided with the distribution. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS +* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +* SOFTWARE. 
+* +***/ + +#ifndef I40IW_REGISTER_H +#define I40IW_REGISTER_H + +#define I40E_GLGEN_STAT 0x000B612C /* Reset: POR */ + +#define I40E_PFHMC_PDINV 0x000C0300 /* Reset: PFR */ +#define I40E_PFHMC_PDINV_PMSDIDX_SHIFT 0 +#define I40E_PFHMC_PDINV_PMSDIDX_MASK I40E_MASK(0xFFF, I40E_PFHMC_PDINV_PMSDIDX_SHIFT) +#define I40E_PFHMC_PDINV_PMPDIDX_SHIFT 16 +#define I40E_PFHMC_PDINV_PMPDIDX_MASK I40E_MASK(0x1FF, I40E_PFHMC_PDINV_PMPDIDX_SHIFT) +#define I40E_PFHMC_SDCMD_PMSDWR_SHIFT 31 +#define I40E_PFHMC_SDCMD_PMSDWR_MASK I40E_MASK(0x1, I40E_PFHMC_SDCMD_PMSDWR_SHIFT) +#define I40E_PFHMC_SDDATALOW_PMSDVALID_SHIFT 0 +#define I40E_PFHMC_SDDATALOW_PMSDVALID_MASKI40E_MASK(0x1, I40E_PFHMC_SDDATALOW_PMSDVALID_SHIFT) +#define I40E_PFHMC_SDDATALOW_PMSDTYPE_SHIFT1 +#define I40E_PFHMC_SDDATALOW_PMSDTYPE_MASK I40E_MASK(0x1, I40E_PFHMC_SDDATALOW_PMSDTYPE_SHIFT) +#define I40E_PFHMC_SDDATALOW_PMSDBPCOUNT_SHIFT 2 +#define I40E_PFHMC_SDDATALOW_PMSDBPCOUNT_MASK I40E_MASK(0x3FF, I40E_PFHMC_SDDATALOW_PMSDBPCOUNT_SHIFT) + +#define I40E_PFINT_DYN_CTLN(_INTPF) (0x00034800 + ((_INTPF) * 4)) /* _i=0...511 */ /* Reset: PFR */ +#define I40E_PFINT_DYN_CTLN_INTENA_SHIFT 0 +#define I40E_PFINT_DYN_CTLN_INTENA_MASK I40E_MASK(0x1, I40E_PFINT_DYN_CTLN_INTENA_SHIFT) +#define I40E_PFINT_DYN_CTLN_CLEARPBA_SHIFT1 +#define I40E_PFINT_DYN_CTLN_CLEARPBA_MASK I40E_MASK(0x1, I40E_PFINT_DYN_CTLN_CLEARPBA_SHIFT) +#define I40E_PFINT_DYN_CTLN_ITR_INDX_SHIFT3 +#define I40E_PFINT_DYN_CTLN_ITR_INDX_MASK I40E_MASK(0x3, I40E_PFINT_DYN_CTLN_ITR_INDX_SHIFT) + +#define I40E_VFINT_DYN_CTLN1(_INTVF) (0x3800 + ((_INTVF) * 4)) /* _i=0...15 */ /* Reset: VFR */ +#define I40E_GLHMC_VFPDINV(_i) (0x000C8300 + ((_i) * 4)) /* _i=0...31 */ /* Reset: CORER */ + +#define I40E_PFHMC_PDINV_PMSDPARTSEL_SHIFT 15 +#define I40E_PFHMC_PDINV_PMSDPARTSEL_MASK I40E_MASK(0x1, I40E_PFHMC_PDINV_PMSDPARTSEL_SHIFT) +#define I40E_GLPCI_LBARCTRL0x000BE484 /* Reset: POR */ +#define I40E_GLPCI_LBARCTRL_PE_DB_SIZE_SHIFT4 +#define I40E_GLPCI_LBARCTRL_PE_DB_SIZE_MASK 
I40E_MASK(0x3, I40E_GLPCI_LBARCTRL_PE_DB_SIZE_SHIFT) + +#define I40E_PFPE_AEQALLOC 0x00131180 /* Reset: PFR */ +#define I40E_PFPE_AEQALLOC_AECOUNT_SHIFT 0 +#define I40E_PFPE_AEQALLOC_AECOUNT_MASK I40E_MASK(0x, I40E_PFPE_AEQALLOC_AECOUNT_SHIFT) +#define I40E_PFPE_CCQPHIGH 0x8200 /* Reset: PFR */ +#define I40E_PFPE_CCQPHIGH_PECCQPHIGH_SHIFT 0 +#define I40E_PFPE_CCQPHIGH_PECCQPHIGH_MASK I40E_MASK(0x, I40E_PFPE_CCQPHIGH_PECCQPHIGH_SHIFT) +#define I40E_PFPE_CCQPLOW 0x8180 /* Reset: PFR */ +#define I40E_PFPE_CCQPLOW_PECCQPLOW_SHIFT 0 +#define I40E_PFPE_CCQPLOW_PECCQPLOW_MASK I40E_MASK(0x, I40E_PFPE_CCQPLOW
[PATCH 03/15] i40iw: add connection management code
i40iw_cm.c i40iw_cm.h are used for connection management. Acked-by: Anjali Singhai Jain Acked-by: Shannon Nelson Signed-off-by: Faisal Latif --- drivers/infiniband/hw/i40iw/i40iw_cm.c | 4447 drivers/infiniband/hw/i40iw/i40iw_cm.h | 456 2 files changed, 4903 insertions(+) create mode 100644 drivers/infiniband/hw/i40iw/i40iw_cm.c create mode 100644 drivers/infiniband/hw/i40iw/i40iw_cm.h diff --git a/drivers/infiniband/hw/i40iw/i40iw_cm.c b/drivers/infiniband/hw/i40iw/i40iw_cm.c new file mode 100644 index 000..aa6263f --- /dev/null +++ b/drivers/infiniband/hw/i40iw/i40iw_cm.c @@ -0,0 +1,4447 @@ +/*** +* +* Copyright (c) 2015 Intel Corporation. All rights reserved. +* +* This software is available to you under a choice of one of two +* licenses. You may choose to be licensed under the terms of the GNU +* General Public License (GPL) Version 2, available from the file +* COPYING in the main directory of this source tree, or the +* OpenFabrics.org BSD license below: +* +* Redistribution and use in source and binary forms, with or +* without modification, are permitted provided that the following +* conditions are met: +* +*- Redistributions of source code must retain the above +* copyright notice, this list of conditions and the following +* disclaimer. +* +*- Redistributions in binary form must reproduce the above +* copyright notice, this list of conditions and the following +* disclaimer in the documentation and/or other materials +* provided with the distribution. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS +* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +* SOFTWARE. 
+* +***/ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "i40iw.h" + +static void i40iw_rem_ref_cm_node(struct i40iw_cm_node *); +static void i40iw_cm_post_event(struct i40iw_cm_event *event); +static void i40iw_disconnect_worker(struct work_struct *work); + +/** + * i40iw_free_sqbuf - put back puda buffer if refcount = 0 + * @dev: FPK device + * @buf: puda buffer to free + */ +void i40iw_free_sqbuf(struct i40iw_sc_dev *dev, void *bufp) +{ + struct i40iw_puda_buf *buf = (struct i40iw_puda_buf *)bufp; + struct i40iw_puda_rsrc *ilq = dev->ilq; + + if (!atomic_dec_return(&buf->refcount)) + i40iw_puda_ret_bufpool(ilq, buf); +} + +/** + * i40iw_derive_hw_ird_setting - Calculate IRD + * + * @cm_ird: IRD of connection's node + * + * The ird from the connection is rounded to a supported HW + * setting (2,8,32,64) and then encoded for ird_size field of + * qp_ctx + */ +static u8 i40iw_derive_hw_ird_setting(u16 cm_ird) +{ + u8 encoded_ird_size = 0; + u8 pof2_cm_ird = 1; + + /* round-off to next powerof2 */ + while (pof2_cm_ird < cm_ird) + pof2_cm_ird *= 2; + + /* ird_size field is encoded in qp_ctx */ + switch (pof2_cm_ird) { + case I40IW_HW_IRD_SETTING_64: + encoded_ird_size = 3; + break; + case I40IW_HW_IRD_SETTING_32: + case I40IW_HW_IRD_SETTING_16: + encoded_ird_size = 2; + break; + case I40IW_HW_IRD_SETTING_8: + case I40IW_HW_IRD_SETTING_4: + encoded_ird_size = 1; + break; + case I40IW_HW_IRD_SETTING_2: + default: + encoded_ird_size = 0; + break; + } + return encoded_ird_size; +} + +/** + * i40iw_record_ird_ord - Record IRD/ORD passed in + * @cm_node: connection's node + * @conn_ird: connection IRD + * @conn_ord: connection ORD + */ +static void i40iw_record_ird_ord(struct i40iw_cm_node *cm_node, u16 conn_ird, u16 conn_ord) +{ + if 
(conn_ird > I40IW_MAX_IRD_SIZE) + conn_ird = I40IW_MAX_IRD_SIZE; + + if (conn_ord > I40IW_MAX_ORD_SIZE) + conn_ord = I40IW_MAX_ORD_SIZE; + + cm_node->ird_size = conn_ird; + cm_node->ord_size = conn_ord; +} + +/** + * i40iw_copy_ip_ntohl - change network to host ip + * @dst: host ip + * @src: big endian + */ +void i40iw_copy_ip_ntohl(u32 *dst, __be32 *src) +{ + *dst++ = ntohl(*src++); + *dst++ = ntohl(*src++); + *dst++ = ntohl(*src++);
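The rounding-and-encoding in `i40iw_derive_hw_ird_setting()` can be checked in isolation; this reimplements the quoted logic with plain integer constants standing in for the `I40IW_HW_IRD_SETTING_*` defines:

```c
#include <assert.h>
#include <stdint.h>

/* Round the requested IRD up to a power of two, then encode the
 * supported HW sizes (2, 8, 32, 64) into the 2-bit ird_size field of
 * the QP context, as in the quoted i40iw_derive_hw_ird_setting(). */
static uint8_t derive_hw_ird_setting(uint16_t cm_ird)
{
	uint16_t pof2 = 1;

	/* round-off to next power of 2 */
	while (pof2 < cm_ird)
		pof2 *= 2;

	switch (pof2) {
	case 64:
		return 3;
	case 32:
	case 16:
		return 2;
	case 8:
	case 4:
		return 1;
	default: /* 2, 1, or anything out of range */
		return 0;
	}
}
```

Note that anything above 64 also falls through to the default encoding of 0, which is why the caller clamps the IRD first.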
[PATCH 05/15] i40iw: add pble resource files
i40iw_pble.[ch] to manage pble resource for iwarp clients. Acked-by: Anjali Singhai Jain Acked-by: Shannon Nelson Signed-off-by: Faisal Latif --- drivers/infiniband/hw/i40iw/i40iw_pble.c | 618 +++ drivers/infiniband/hw/i40iw/i40iw_pble.h | 131 +++ 2 files changed, 749 insertions(+) create mode 100644 drivers/infiniband/hw/i40iw/i40iw_pble.c create mode 100644 drivers/infiniband/hw/i40iw/i40iw_pble.h diff --git a/drivers/infiniband/hw/i40iw/i40iw_pble.c b/drivers/infiniband/hw/i40iw/i40iw_pble.c new file mode 100644 index 000..217997e --- /dev/null +++ b/drivers/infiniband/hw/i40iw/i40iw_pble.c @@ -0,0 +1,618 @@ +/*** +* +* Copyright (c) 2015 Intel Corporation. All rights reserved. +* +* This software is available to you under a choice of one of two +* licenses. You may choose to be licensed under the terms of the GNU +* General Public License (GPL) Version 2, available from the file +* COPYING in the main directory of this source tree, or the +* OpenFabrics.org BSD license below: +* +* Redistribution and use in source and binary forms, with or +* without modification, are permitted provided that the following +* conditions are met: +* +*- Redistributions of source code must retain the above +* copyright notice, this list of conditions and the following +* disclaimer. +* +*- Redistributions in binary form must reproduce the above +* copyright notice, this list of conditions and the following +* disclaimer in the documentation and/or other materials +* provided with the distribution. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS +* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +* SOFTWARE. 
+* +***/ + +#include "i40iw_status.h" +#include "i40iw_osdep.h" +#include "i40iw_register.h" +#include "i40iw_hmc.h" + +#include "i40iw_d.h" +#include "i40iw_type.h" +#include "i40iw_p.h" + +#include +#include +#include +#include "i40iw_pble.h" +#include "i40iw.h" + +struct i40iw_device; +static enum i40iw_status_code add_pble_pool(struct i40iw_sc_dev *dev, + struct i40iw_hmc_pble_rsrc *pble_rsrc); +static void i40iw_free_vmalloc_mem(struct i40iw_hw *hw, struct i40iw_chunk *chunk); + +/** + * i40iw_destroy_pble_pool - destroy pool during module unload + * @pble_rsrc: pble resources + */ +void i40iw_destroy_pble_pool(struct i40iw_sc_dev *dev, struct i40iw_hmc_pble_rsrc *pble_rsrc) +{ + struct list_head *clist; + struct list_head *tlist; + struct i40iw_chunk *chunk; + struct i40iw_pble_pool *pinfo = &pble_rsrc->pinfo; + + if (pinfo->pool) { + list_for_each_safe(clist, tlist, &pinfo->clist) { + chunk = list_entry(clist, struct i40iw_chunk, list); + if (chunk->type == I40IW_VMALLOC) + i40iw_free_vmalloc_mem(dev->hw, chunk); + kfree(chunk); + } + gen_pool_destroy(pinfo->pool); + } +} + +/** + * i40iw_hmc_init_pble - Initialize pble resources during module load + * @dev: i40iw_sc_dev struct + * @pble_rsrc: pble resources + */ +enum i40iw_status_code i40iw_hmc_init_pble(struct i40iw_sc_dev *dev, + struct i40iw_hmc_pble_rsrc *pble_rsrc) +{ + struct i40iw_hmc_info *hmc_info; + u32 fpm_idx = 0; + + hmc_info = dev->hmc_info; + pble_rsrc->fpm_base_addr = hmc_info->hmc_obj[I40IW_HMC_IW_PBLE].base; + /* Now start the pble' on 4k boundary */ + if (pble_rsrc->fpm_base_addr & 0xfff) + fpm_idx = (PAGE_SIZE - (pble_rsrc->fpm_base_addr & 0xfff)) >> 3; + + pble_rsrc->unallocated_pble = + hmc_info->hmc_obj[I40IW_HMC_IW_PBLE].cnt - fpm_idx; + pble_rsrc->next_fpm_addr = pble_rsrc->fpm_base_addr + (fpm_idx << 3); + + pble_rsrc->pinfo.pool_shift = POOL_SHIFT; + pble_rsrc->pinfo.pool = gen_pool_create(pble_rsrc->pinfo.pool_shift, -1); + INIT_LIST_HEAD(&pble_rsrc->pinfo.clist); + if 
(!pble_rsrc->pinfo.pool) + goto error; + + if (add_pble_pool(dev, pble_rsrc)) + goto error; + + return 0; + + error:i40iw_destroy_pble_pool(dev, pble_rsrc); + return I40IW_ERR_NO_MEMORY; +} + +/** + * get_sd_pd_idx - Returns sd index, pd index and rel_pd_idx from fpm address + * @ pble_rsrc:structure containing fpm address + * @ idx: where to return indexe
[PATCH 12/15] i40iw: user kernel shared files
i40iw_user.h and i40iw_uk.c are used by both user library as well as kernel requests. Acked-by: Anjali Singhai Jain Acked-by: Shannon Nelson Signed-off-by: Faisal Latif --- drivers/infiniband/hw/i40iw/i40iw_uk.c | 1213 ++ drivers/infiniband/hw/i40iw/i40iw_user.h | 438 +++ 2 files changed, 1651 insertions(+) create mode 100644 drivers/infiniband/hw/i40iw/i40iw_uk.c create mode 100644 drivers/infiniband/hw/i40iw/i40iw_user.h diff --git a/drivers/infiniband/hw/i40iw/i40iw_uk.c b/drivers/infiniband/hw/i40iw/i40iw_uk.c new file mode 100644 index 000..d7ae9e6 --- /dev/null +++ b/drivers/infiniband/hw/i40iw/i40iw_uk.c @@ -0,0 +1,1213 @@ +/*** +* +* Copyright (c) 2015 Intel Corporation. All rights reserved. +* +* This software is available to you under a choice of one of two +* licenses. You may choose to be licensed under the terms of the GNU +* General Public License (GPL) Version 2, available from the file +* COPYING in the main directory of this source tree, or the +* OpenFabrics.org BSD license below: +* +* Redistribution and use in source and binary forms, with or +* without modification, are permitted provided that the following +* conditions are met: +* +*- Redistributions of source code must retain the above +* copyright notice, this list of conditions and the following +* disclaimer. +* +*- Redistributions in binary form must reproduce the above +* copyright notice, this list of conditions and the following +* disclaimer in the documentation and/or other materials +* provided with the distribution. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS +* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +* SOFTWARE. +* +***/ + +#include "i40iw_osdep.h" +#include "i40iw_status.h" +#include "i40iw_d.h" +#include "i40iw_user.h" +#include "i40iw_register.h" + +static u32 nop_signature = 0x; + +/** + * i40iw_nop_1 - insert a nop wqe and move head. no post work + * @qp: hw qp ptr + */ +static enum i40iw_status_code i40iw_nop_1(struct i40iw_qp_uk *qp) +{ + u64 header, *wqe; + u64 *wqe_0 = NULL; + u32 wqe_idx, peek_head; + bool signaled = false; + + if (!qp->sq_ring.head) + return I40IW_ERR_PARAM; + + wqe_idx = I40IW_RING_GETCURRENT_HEAD(qp->sq_ring); + wqe = &qp->sq_base[wqe_idx << 2]; + peek_head = (qp->sq_ring.head + 1) % qp->sq_ring.size; + wqe_0 = &qp->sq_base[peek_head << 2]; + if (peek_head) + wqe_0[3] = LS_64(!qp->swqe_polarity, I40IWQPSQ_VALID); + else + wqe_0[3] = LS_64(qp->swqe_polarity, I40IWQPSQ_VALID); + + set_64bit_val(wqe, 0, 0); + set_64bit_val(wqe, 8, 0); + set_64bit_val(wqe, 16, 0); + + header = LS_64(I40IWQP_OP_NOP, I40IWQPSQ_OPCODE) | + LS_64(signaled, I40IWQPSQ_SIGCOMPL) | + LS_64(qp->swqe_polarity, I40IWQPSQ_VALID) | nop_signature++; + + wmb(); /* Memory barrier to ensure data is written before valid bit is set */ + + set_64bit_val(wqe, 24, header); + return 0; +} + +/** + * i40iw_qp_post_wr - post wr to hardware + * @qp: hw qp ptr + */ +void i40iw_qp_post_wr(struct i40iw_qp_uk *qp) +{ + u64 temp; + u32 hw_sq_tail; + u32 sw_sq_head; + + wmb(); /* make sure valid bit is written */ + + /* read the doorbell shadow area */ + get_64bit_val(qp->shadow_area, 0, &temp); + + rmb(); /* make sure read is finished */ + + hw_sq_tail = (u32)RS_64(temp, I40IW_QP_DBSA_HW_SQ_TAIL); + sw_sq_head = I40IW_RING_GETCURRENT_HEAD(qp->sq_ring); + if (sw_sq_head != hw_sq_tail) { + if (sw_sq_head > qp->initial_ring.head) { 
+ if ((hw_sq_tail >= qp->initial_ring.head) && + (hw_sq_tail < sw_sq_head)) { + db_wr32(qp->wqe_alloc_reg, qp->qp_id); + } + } else if (sw_sq_head != qp->initial_ring.head) { + if ((hw_sq_tail >= qp->initial_ring.head) || + (hw_sq_tail < sw_sq_head)) { + db_wr32(qp->wqe_alloc_reg, qp->qp_id); + } + } + } + + qp->initial_ring.head = qp->sq_ring.head; +} + +/** + * i40iw_qp_ring_push_db - ring qp doorbell + * @qp: hw qp ptr + * @wqe_idx: wqe index + */ +static void i40iw_qp_ring_push_db(struct i40iw_qp_uk *qp, u
[PATCH 07/15] i40iw: add hw and utils files
i40iw_hw.c, i40iw_utils.c and i40iw_osdep.h are files to handle interrupts and processing. Acked-by: Anjali Singhai Jain Acked-by: Shannon Nelson Signed-off-by: Faisal Latif --- drivers/infiniband/hw/i40iw/i40iw_hw.c| 705 + drivers/infiniband/hw/i40iw/i40iw_osdep.h | 235 ++ drivers/infiniband/hw/i40iw/i40iw_utils.c | 1233 + 3 files changed, 2173 insertions(+) create mode 100644 drivers/infiniband/hw/i40iw/i40iw_hw.c create mode 100644 drivers/infiniband/hw/i40iw/i40iw_osdep.h create mode 100644 drivers/infiniband/hw/i40iw/i40iw_utils.c diff --git a/drivers/infiniband/hw/i40iw/i40iw_hw.c b/drivers/infiniband/hw/i40iw/i40iw_hw.c new file mode 100644 index 000..13d0d9e --- /dev/null +++ b/drivers/infiniband/hw/i40iw/i40iw_hw.c @@ -0,0 +1,705 @@ +/*** +* +* Copyright (c) 2015 Intel Corporation. All rights reserved. +* +* This software is available to you under a choice of one of two +* licenses. You may choose to be licensed under the terms of the GNU +* General Public License (GPL) Version 2, available from the file +* COPYING in the main directory of this source tree, or the +* OpenFabrics.org BSD license below: +* +* Redistribution and use in source and binary forms, with or +* without modification, are permitted provided that the following +* conditions are met: +* +*- Redistributions of source code must retain the above +* copyright notice, this list of conditions and the following +* disclaimer. +* +*- Redistributions in binary form must reproduce the above +* copyright notice, this list of conditions and the following +* disclaimer in the documentation and/or other materials +* provided with the distribution. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS +* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +* SOFTWARE. +* +***/ + +#include +#include +#include +#include +#include +#include +#include + +#include "i40iw.h" + +/** + * i40iw_initialize_hw_resources - initialize hw resource during open + * @iwdev: iwarp device + */ +u32 i40iw_initialize_hw_resources(struct i40iw_device *iwdev) +{ + unsigned long num_pds; + u32 resources_size; + u32 max_mr; + u32 max_qp; + u32 max_cq; + u32 arp_table_size; + u32 mrdrvbits; + void *resource_ptr; + + max_qp = iwdev->sc_dev.hmc_info->hmc_obj[I40IW_HMC_IW_QP].cnt; + max_cq = iwdev->sc_dev.hmc_info->hmc_obj[I40IW_HMC_IW_CQ].cnt; + max_mr = iwdev->sc_dev.hmc_info->hmc_obj[I40IW_HMC_IW_MR].cnt; + arp_table_size = iwdev->sc_dev.hmc_info->hmc_obj[I40IW_HMC_IW_ARP].cnt; + iwdev->max_cqe = 0xF; + num_pds = max_qp * 4; + resources_size = sizeof(struct i40iw_arp_entry) * arp_table_size; + resources_size += sizeof(unsigned long) * BITS_TO_LONGS(max_qp); + resources_size += sizeof(unsigned long) * BITS_TO_LONGS(max_mr); + resources_size += sizeof(unsigned long) * BITS_TO_LONGS(max_cq); + resources_size += sizeof(unsigned long) * BITS_TO_LONGS(num_pds); + resources_size += sizeof(unsigned long) * BITS_TO_LONGS(arp_table_size); + resources_size += sizeof(struct i40iw_qp **) * max_qp; + iwdev->mem_resources = kzalloc(resources_size, GFP_KERNEL); + + if (!iwdev->mem_resources) + return -ENOMEM; + + iwdev->max_qp = max_qp; + iwdev->max_mr = max_mr; + iwdev->max_cq = max_cq; + iwdev->max_pd = num_pds; + iwdev->arp_table_size = arp_table_size; + iwdev->arp_table = (struct i40iw_arp_entry *)iwdev->mem_resources; + resource_ptr = iwdev->mem_resources + (sizeof(struct i40iw_arp_entry) * arp_table_size); + + iwdev->device_cap_flags = IB_DEVICE_LOCAL_DMA_LKEY | + IB_DEVICE_MEM_WINDOW | 
IB_DEVICE_MEM_MGT_EXTENSIONS; + + iwdev->allocated_qps = resource_ptr; + iwdev->allocated_cqs = &iwdev->allocated_qps[BITS_TO_LONGS(max_qp)]; + iwdev->allocated_mrs = &iwdev->allocated_cqs[BITS_TO_LONGS(max_cq)]; + iwdev->allocated_pds = &iwdev->allocated_mrs[BITS_TO_LONGS(max_mr)]; + iwdev->allocated_arps = &iwdev->allocated_pds[BITS_TO_LONGS(num_pds)]; + iwdev->qp_table = (struct i40iw_qp **)(&iwdev->allocated_arps[BITS_TO_LONGS(arp_table_size)]); + set_bit(0, iwdev->allocated_mrs); + set_bit(0, iwdev->allocated_qps); + set_bit(0, iwdev->allocated_cqs); + set_bit(0, iwdev->allocated_pds); + s
[PATCH 02/15] i40iw: add main, hdr, status
i40iw_main.c contains routines for i40e <=> i40iw interface and setup. i40iw.h is header file for main device data structures. i40iw_status.h is for return status codes. Acked-by: Anjali Singhai Jain Acked-by: Shannon Nelson Signed-off-by: Faisal Latif --- drivers/infiniband/hw/i40iw/i40iw.h| 573 + drivers/infiniband/hw/i40iw/i40iw_main.c | 1905 drivers/infiniband/hw/i40iw/i40iw_status.h | 100 ++ 3 files changed, 2578 insertions(+) create mode 100644 drivers/infiniband/hw/i40iw/i40iw.h create mode 100644 drivers/infiniband/hw/i40iw/i40iw_main.c create mode 100644 drivers/infiniband/hw/i40iw/i40iw_status.h diff --git a/drivers/infiniband/hw/i40iw/i40iw.h b/drivers/infiniband/hw/i40iw/i40iw.h new file mode 100644 index 000..c048f06b --- /dev/null +++ b/drivers/infiniband/hw/i40iw/i40iw.h @@ -0,0 +1,573 @@ +/*** +* +* Copyright (c) 2015 Intel Corporation. All rights reserved. +* +* This software is available to you under a choice of one of two +* licenses. You may choose to be licensed under the terms of the GNU +* General Public License (GPL) Version 2, available from the file +* COPYING in the main directory of this source tree, or the +* OpenFabrics.org BSD license below: +* +* Redistribution and use in source and binary forms, with or +* without modification, are permitted provided that the following +* conditions are met: +* +*- Redistributions of source code must retain the above +* copyright notice, this list of conditions and the following +* disclaimer. +* +*- Redistributions in binary form must reproduce the above +* copyright notice, this list of conditions and the following +* disclaimer in the documentation and/or other materials +* provided with the distribution. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS +* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +* SOFTWARE. +* +***/ + +#ifndef I40IW_IW_H +#define I40IW_IW_H +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "i40iw_status.h" +#include "i40iw_osdep.h" +#include "i40iw_d.h" +#include "i40iw_hmc.h" + +#include +#include "i40iw_type.h" +#include "i40iw_p.h" +#include "i40iw_ucontext.h" +#include "i40iw_pble.h" +#include "i40iw_verbs.h" +#include "i40iw_cm.h" +#include "i40iw_user.h" +#include "i40iw_puda.h" + +#define I40IW_FW_VERSION 2 +#define I40IW_HW_VERSION 2 + +#define I40IW_ARP_ADD 1 +#define I40IW_ARP_DELETE 2 +#define I40IW_ARP_RESOLVE 3 + +#define I40IW_MACIP_ADD 1 +#define I40IW_MACIP_DELETE 2 + +#define IW_CCQ_SIZE (I40IW_CQP_SW_SQSIZE_2048 + 1) +#define IW_CEQ_SIZE 2048 +#define IW_AEQ_SIZE 2048 + +#define RX_BUF_SIZE(1536 + 8) +#define IW_REG0_SIZE (4 * 1024) +#define IW_TX_TIMEOUT (6 * HZ) +#define IW_FIRST_QPN 1 +#define IW_SW_CONTEXT_ALIGN1024 + +#define MAX_DPC_ITERATIONS 128 + +#define I40IW_EVENT_TIMEOUT10 +#define I40IW_VCHNL_EVENT_TIMEOUT 10 + +#defineI40IW_NO_VLAN 0x +#defineI40IW_NO_QSET 0x + +/* access to mcast filter list */ +#define IW_ADD_MCAST false +#define IW_DEL_MCAST true + +#define I40IW_DRV_OPT_ENABLE_MPA_VER_0 0x0001 +#define I40IW_DRV_OPT_DISABLE_MPA_CRC 0x0002 +#define I40IW_DRV_OPT_DISABLE_FIRST_WRITE 0x0004 +#define I40IW_DRV_OPT_DISABLE_INTF 0x0008 +#define I40IW_DRV_OPT_ENABLE_MSI 0x0010 +#define I40IW_DRV_OPT_DUAL_LOGICAL_PORT0x0020 +#define I40IW_DRV_OPT_NO_INLINE_DATA 0x0080 +#define I40IW_DRV_OPT_DISABLE_INT_MOD 0x0100 +#define I40IW_DRV_OPT_DISABLE_VIRT_WQ 0x0200 +#define I40IW_DRV_OPT_ENABLE_PAU 0x0400 +#define 
I40IW_DRV_OPT_MCAST_LOGPORT_MAP0x0800 + +#define IW_HMC_OBJ_TYPE_NUM ARRAY_SIZE(iw_hmc_obj_types) +#define IW_CFG_FPM_QP_COUNT32768 + +#define I40IW_MTU_TO_MSS 40 +#define I40IW_DEFAULT_MSS 1460 + +struct i40iw_cqp_compl_info { + u32 op_ret_val; + u16 maj_err_code; + u16 min_err_code; + bool error; + u8 op_code; +}; + +#define CHECK_CQP_REQ(cqp_request) \ +{ \ + if (!cqp_request) { \ +
[PATCH 04/15] i40iw: add puda code
i40iw_puda.[ch] are files to handle iwarp connection packets as well as exception packets over multiple privilege mode uda queues. Acked-by: Anjali Singhai Jain Acked-by: Shannon Nelson Signed-off-by: Faisal Latif --- drivers/infiniband/hw/i40iw/i40iw_puda.c | 1443 ++ drivers/infiniband/hw/i40iw/i40iw_puda.h | 183 2 files changed, 1626 insertions(+) create mode 100644 drivers/infiniband/hw/i40iw/i40iw_puda.c create mode 100644 drivers/infiniband/hw/i40iw/i40iw_puda.h diff --git a/drivers/infiniband/hw/i40iw/i40iw_puda.c b/drivers/infiniband/hw/i40iw/i40iw_puda.c new file mode 100644 index 000..8e628af --- /dev/null +++ b/drivers/infiniband/hw/i40iw/i40iw_puda.c @@ -0,0 +1,1443 @@ +/*** +* +* Copyright (c) 2015 Intel Corporation. All rights reserved. +* +* This software is available to you under a choice of one of two +* licenses. You may choose to be licensed under the terms of the GNU +* General Public License (GPL) Version 2, available from the file +* COPYING in the main directory of this source tree, or the +* OpenFabrics.org BSD license below: +* +* Redistribution and use in source and binary forms, with or +* without modification, are permitted provided that the following +* conditions are met: +* +*- Redistributions of source code must retain the above +* copyright notice, this list of conditions and the following +* disclaimer. +* +*- Redistributions in binary form must reproduce the above +* copyright notice, this list of conditions and the following +* disclaimer in the documentation and/or other materials +* provided with the distribution. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS +* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +* SOFTWARE. +* +***/ + +#include "i40iw_osdep.h" +#include "i40iw_register.h" +#include "i40iw_status.h" +#include "i40iw_hmc.h" + +#include "i40iw_d.h" +#include "i40iw_type.h" +#include "i40iw_p.h" +#include "i40iw_puda.h" + +static void i40iw_ieq_receive(struct i40iw_sc_dev *dev, + struct i40iw_puda_buf *buf); +static void i40iw_ieq_tx_compl(struct i40iw_sc_dev *dev, void *sqwrid); +static void i40iw_ilq_putback_rcvbuf(struct i40iw_sc_qp *qp, u32 wqe_idx); +static enum i40iw_status_code i40iw_puda_replenish_rq(struct i40iw_puda_rsrc + *rsrc, bool initial); +/** + * i40iw_puda_get_listbuf - get buffer from puda list + * @list: list to use for buffers (ILQ or IEQ) + */ +static struct i40iw_puda_buf *i40iw_puda_get_listbuf(struct list_head *list) +{ + struct i40iw_puda_buf *buf = NULL; + + if (!list_empty(list)) { + buf = (struct i40iw_puda_buf *)list->next; + list_del((struct list_head *)&buf->list); + } + return buf; +} + +/** + * i40iw_puda_get_bufpool - return buffer from resource + * @rsrc: resource to use for buffer + */ +struct i40iw_puda_buf *i40iw_puda_get_bufpool(struct i40iw_puda_rsrc *rsrc) +{ + struct i40iw_puda_buf *buf = NULL; + struct list_head *list = &rsrc->bufpool; + unsigned long flags; + + spin_lock_irqsave(&rsrc->bufpool_lock, flags); + buf = i40iw_puda_get_listbuf(list); + if (buf) + rsrc->avail_buf_count--; + else + rsrc->stats_buf_alloc_fail++; + spin_unlock_irqrestore(&rsrc->bufpool_lock, flags); + return buf; +} + +/** + * i40iw_puda_ret_bufpool - return buffer to rsrc list + * @rsrc: resource to use for buffer + * @buf: buffer to return to resource + */ +void i40iw_puda_ret_bufpool(struct i40iw_puda_rsrc *rsrc, + struct i40iw_puda_buf *buf) +{ + unsigned long flags; + + 
spin_lock_irqsave(&rsrc->bufpool_lock, flags); + list_add(&buf->list, &rsrc->bufpool); + spin_unlock_irqrestore(&rsrc->bufpool_lock, flags); + rsrc->avail_buf_count++; +} + +/** + * i40iw_puda_post_recvbuf - set wqe for rcv buffer + * @rsrc: resource ptr + * @wqe_idx: wqe index to use + * @buf: puda buffer for rcv q + * @initial: flag if during init time + */ +static void i40iw_puda_post_recvbuf(struct i40iw_puda_rsrc *rsrc, u32 wqe_idx, + struct i40iw_puda_buf *buf, bool initial) +{ + u64 *wqe; + struct i40iw_sc_qp *qp = &rsrc->qp; + u64 offset24 = 0; + + qp->qp_uk.rq_wrid_array[wqe_idx] = (uintptr_t)buf; + wqe = &qp->qp_uk.rq_base[
[PATCH 01/15] i40e: Add support for client interface for IWARP driver
From: Anjali Singhai Jain This patch adds a Client interface for i40iw driver support. Also expands the Virtchannel to support messages from i40evf driver on behalf of i40iwvf driver. This client API is used by the i40iw and i40iwvf driver to access the core driver resources brokered by the i40e driver. Signed-off-by: Anjali Singhai Jain --- drivers/net/ethernet/intel/i40e/Makefile |1 + drivers/net/ethernet/intel/i40e/i40e.h | 22 + drivers/net/ethernet/intel/i40e/i40e_client.c | 1012 drivers/net/ethernet/intel/i40e/i40e_client.h | 232 + drivers/net/ethernet/intel/i40e/i40e_main.c| 115 ++- drivers/net/ethernet/intel/i40e/i40e_type.h|3 +- drivers/net/ethernet/intel/i40e/i40e_virtchnl.h| 34 + drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 247 - drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.h |4 + 9 files changed, 1657 insertions(+), 13 deletions(-) create mode 100644 drivers/net/ethernet/intel/i40e/i40e_client.c create mode 100644 drivers/net/ethernet/intel/i40e/i40e_client.h diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile index b4729ba..3b3c63e 100644 --- a/drivers/net/ethernet/intel/i40e/Makefile +++ b/drivers/net/ethernet/intel/i40e/Makefile @@ -41,6 +41,7 @@ i40e-objs := i40e_main.o \ i40e_diag.o \ i40e_txrx.o \ i40e_ptp.o \ + i40e_client.o \ i40e_virtchnl_pf.o i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h index 4dd3e26..1417ae8 100644 --- a/drivers/net/ethernet/intel/i40e/i40e.h +++ b/drivers/net/ethernet/intel/i40e/i40e.h @@ -59,6 +59,7 @@ #ifdef I40E_FCOE #include "i40e_fcoe.h" #endif +#include "i40e_client.h" #include "i40e_virtchnl.h" #include "i40e_virtchnl_pf.h" #include "i40e_txrx.h" @@ -178,6 +179,7 @@ struct i40e_lump_tracking { u16 search_hint; u16 list[0]; #define I40E_PILE_VALID_BIT 0x8000 +#define I40E_IWARP_IRQ_PILE_ID (I40E_PILE_VALID_BIT - 2) }; #define I40E_DEFAULT_ATR_SAMPLE_RATE 20 @@ 
-264,6 +266,8 @@ struct i40e_pf { #endif /* I40E_FCOE */ u16 num_lan_qps; /* num lan queues this PF has set up */ u16 num_lan_msix; /* num queue vectors for the base PF vsi */ + u16 num_iwarp_msix;/* num of iwarp vectors for this PF */ + int iwarp_base_vector; int queues_left; /* queues left unclaimed */ u16 rss_size; /* num queues in the RSS array */ u16 rss_size_max; /* HW defined max RSS queues */ @@ -313,6 +317,7 @@ struct i40e_pf { #define I40E_FLAG_16BYTE_RX_DESC_ENABLED BIT_ULL(13) #define I40E_FLAG_CLEAN_ADMINQ BIT_ULL(14) #define I40E_FLAG_FILTER_SYNC BIT_ULL(15) +#define I40E_FLAG_SERVICE_CLIENT_REQUESTED BIT_ULL(16) #define I40E_FLAG_PROCESS_MDD_EVENTBIT_ULL(17) #define I40E_FLAG_PROCESS_VFLR_EVENT BIT_ULL(18) #define I40E_FLAG_SRIOV_ENABLEDBIT_ULL(19) @@ -550,6 +555,8 @@ struct i40e_vsi { struct kobject *kobj; /* sysfs object */ bool current_isup; /* Sync 'link up' logging */ + void *priv; /* client driver data reference. */ + /* VSI specific handlers */ irqreturn_t (*irq_handler)(int irq, void *data); @@ -702,6 +709,10 @@ void i40e_vsi_setup_queue_map(struct i40e_vsi *vsi, struct i40e_vsi_context *ctxt, u8 enabled_tc, bool is_add); #endif +void i40e_service_event_schedule(struct i40e_pf *pf); +void i40e_notify_client_of_vf_msg(struct i40e_vsi *vsi, u32 vf_id, + u8 *msg, u16 len); + int i40e_vsi_control_rings(struct i40e_vsi *vsi, bool enable); int i40e_reconfig_rss_queues(struct i40e_pf *pf, int queue_count); struct i40e_veb *i40e_veb_setup(struct i40e_pf *pf, u16 flags, u16 uplink_seid, @@ -724,6 +735,17 @@ static inline void i40e_dbg_pf_exit(struct i40e_pf *pf) {} static inline void i40e_dbg_init(void) {} static inline void i40e_dbg_exit(void) {} #endif /* CONFIG_DEBUG_FS*/ +/* needed by client drivers */ +int i40e_lan_add_device(struct i40e_pf *pf); +int i40e_lan_del_device(struct i40e_pf *pf); +void i40e_client_subtask(struct i40e_pf *pf); +void i40e_notify_client_of_l2_param_changes(struct i40e_vsi *vsi); +void 
i40e_notify_client_of_netdev_open(struct i40e_vsi *vsi); +void i40e_notify_client_of_netdev_close(struct i40e_vsi *vsi, bool reset); +void i40e_notify_client_of_vf_enable(struct i40e_pf *pf, u32 num_vfs); +void i40e_notify_client_of_vf_reset(struct i40e_pf *pf, u32 vf_id); +int i40e_vf_client_capable(struct i40e_pf *pf, u32 vf_id, + enum i40e_client_type type); /** * i40e_irq_dynamic_enable - Enable default interrupt generation settings * @vsi:
[PATCH 09/15] i40iw: add file to handle cqp calls
i40iw_ctrl.c provides hardware wqe support and cqp. Acked-by: Anjali Singhai Jain Acked-by: Shannon Nelson Signed-off-by: Faisal Latif --- drivers/infiniband/hw/i40iw/i40iw_ctrl.c | 4774 ++ 1 file changed, 4774 insertions(+) create mode 100644 drivers/infiniband/hw/i40iw/i40iw_ctrl.c diff --git a/drivers/infiniband/hw/i40iw/i40iw_ctrl.c b/drivers/infiniband/hw/i40iw/i40iw_ctrl.c new file mode 100644 index 000..d0f2a23 --- /dev/null +++ b/drivers/infiniband/hw/i40iw/i40iw_ctrl.c @@ -0,0 +1,4774 @@ +/*** +* +* Copyright (c) 2015 Intel Corporation. All rights reserved. +* +* This software is available to you under a choice of one of two +* licenses. You may choose to be licensed under the terms of the GNU +* General Public License (GPL) Version 2, available from the file +* COPYING in the main directory of this source tree, or the +* OpenFabrics.org BSD license below: +* +* Redistribution and use in source and binary forms, with or +* without modification, are permitted provided that the following +* conditions are met: +* +*- Redistributions of source code must retain the above +* copyright notice, this list of conditions and the following +* disclaimer. +* +*- Redistributions in binary form must reproduce the above +* copyright notice, this list of conditions and the following +* disclaimer in the documentation and/or other materials +* provided with the distribution. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS +* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +* SOFTWARE. 
+* +***/ + +#include "i40iw_osdep.h" +#include "i40iw_register.h" +#include "i40iw_status.h" +#include "i40iw_hmc.h" + +#include "i40iw_d.h" +#include "i40iw_type.h" +#include "i40iw_p.h" +#include "i40iw_vf.h" +#include "i40iw_virtchnl.h" + +/** + * i40iw_insert_wqe_hdr - write wqe header + * @wqe: cqp wqe for header + * @header: header for the cqp wqe + */ +static inline void i40iw_insert_wqe_hdr(u64 *wqe, u64 header) +{ + wmb();/* make sure WQE is populated before valid bit is set */ + set_64bit_val(wqe, 24, header); +} + +/** + * i40iw_get_cqp_reg_info - get head and tail for cqp using registers + * @cqp: struct for cqp hw + * @val: cqp tail register value + * @tail:wqtail register value + * @error: cqp processing err + */ +static inline void i40iw_get_cqp_reg_info(struct i40iw_sc_cqp *cqp, + u32 *val, + u32 *tail, + u32 *error) +{ + if (cqp->dev->is_pf) { + *val = rd32(cqp->dev->hw, I40E_PFPE_CQPTAIL); + *tail = RS_32(*val, I40E_PFPE_CQPTAIL_WQTAIL); + *error = RS_32(*val, I40E_PFPE_CQPTAIL_CQP_OP_ERR); + } else { + *val = rd32(cqp->dev->hw, I40E_VFPE_CQPTAIL1); + *tail = RS_32(*val, I40E_VFPE_CQPTAIL_WQTAIL); + *error = RS_32(*val, I40E_VFPE_CQPTAIL_CQP_OP_ERR); + } +} + +/** + * i40iw_cqp_poll_registers - poll cqp registers + * @cqp: struct for cqp hw + * @tail:wqtail register value + * @count: how many times to try for completion + */ +static enum i40iw_status_code i40iw_cqp_poll_registers( + struct i40iw_sc_cqp *cqp, + u32 tail, + u32 count) +{ + u32 i = 0; + u32 newtail, error, val; + + while (i < count) { + i++; + i40iw_get_cqp_reg_info(cqp, &val, &newtail, &error); + if (error) { + error = (cqp->dev->is_pf) ? 
+rd32(cqp->dev->hw, I40E_PFPE_CQPERRCODES) : +rd32(cqp->dev->hw, I40E_VFPE_CQPERRCODES1); + return I40IW_ERR_CQP_COMPL_ERROR; + } + if (newtail != tail) { + /* SUCCESS */ + I40IW_RING_MOVE_TAIL(cqp->sq_ring); + return 0; + } + udelay(I40IW_SLEEP_COUNT); + } + return I40IW_ERR_TIMEOUT; +} + +/** + * i40iw_sc_parse_fpm_commit_buf - parse fpm commit buffer + * @buf: ptr to fpm commit buffer + * @info: ptr to i40iw_hmc_obj_info struct + * + * parses fpm commit info and copy base value + * of hmc objects in hmc_info + */ +static enum i40iw_statu
[PATCH 08/15] i40iw: add files for iwarp interface
i40iw_verbs.[ch] are to handle iwarp interface. Acked-by: Anjali Singhai Jain Acked-by: Shannon Nelson Signed-off-by: Faisal Latif --- drivers/infiniband/hw/i40iw/i40iw_ucontext.h | 110 ++ drivers/infiniband/hw/i40iw/i40iw_verbs.c| 2492 ++ drivers/infiniband/hw/i40iw/i40iw_verbs.h| 173 ++ 3 files changed, 2775 insertions(+) create mode 100644 drivers/infiniband/hw/i40iw/i40iw_ucontext.h create mode 100644 drivers/infiniband/hw/i40iw/i40iw_verbs.c create mode 100644 drivers/infiniband/hw/i40iw/i40iw_verbs.h diff --git a/drivers/infiniband/hw/i40iw/i40iw_ucontext.h b/drivers/infiniband/hw/i40iw/i40iw_ucontext.h new file mode 100644 index 000..5c65c25 --- /dev/null +++ b/drivers/infiniband/hw/i40iw/i40iw_ucontext.h @@ -0,0 +1,110 @@ +/* + * Copyright (c) 2006 - 2015 Intel Corporation. All rights reserved. + * Copyright (c) 2005 Topspin Communications. All rights reserved. + * Copyright (c) 2005 Cisco Systems. All rights reserved. + * Copyright (c) 2005 Open Grid Computing, Inc. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + *copyright notice, this list of conditions and the following + *disclaimer. + * + * - Redistributions in binary form must reproduce the above + *copyright notice, this list of conditions and the following + *disclaimer in the documentation and/or other materials + *provided with the distribution. 
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef I40IW_USER_CONTEXT_H +#define I40IW_USER_CONTEXT_H + +#include + +#define I40IW_ABI_USERSPACE_VER 4 +#define I40IW_ABI_KERNEL_VER4 +struct i40iw_alloc_ucontext_req { + __u32 reserved32; + __u8 userspace_ver; + __u8 reserved8[3]; +}; + +struct i40iw_alloc_ucontext_resp { + __u32 max_pds; /* maximum pds allowed for this user process */ + __u32 max_qps; /* maximum qps allowed for this user process */ + __u32 wq_size; /* size of the WQs (sq+rq) allocated to the mmaped area */ + __u8 kernel_ver; + __u8 reserved[3]; +}; + +struct i40iw_alloc_pd_resp { + __u32 pd_id; + __u8 reserved[4]; +}; + +struct i40iw_create_cq_req { + __u64 user_cq_buffer; + __u64 user_shadow_area; +}; + +struct i40iw_create_qp_req { + __u64 user_wqe_buffers; + __u64 user_compl_ctx; + + /* UDA QP PHB */ + __u64 user_sq_phb; /* place for VA of the sq phb buff */ + __u64 user_rq_phb; /* place for VA of the rq phb buff */ +}; + +enum i40iw_memreg_type { + IW_MEMREG_TYPE_MEM = 0x, + IW_MEMREG_TYPE_QP = 0x0001, + IW_MEMREG_TYPE_CQ = 0x0002, + IW_MEMREG_TYPE_MW = 0x0003, + IW_MEMREG_TYPE_FMR = 0x0004, + IW_MEMREG_TYPE_FMEM = 0x0005, +}; + +struct i40iw_mem_reg_req { + __u16 reg_type; /* Memory, QP or CQ */ + __u16 cq_pages; + __u16 rq_pages; + __u16 sq_pages; +}; + +struct i40iw_create_cq_resp { + __u32 cq_id; + __u32 cq_size; + __u32 mmap_db_index; + __u32 reserved; +}; + +struct i40iw_create_qp_resp { + __u32 qp_id; + __u32 actual_sq_size; + __u32 actual_rq_size; + __u32 i40iw_drv_opt; + __u16 
push_idx; + __u8 lsmm; + __u8 rsvd2; +}; + +#endif diff --git a/drivers/infiniband/hw/i40iw/i40iw_verbs.c b/drivers/infiniband/hw/i40iw/i40iw_verbs.c new file mode 100644 index 000..9bdd95f --- /dev/null +++ b/drivers/infiniband/hw/i40iw/i40iw_verbs.c @@ -0,0 +1,2492 @@ +/*** +* +* Copyright (c) 2015 Intel Corporation. All rights reserved. +* +* This software is available to you under a choice of one of two +* licenses. You may choose to be licensed under the terms of the GNU +* General Public License (GPL) Version 2, available from the file +* COPYING in the main directory of this source tree, or the +* OpenFabrics.org BSD license below: +* +* Redistribution and use
[PATCH 15/15] i40iw: changes for build of i40iw module
Update MAINTAINERS, Kconfig, Makefile and rdma_netlink.h to include i40iw Signed-off-by: Faisal Latif --- MAINTAINERS | 10 ++ drivers/infiniband/Kconfig | 1 + drivers/infiniband/hw/Makefile | 1 + include/uapi/rdma/rdma_netlink.h | 1 + 4 files changed, 13 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 69c8a9c..fc0ee30 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5600,6 +5600,16 @@ F: Documentation/networking/i40evf.txt F: drivers/net/ethernet/intel/ F: drivers/net/ethernet/intel/*/ +INTEL RDMA RNIC DRIVER +M: Faisal Latif +R: Chien Tin Tung +R: Mustafa Ismail +R: Shiraz Saleem +R: Tatyana Nikolova +L: linux-rdma@vger.kernel.org +S: Supported +F: drivers/infiniband/hw/i40iw/ + INTEL-MID GPIO DRIVER M: David Cohen L: linux-g...@vger.kernel.org diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index aa26f3c..7ddd81f 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -58,6 +58,7 @@ source "drivers/infiniband/hw/mthca/Kconfig" source "drivers/infiniband/hw/qib/Kconfig" source "drivers/infiniband/hw/cxgb3/Kconfig" source "drivers/infiniband/hw/cxgb4/Kconfig" +source "drivers/infiniband/hw/i40iw/Kconfig" source "drivers/infiniband/hw/mlx4/Kconfig" source "drivers/infiniband/hw/mlx5/Kconfig" source "drivers/infiniband/hw/nes/Kconfig" diff --git a/drivers/infiniband/hw/Makefile b/drivers/infiniband/hw/Makefile index aded2a5..c7ad0a4 100644 --- a/drivers/infiniband/hw/Makefile +++ b/drivers/infiniband/hw/Makefile @@ -2,6 +2,7 @@ obj-$(CONFIG_INFINIBAND_MTHCA) += mthca/ obj-$(CONFIG_INFINIBAND_QIB) += qib/ obj-$(CONFIG_INFINIBAND_CXGB3) += cxgb3/ obj-$(CONFIG_INFINIBAND_CXGB4) += cxgb4/ +obj-$(CONFIG_INFINIBAND_I40IW) += i40iw/ obj-$(CONFIG_MLX4_INFINIBAND) += mlx4/ obj-$(CONFIG_MLX5_INFINIBAND) += mlx5/ obj-$(CONFIG_INFINIBAND_NES) += nes/ diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h index c19a5dc..56bafbe 100644 --- a/include/uapi/rdma/rdma_netlink.h +++ b/include/uapi/rdma/rdma_netlink.h
@@ -5,6 +5,7 @@ enum { RDMA_NL_RDMA_CM = 1, + RDMA_NL_I40IW, RDMA_NL_NES, RDMA_NL_C4IW, RDMA_NL_LS, /* RDMA Local Services */ -- 2.5.3 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 13/15] i40iw: virtual channel handling files
i40iw_vf.[ch] and i40iw_virtchnl[ch] are used for virtual channel support for iWARP VF module. Acked-by: Anjali Singhai Jain Acked-by: Shannon Nelson Signed-off-by: Faisal Latif --- drivers/infiniband/hw/i40iw/i40iw_vf.c | 85 +++ drivers/infiniband/hw/i40iw/i40iw_vf.h | 62 +++ drivers/infiniband/hw/i40iw/i40iw_virtchnl.c | 750 +++ drivers/infiniband/hw/i40iw/i40iw_virtchnl.h | 124 + 4 files changed, 1021 insertions(+) create mode 100644 drivers/infiniband/hw/i40iw/i40iw_vf.c create mode 100644 drivers/infiniband/hw/i40iw/i40iw_vf.h create mode 100644 drivers/infiniband/hw/i40iw/i40iw_virtchnl.c create mode 100644 drivers/infiniband/hw/i40iw/i40iw_virtchnl.h diff --git a/drivers/infiniband/hw/i40iw/i40iw_vf.c b/drivers/infiniband/hw/i40iw/i40iw_vf.c new file mode 100644 index 000..39bb0ca --- /dev/null +++ b/drivers/infiniband/hw/i40iw/i40iw_vf.c @@ -0,0 +1,85 @@ +/*** +* +* Copyright (c) 2015 Intel Corporation. All rights reserved. +* +* This software is available to you under a choice of one of two +* licenses. You may choose to be licensed under the terms of the GNU +* General Public License (GPL) Version 2, available from the file +* COPYING in the main directory of this source tree, or the +* OpenFabrics.org BSD license below: +* +* Redistribution and use in source and binary forms, with or +* without modification, are permitted provided that the following +* conditions are met: +* +*- Redistributions of source code must retain the above +* copyright notice, this list of conditions and the following +* disclaimer. +* +*- Redistributions in binary form must reproduce the above +* copyright notice, this list of conditions and the following +* disclaimer in the documentation and/or other materials +* provided with the distribution. +* +* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +* NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS +* BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +* ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +* SOFTWARE. +* +***/ + +#include "i40iw_osdep.h" +#include "i40iw_register.h" +#include "i40iw_status.h" +#include "i40iw_hmc.h" +#include "i40iw_d.h" +#include "i40iw_type.h" +#include "i40iw_p.h" +#include "i40iw_vf.h" + +/** + * i40iw_manage_vf_pble_bp - manage vf pble + * @cqp: cqp for cqp' sq wqe + * @info: pble info + * @scratch: pointer for completion + * @post_sq: to post and ring + */ +enum i40iw_status_code i40iw_manage_vf_pble_bp(struct i40iw_sc_cqp *cqp, + struct i40iw_manage_vf_pble_info *info, + u64 scratch, + bool post_sq) +{ + u64 *wqe; + u64 temp, header, pd_pl_pba = 0; + + wqe = i40iw_sc_cqp_get_next_send_wqe(cqp, scratch); + if (!wqe) + return I40IW_ERR_RING_FULL; + + temp = LS_64((info->pd_entry_cnt), I40IW_CQPSQ_MVPBP_PD_ENTRY_CNT) | + LS_64((info->first_pd_index), I40IW_CQPSQ_MVPBP_FIRST_PD_INX) | + LS_64((info->sd_index), I40IW_CQPSQ_MVPBP_SD_INX); + set_64bit_val(wqe, 16, temp); + + header = LS_64((info->inv_pd_ent ? 
1 : 0), I40IW_CQPSQ_MVPBP_INV_PD_ENT) | + LS_64(I40IW_CQP_OP_MANAGE_VF_PBLE_BP, I40IW_CQPSQ_OPCODE) | + LS_64(cqp->polarity, I40IW_CQPSQ_WQEVALID); + set_64bit_val(wqe, 24, header); + + pd_pl_pba = LS_64(info->pd_pl_pba >> 3, I40IW_CQPSQ_MVPBP_PD_PLPBA); + set_64bit_val(wqe, 32, pd_pl_pba); + + i40iw_debug_buf(cqp->dev, I40IW_DEBUG_WQE, "MANAGE VF_PBLE_BP WQE", wqe, I40IW_CQP_WQE_SIZE * 8); + + if (post_sq) + i40iw_sc_cqp_post_sq(cqp); + return 0; +} + +struct i40iw_vf_cqp_ops iw_vf_cqp_ops = { + i40iw_manage_vf_pble_bp +}; diff --git a/drivers/infiniband/hw/i40iw/i40iw_vf.h b/drivers/infiniband/hw/i40iw/i40iw_vf.h new file mode 100644 index 000..cfe112d --- /dev/null +++ b/drivers/infiniband/hw/i40iw/i40iw_vf.h @@ -0,0 +1,62 @@ +/*** +* +* Copyright (c) 2015 Intel Corporation. All rights reserved. +* +* This software is available to you under a choice of one of two +* licenses. You may choose to be licensed under the terms of the GNU +* General Public License (GPL) Version 2, available from the file +* COPYING in the main directory of this source tree, or the +* OpenFabrics.org BSD license below: +* +* Redistribution and use in source and binary forms, with
[PATCH 14/15] i40iw: Kconfig and Kbuild for iwarp module
Kconfig and Kbuild needed to build iwarp module. Signed-off-by: Faisal Latif --- drivers/infiniband/hw/i40iw/Kbuild | 43 + drivers/infiniband/hw/i40iw/Kconfig | 7 ++ 2 files changed, 50 insertions(+) create mode 100644 drivers/infiniband/hw/i40iw/Kbuild create mode 100644 drivers/infiniband/hw/i40iw/Kconfig diff --git a/drivers/infiniband/hw/i40iw/Kbuild b/drivers/infiniband/hw/i40iw/Kbuild new file mode 100644 index 000..ba84a78 --- /dev/null +++ b/drivers/infiniband/hw/i40iw/Kbuild @@ -0,0 +1,43 @@ + +# +# * Copyright (c) 2015 Intel Corporation. All rights reserved. +# * +# * This software is available to you under a choice of one of two +# * licenses. You may choose to be licensed under the terms of the GNU +# * General Public License (GPL) Version 2, available from the file +# * COPYING in the main directory of this source tree, or the +# * OpenFabrics.org BSD license below: +# * +# * Redistribution and use in source and binary forms, with or +# * without modification, are permitted provided that the following +# * conditions are met: +# * +# *- Redistributions of source code must retain the above +# *copyright notice, this list of conditions and the following +# *disclaimer. +# * +# *- Redistributions in binary form must reproduce the above +# *copyright notice, this list of conditions and the following +# *disclaimer in the documentation and/or other materials +# *provided with the distribution. +# * +# * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, +# * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF +# * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +# * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS +# * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +# * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN +# * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# * SOFTWARE. 
+# + + +ccflags-y := -Idrivers/net/ethernet/intel/i40e + +obj-m += i40iw.o + +i40iw-objs :=\ + i40iw_cm.o i40iw_ctrl.o \ + i40iw_hmc.o i40iw_hw.o i40iw_main.o \ + i40iw_pble.o i40iw_puda.o i40iw_uk.o i40iw_utils.o \ + i40iw_verbs.o i40iw_virtchnl.o i40iw_vf.o diff --git a/drivers/infiniband/hw/i40iw/Kconfig b/drivers/infiniband/hw/i40iw/Kconfig new file mode 100644 index 000..6e7d27a --- /dev/null +++ b/drivers/infiniband/hw/i40iw/Kconfig @@ -0,0 +1,7 @@ +config INFINIBAND_I40IW + tristate "Intel(R) Ethernet X722 iWARP Driver" + depends on INET && I40E + select GENERIC_ALLOCATOR + ---help--- + Intel(R) Ethernet X722 iWARP Driver + INET && I40IW && INFINIBAND && I40E -- 2.5.3 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 00/15] add Intel(R) X722 iWARP driver
This series contains the addition of the i40iw.ko driver. This driver provides iWARP RDMA functionality for the Intel(R) X722 Ethernet controller for PCI Physical Functions. It also has support for a Virtual Function driver (i40iwvf.ko) that will be part of a separate patch series. It cooperates with the Intel(R) X722 base driver (i40e.ko) to allocate resources and program the controller. This series includes 1 patch to i40e.ko to provide interface support to i40iw.ko. The interface provides a driver registration mechanism, resource allocations, and device reset coordination mechanisms. This patch series is based on Doug Ledford's /github.com/dledford/linux.git Anjali Singhai Jain (1): net/ethernet/intel/i40e: Add support for client interface for IWARP driver Faisal Latif (14): infiniband/hw/i40iw: add main, hdr, status infiniband/hw/i40iw: add connection management code infiniband/hw/i40iw: add puda code infiniband/hw/i40iw: add pble resource files infiniband/hw/i40iw: add hmc resource files infiniband/hw/i40iw: add hw and utils files infiniband/hw/i40iw: add files for iwarp interface infiniband/hw/i40iw: add file to handle cqp calls infiniband/hw/i40iw: add hardware related header files infiniband/hw/i40iw: add X722 register file infiniband/hw/i40iw: user kernel shared files infiniband/hw/i40iw: virtual channel handling files infiniband/hw/i40iw: Kconfig and Kbuild for iwarp module infiniband/hw/i40iw: changes for build of i40iw module
Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set
On Wed, 16 Dec 2015, Christoph Lameter wrote: > DRAFT: This is missing the check if this device supports > extended counters. Found some time and here is the patch with the detection of the extended attribute through sending a mad request. Untested. Got the info on how to do the proper mad request from an earlier patch by Or in 2011. Subject: IB Core: Display extended counter set if available V2 Check if the extended counters are available and if so create the proper extended and additional counters. Signed-off-by: Christoph Lameter Index: linux/drivers/infiniband/core/sysfs.c === --- linux.orig/drivers/infiniband/core/sysfs.c +++ linux/drivers/infiniband/core/sysfs.c @@ -39,6 +39,7 @@ #include #include +#include struct ib_port { struct kobject kobj; @@ -65,6 +66,7 @@ struct port_table_attribute { struct port_attribute attr; charname[8]; int index; + int attr_id; }; static ssize_t port_attr_show(struct kobject *kobj, @@ -314,24 +316,33 @@ static ssize_t show_port_pkey(struct ib_ #define PORT_PMA_ATTR(_name, _counter, _width, _offset) \ struct port_table_attribute port_pma_attr_##_name = { \ .attr = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),\ - .index = (_offset) | ((_width) << 16) | ((_counter) << 24) \ + .index = (_offset) | ((_width) << 16) | ((_counter) << 24), \ + .attr_id = IB_PMA_PORT_COUNTERS , \ } -static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr, - char *buf) +#define PORT_PMA_ATTR_EXT(_name, _width, _offset) \ +struct port_table_attribute port_pma_attr_ext_##_name = { \ + .attr = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),\ + .index = (_offset) | ((_width) << 16), \ + .attr_id = IB_PMA_PORT_COUNTERS_EXT , \ +} + + +/* + * Get a MAD block of data. + * Returns error code or the number of bytes retrieved. 
+ */ +static int get_mad(struct ib_device *dev, int port_num, int attr, + void *data, int offset, size_t size) { - struct port_table_attribute *tab_attr = - container_of(attr, struct port_table_attribute, attr); - int offset = tab_attr->index & 0x; - int width = (tab_attr->index >> 16) & 0xff; - struct ib_mad *in_mad = NULL; - struct ib_mad *out_mad = NULL; + struct ib_mad *in_mad; + struct ib_mad *out_mad; size_t mad_size = sizeof(*out_mad); u16 out_mad_pkey_index = 0; ssize_t ret; - if (!p->ibdev->process_mad) - return sprintf(buf, "N/A (no PMA)\n"); + if (!dev->process_mad) + return -ENOSYS; in_mad = kzalloc(sizeof *in_mad, GFP_KERNEL); out_mad = kmalloc(sizeof *out_mad, GFP_KERNEL); @@ -344,12 +355,12 @@ static ssize_t show_pma_counter(struct i in_mad->mad_hdr.mgmt_class= IB_MGMT_CLASS_PERF_MGMT; in_mad->mad_hdr.class_version = 1; in_mad->mad_hdr.method= IB_MGMT_METHOD_GET; - in_mad->mad_hdr.attr_id = cpu_to_be16(0x12); /* PortCounters */ + in_mad->mad_hdr.attr_id = attr; - in_mad->data[41] = p->port_num; /* PortSelect field */ + in_mad->data[41] = port_num;/* PortSelect field */ - if ((p->ibdev->process_mad(p->ibdev, IB_MAD_IGNORE_MKEY, -p->port_num, NULL, NULL, + if ((dev->process_mad(dev, IB_MAD_IGNORE_MKEY, +port_num, NULL, NULL, (const struct ib_mad_hdr *)in_mad, mad_size, (struct ib_mad_hdr *)out_mad, &mad_size, &out_mad_pkey_index) & @@ -358,31 +369,54 @@ static ssize_t show_pma_counter(struct i ret = -EINVAL; goto out; } + memcpy(data, out_mad->data + offset, size); + ret = size; +out: + kfree(in_mad); + kfree(out_mad); + return ret; +} + +static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr, + char *buf) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + int offset = tab_attr->index & 0x; + int width = (tab_attr->index >> 16) & 0xff; + ssize_t ret; + u8 data[8]; + + ret = get_mad(p->ibdev, p->port_num, tab_attr->attr_id, &data, + 40 + offset / 8, sizeof(data)); + if (ret < 
0) + return sprintf(buf, "N/A (no PMA)\n"); switch (width) { case 4: - ret = sprintf(buf, "%u\n", (out_mad->data[40 + offset / 8] >> + ret = sprintf(buf, "%u\n", (*data >>
Re: [PATCH] svc_rdma: use local_dma_lkey
On Wed, Dec 16, 2015 at 04:11:04PM +0100, Christoph Hellwig wrote: > We now always have a per-PD local_dma_lkey available. Make use of that > fact in svc_rdma and stop registering our own MR. > > Signed-off-by: Christoph Hellwig Reviewed-by: Jason Gunthorpe > +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c > @@ -144,6 +144,7 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt, > > head->arg.pages[pg_no] = rqstp->rq_arg.pages[pg_no]; > head->arg.page_len += len; > + > head->arg.len += len; > if (!pg_off) > head->count++; Was this hunk deliberate? Jason
Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path
On Wed, Dec 16, 2015 at 11:26:39AM +0100, Michael Wang wrote: > > On 12/15/2015 06:30 PM, Jason Gunthorpe wrote: > > On Tue, Dec 15, 2015 at 05:38:34PM +0100, Michael Wang wrote: > >> The hop_limit is only suggest that the package allowed to be > >> routed, not have to, correct? > > > > If the hop limit is >= 2 (?) then the GRH is mandatory. The > > SM will return this information in the PathRecord if it determines a > > GRH is required. The whole stack follows this protocol. > > > > The GRH is optional for in-subnet communications. > > Thanks for the explain :-) > > I've rechecked the ib_init_ah_from_path() again, and found it > still set IB_AH_GRH when the GID cache missing, but with: How do you mean? ah_attr->ah_flags = IB_AH_GRH; ah_attr->grh.dgid = rec->dgid; ret = ib_find_cached_gid(device, &rec->sgid, ndev, &port_num, &gid_index); if (ret) { if (ndev) dev_put(ndev); return ret; } If find_cached_gid fails then ib_init_ah_from_path also fails. Is there a case where ib_find_cached_gid can succeed but not return good data? I agree it would read nicer if the ah_flags and gr.dgid was moved after the ib_find_cached_gid > BTW, cma_sidr_rep_handler() also call ib_init_ah_from_path() with out > a check on return. That sounds like a problem. Jason
Re: [PATCH for-next V2 00/11] Add RoCE v2 support
On Wed, Dec 16, 2015 at 08:56:01AM +0200, Moni Shoua wrote: > I can't object to that but I really would like to get an example of a > security risk. How can anyone give you an example when nobody knows exactly how mlx hardware works in this area? From a kapi perspective, the security design is very simple. Every single UD packet the kapi side has to process must be unambiguously associated with a gid_index or dropped. Period full stop. I would think that is an obvious conclusion based on the design of the gid cache. This is why we need a clear API to get this critical information. It should not be open coded in init_ah_from_wc, it should not be done some other way in the CMA code. This is a simple matter of sane kapi design, nothing more. I have no idea why this is so objectionable. > scattered to the receive bufs anyway. So, if there is a security hole > it exists from day one of the IB stack and this is not the time we > should insist on fixing it. IB isn't interacting with the net stack in the same way rocev2 is, so this is not a pre-existing problem. Jason
Re: [PATCH for-next V2 00/11] Add RoCE v2 support
On Wed, Dec 16, 2015 at 09:57:02AM +, Liran Liss wrote: > Currently, namespaces are not supported for RoCE. IMHO, we should not be accepting rocev2 without at least basic namespace support too, since it is fairly trivial to do based on the work that is already done for verbs. An obvious missing piece is the 'wc to gid index' API I keep asking for. > That said, we have everything we need for RoCE namespace support when we get > there. Then there is no problem with the 'wc to gid index' stuff, so stop complaining about it. > All of this has nothing to do with "broken" and enshrining anything in the > kapi. > That's just bullshit. No, it is a critique of the bad kAPI choices in this patch that mean it broadly doesn't use namespaces, net devices or IP routing correctly. > The design of the RDMA stack is that Verbs are used by core IB > services, such as addressing. For these services, as the > specification requires, all relevant fields must be reported in the > CQE, period. All spec-compliant HW devices follow this. Wrong, the kapi needs to meet the needs of the kernel, and is influenced but not set by the various standards. That means we get to make better choices in the kapi than exposing wc.network_type. > If a ULP wants to create an address handle from a completion, there > are service routines to accomplish that, based on the reported > fields. If it doesn't care, there is no reason to sacrifice > performance. I have no idea why you think there would be a performance sacrifice, maybe you should review the patches and my remarks again. Jason
Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set
On 12/15/2015 04:46 PM, Doug Ledford wrote: > On 12/15/2015 04:42 PM, Hal Rosenstock wrote: >> On 12/15/2015 4:20 PM, Jason Gunthorpe wrote: The unicast/multicast extended counters are not always supported - > depends on setting of PerfMgt ClassPortInfo > CapabilityMask.IsExtendedWidthSupportedNoIETF (bit 10). >> >>> Yes.. certainly this proposed patch needs to account for that and >>> continue to use the 32 bit ones in that case. >> >> There are no 32 bit equivalents of those 4 "IETF" counters ([uni >> multi]cast [xmit rcv] pkts). >> >> When not supported, perhaps it is best not to populate these counters in >> sysfs so one can discern between counter not supported and 0 value. >> >> I'm still working on definitive mthca answer but think the attribute is >> not supported there. Does anyone out there have an mthca setup where >> they can try this ? > > Yes. > > From my mthca machine: [root@rdma-dev-04 ~]$ lspci | grep Mellanox 01:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor compatibility mode) (rev 20) [root@rdma-dev-04 ~]$ perfquery # Port counters: Lid 36 port 1 (CapMask: 0x00) PortSelect:..1 CounterSelect:...0x SymbolErrorCounter:..0 LinkErrorRecoveryCounter:0 LinkDownedCounter:...0 PortRcvErrors:...0 PortRcvRemotePhysicalErrors:.0 PortRcvSwitchRelayErrors:0 PortXmitDiscards:0 PortXmitConstraintErrors:0 PortRcvConstraintErrors:.0 CounterSelect2:..0x00 LocalLinkIntegrityErrors:0 ExcessiveBufferOverrunErrors:0 VL15Dropped:.1 PortXmitData:2470620192 PortRcvData:.2401094563 PortXmitPkts:6363544 PortRcvPkts:.6321251 [root@rdma-dev-04 ~]$ perfquery -x ibwarn: [29831] dump_perfcounters: PerfMgt ClassPortInfo 0x0; No extended counter support indicated perfquery: iberror: failed: perfextquery So, no extended counters on this device. -- Doug Ledford GPG KeyID: 0E572FDD
Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set
On Tue, 15 Dec 2015, Jason Gunthorpe wrote: > > The unicast/multicast extended counters are not always supported - > > depends on setting of PerfMgt ClassPortInfo > > CapabilityMask.IsExtendedWidthSupportedNoIETF (bit 10). > > Yes.. certainly this proposed patch needs to account for that and > continue to use the 32 bit ones in that case. So this is the capability_mask in struct ib_class_port_info? This does not seem to be used anywhere in the IB core. Here is a draft patch to change the counters depending on a bit (which I do not know how to determine). So this would hopefully work if someone would insert the proper check. Note that this patch no longer needs the earlier 2 patches. From Christoph Lameter Subject: IB Core: Display extended counter set if available Check if the extended counters are available and if so create the proper extended and additional counters. DRAFT: This is missing the check if this device supports extended counters. Signed-off-by: Christoph Lameter Index: linux/drivers/infiniband/core/sysfs.c === --- linux.orig/drivers/infiniband/core/sysfs.c +++ linux/drivers/infiniband/core/sysfs.c @@ -39,6 +39,7 @@ #include #include +#include struct ib_port { struct kobject kobj; @@ -65,6 +66,7 @@ struct port_table_attribute { struct port_attribute attr; charname[8]; int index; + int attr_id; }; static ssize_t port_attr_show(struct kobject *kobj, @@ -314,7 +316,15 @@ static ssize_t show_port_pkey(struct ib_ #define PORT_PMA_ATTR(_name, _counter, _width, _offset) \ struct port_table_attribute port_pma_attr_##_name = { \ .attr = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),\ - .index = (_offset) | ((_width) << 16) | ((_counter) << 24) \ + .index = (_offset) | ((_width) << 16) | ((_counter) << 24), \ + .attr_id = IB_PMA_PORT_COUNTERS , \ +} + +#define PORT_PMA_ATTR_EXT(_name, _width, _offset) \ +struct port_table_attribute port_pma_attr_ext_##_name = { \ + .attr = __ATTR(_name, S_IRUGO, show_pma_counter, NULL),\ + .index = (_offset) | ((_width) << 
16), \ + .attr_id = IB_PMA_PORT_COUNTERS_EXT , \ } static ssize_t show_pma_counter(struct ib_port *p, struct port_attribute *attr, @@ -344,7 +354,7 @@ static ssize_t show_pma_counter(struct i in_mad->mad_hdr.mgmt_class= IB_MGMT_CLASS_PERF_MGMT; in_mad->mad_hdr.class_version = 1; in_mad->mad_hdr.method= IB_MGMT_METHOD_GET; - in_mad->mad_hdr.attr_id = cpu_to_be16(0x12); /* PortCounters */ + in_mad->mad_hdr.attr_id = tab_attr->attr_id; in_mad->data[41] = p->port_num; /* PortSelect field */ @@ -375,6 +385,11 @@ static ssize_t show_pma_counter(struct i ret = sprintf(buf, "%u\n", be32_to_cpup((__be32 *)(out_mad->data + 40 + offset / 8))); break; + case 64: + ret = sprintf(buf, "%llu\n", + be64_to_cpup((__be64 *)(out_mad->data + 40 + offset / 8))); + break; + default: ret = 0; } @@ -403,6 +418,18 @@ static PORT_PMA_ATTR(port_rcv_data static PORT_PMA_ATTR(port_xmit_packets , 14, 32, 256); static PORT_PMA_ATTR(port_rcv_packets , 15, 32, 288); +/* + * Counters added by extended set + */ +static PORT_PMA_ATTR_EXT(port_xmit_data, 64, 64); +static PORT_PMA_ATTR_EXT(port_rcv_data , 64, 128); +static PORT_PMA_ATTR_EXT(port_xmit_packets , 64, 192); +static PORT_PMA_ATTR_EXT(port_rcv_packets , 64, 256); +static PORT_PMA_ATTR_EXT(unicast_xmit_packets , 64, 320); +static PORT_PMA_ATTR_EXT(unicast_rcv_packets , 64, 384); +static PORT_PMA_ATTR_EXT(multicast_xmit_packets, 64, 448); +static PORT_PMA_ATTR_EXT(multicast_rcv_packets , 64, 512); + static struct attribute *pma_attrs[] = { &port_pma_attr_symbol_error.attr.attr, &port_pma_attr_link_error_recovery.attr.attr, @@ -423,11 +450,40 @@ static struct attribute *pma_attrs[] = { NULL }; +static struct attribute *pma_attrs_ext[] = { + &port_pma_attr_symbol_error.attr.attr, + &port_pma_attr_link_error_recovery.attr.attr, + &port_pma_attr_link_downed.attr.attr, + &port_pma_attr_port_rcv_errors.attr.attr, + &port_pma_attr_port_rcv_remote_physical_errors.attr.attr, + &port_pma_attr_port_rcv_switch_relay_errors.attr.attr, + 
&port_pma_attr_port_xmit_discards.attr.attr, + &port_pma_attr_port_xmit_constraint_errors.attr.attr, + &port_pma_attr_port_rcv_constraint_errors.attr.attr, + &port_pma_attr_loc
Re: [PATCH] svc_rdma: use local_dma_lkey
Looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH 3/3] IB core: Display 64 bit counters from the extended set
On Tue, 15 Dec 2015, Doug Ledford wrote: > On 12/15/2015 04:42 PM, Hal Rosenstock wrote: > > On 12/15/2015 4:20 PM, Jason Gunthorpe wrote: > >>> The unicast/multicast extended counters are not always supported - > depends on setting of PerfMgt ClassPortInfo > CapabilityMask.IsExtendedWidthSupportedNoIETF (bit 10). > > > >> Yes.. certainly this proposed patch needs to account for that and > >> continue to use the 32 bit ones in that case. > > > > There are no 32 bit equivalents of those 4 "IETF" counters ([uni > > multi]cast [xmit rcv] pkts). > > > > When not supported, perhaps it is best not to populate these counters in > > sysfs so one can discern between counter not supported and 0 value. > > > > I'm still working on definitive mthca answer but think the attribute is > > not supported there. Does anyone out there have an mthca setup where > > they can try this ? > > Yes. We can return ENOSYS for the counters not supported. Or simply not create the sysfs files when the device is instantiated as well as fall back to the 32 bit counters on instantiation for those devices not supporting the extended set.
Re: [PATCH] svc_rdma: use local_dma_lkey
> On Dec 16, 2015, at 10:11 AM, Christoph Hellwig wrote: > > We now alwasy have a per-PD local_dma_lkey available. Make use of that > fact in svc_rdma and stop registering our own MR. > > Signed-off-by: Christoph Hellwig Reviewed-by: Chuck Lever > --- > include/linux/sunrpc/svc_rdma.h| 2 -- > net/sunrpc/xprtrdma/svc_rdma_backchannel.c | 2 +- > net/sunrpc/xprtrdma/svc_rdma_recvfrom.c| 4 ++-- > net/sunrpc/xprtrdma/svc_rdma_sendto.c | 6 ++--- > net/sunrpc/xprtrdma/svc_rdma_transport.c | 36 -- > 5 files changed, 10 insertions(+), 40 deletions(-) > > diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h > index b13513a..5322fea 100644 > --- a/include/linux/sunrpc/svc_rdma.h > +++ b/include/linux/sunrpc/svc_rdma.h > @@ -156,13 +156,11 @@ struct svcxprt_rdma { > struct ib_qp *sc_qp; > struct ib_cq *sc_rq_cq; > struct ib_cq *sc_sq_cq; > - struct ib_mr *sc_phys_mr; /* MR for server memory */ > int (*sc_reader)(struct svcxprt_rdma *, > struct svc_rqst *, > struct svc_rdma_op_ctxt *, > int *, u32 *, u32, u32, u64, bool); > u32 sc_dev_caps; /* distilled device caps */ > - u32 sc_dma_lkey; /* local dma key */ > unsigned int sc_frmr_pg_list_len; > struct list_head sc_frmr_q; > spinlock_t sc_frmr_q_lock; > diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c > b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c > index 417cec1..c428734 100644 > --- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c > +++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c > @@ -128,7 +128,7 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma, > > ctxt->wr_op = IB_WR_SEND; > ctxt->direction = DMA_TO_DEVICE; > - ctxt->sge[0].lkey = rdma->sc_dma_lkey; > + ctxt->sge[0].lkey = rdma->sc_pd->local_dma_lkey; > ctxt->sge[0].length = sndbuf->len; > ctxt->sge[0].addr = > ib_dma_map_page(rdma->sc_cm_id->device, ctxt->pages[0], 0, > diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c > b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c > index 3dfe464..c8b8a8b 100644 > --- 
Re: [PATCH v3 04/11] xprtrdma: Move struct ib_send_wr off the stack
On Wed, Dec 16, 2015 at 10:13:31AM -0500, Chuck Lever wrote: > > Shouldn't be an issue with transparent unions these days: > > > > union { > > struct ib_reg_wr fr_regwr; > > struct ib_send_wr fr_invwr; > > }; > > Right, but isn't that a gcc-ism that Al hates? If > everyone is OK with that construction, I will use it. It started out as a GNUism, but is now supported in C11. We use it a lot all over the kernel. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
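[Editor's sketch] The construct under discussion is a C11 anonymous union inside struct rpcrdma_frmr. A compilable illustration with stand-in types follows; the field layouts below are invented for the example and are not the real ib_send_wr/ib_reg_wr definitions from <rdma/ib_verbs.h>.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for struct ib_send_wr and struct ib_reg_wr
 * (the real ib_reg_wr embeds an ib_send_wr plus registration fields).
 * Sizes here are illustrative only. */
struct fake_send_wr { unsigned long wr_id; int opcode; };
struct fake_reg_wr  { struct fake_send_wr wr; void *mr; unsigned key; unsigned access; };

/* The C11 anonymous union being suggested: the two work requests can
 * share storage because one FRMR never has a FASTREG WR and an
 * INVALIDATE WR in flight at the same time, and members are still
 * accessed directly as frmr->fr_regwr / frmr->fr_invwr. */
struct fake_frmr {
	int fr_state;
	union {
		struct fake_reg_wr  fr_regwr;
		struct fake_send_wr fr_invwr;
	};
};
```

The union members overlay each other, so the struct shrinks versus keeping both work requests side by side.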
[PATCH] svc_rdma: use local_dma_lkey
We now always have a per-PD local_dma_lkey available. Make use of that fact in svc_rdma and stop registering our own MR. Signed-off-by: Christoph Hellwig --- include/linux/sunrpc/svc_rdma.h| 2 -- net/sunrpc/xprtrdma/svc_rdma_backchannel.c | 2 +- net/sunrpc/xprtrdma/svc_rdma_recvfrom.c| 4 ++-- net/sunrpc/xprtrdma/svc_rdma_sendto.c | 6 ++--- net/sunrpc/xprtrdma/svc_rdma_transport.c | 36 -- 5 files changed, 10 insertions(+), 40 deletions(-) diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h index b13513a..5322fea 100644 --- a/include/linux/sunrpc/svc_rdma.h +++ b/include/linux/sunrpc/svc_rdma.h @@ -156,13 +156,11 @@ struct svcxprt_rdma { struct ib_qp *sc_qp; struct ib_cq *sc_rq_cq; struct ib_cq *sc_sq_cq; - struct ib_mr *sc_phys_mr; /* MR for server memory */ int (*sc_reader)(struct svcxprt_rdma *, struct svc_rqst *, struct svc_rdma_op_ctxt *, int *, u32 *, u32, u32, u64, bool); u32 sc_dev_caps; /* distilled device caps */ - u32 sc_dma_lkey; /* local dma key */ unsigned int sc_frmr_pg_list_len; struct list_head sc_frmr_q; spinlock_t sc_frmr_q_lock; diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c index 417cec1..c428734 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c +++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c @@ -128,7 +128,7 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma, ctxt->wr_op = IB_WR_SEND; ctxt->direction = DMA_TO_DEVICE; - ctxt->sge[0].lkey = rdma->sc_dma_lkey; + ctxt->sge[0].lkey = rdma->sc_pd->local_dma_lkey; ctxt->sge[0].length = sndbuf->len; ctxt->sge[0].addr = ib_dma_map_page(rdma->sc_cm_id->device, ctxt->pages[0], 0, diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c index 3dfe464..c8b8a8b 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c @@ -144,6 +144,7 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt, head->arg.pages[pg_no] =
rqstp->rq_arg.pages[pg_no]; head->arg.page_len += len; + head->arg.len += len; if (!pg_off) head->count++; @@ -160,8 +161,7 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt, goto err; atomic_inc(&xprt->sc_dma_used); - /* The lkey here is either a local dma lkey or a dma_mr lkey */ - ctxt->sge[pno].lkey = xprt->sc_dma_lkey; + ctxt->sge[pno].lkey = xprt->sc_pd->local_dma_lkey; ctxt->sge[pno].length = len; ctxt->count++; diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c index ced3151..20bd5d4 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c @@ -265,7 +265,7 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp, sge[sge_no].addr)) goto err; atomic_inc(&xprt->sc_dma_used); - sge[sge_no].lkey = xprt->sc_dma_lkey; + sge[sge_no].lkey = xprt->sc_pd->local_dma_lkey; ctxt->count++; sge_off = 0; sge_no++; @@ -487,7 +487,7 @@ static int send_reply(struct svcxprt_rdma *rdma, ctxt->count = 1; /* Prepare the SGE for the RPCRDMA Header */ - ctxt->sge[0].lkey = rdma->sc_dma_lkey; + ctxt->sge[0].lkey = rdma->sc_pd->local_dma_lkey; ctxt->sge[0].length = svc_rdma_xdr_get_reply_hdr_len(rdma_resp); ctxt->sge[0].addr = ib_dma_map_page(rdma->sc_cm_id->device, page, 0, @@ -511,7 +511,7 @@ static int send_reply(struct svcxprt_rdma *rdma, ctxt->sge[sge_no].addr)) goto err; atomic_inc(&rdma->sc_dma_used); - ctxt->sge[sge_no].lkey = rdma->sc_dma_lkey; + ctxt->sge[sge_no].lkey = rdma->sc_pd->local_dma_lkey; ctxt->sge[sge_no].length = sge_bytes; } if (byte_count != 0) { diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c index abfbd02..faf4c49 100644 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c @@ -232,11 +232,11 @@ void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt) for (i = 0; i < ctxt->count && ctxt->sge[i].length; i++) { /* * Unmap the DMA addr in the SGE
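[Editor's sketch] The conversion pattern is identical at every call site in the patch above: the SGE's lkey now comes from the PD instead of transport-private state (the cached sc_dma_lkey or a registered DMA MR). A user-space illustration with stand-in types follows; the real struct ib_pd and struct ib_sge live in <rdma/ib_verbs.h>.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-ins for the kernel structures touched by this patch. */
struct fake_pd  { uint32_t local_dma_lkey; };
struct fake_sge { uint64_t addr; uint32_t length; uint32_t lkey; };

/* After the conversion: every DMA-mapped SGE is keyed with the PD's
 * always-present local_dma_lkey, so the transport no longer needs to
 * cache its own copy or register a phys MR for server memory. */
static void fake_fill_sge(struct fake_sge *sge, const struct fake_pd *pd,
			  uint64_t dma_addr, uint32_t len)
{
	sge->addr   = dma_addr;
	sge->length = len;
	sge->lkey   = pd->local_dma_lkey;
}
```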
small svc_rdma cleanup
This makes use of the now always available local_dma_lkey, and goes on top of Chuck's "[PATCH v4 00/11] NFS/RDMA server patches for v4.5" series.
Re: [PATCH v3 04/11] xprtrdma: Move struct ib_send_wr off the stack
> On Dec 16, 2015, at 10:11 AM, Christoph Hellwig wrote: > > On Wed, Dec 16, 2015 at 10:06:33AM -0500, Chuck Lever wrote: >>> Would it make sense to unionize these as they are guaranteed not to >>> execute together? Some people don't like this sort of savings. >> >> I dislike unions because they make the code that uses >> them less readable. I can define macros to help that, >> but sigh! OK. > > Shouldn't be an issue with transparent unions these days: > > union { > struct ib_reg_wr fr_regwr; > struct ib_send_wr fr_invwr; > }; Right, but isn't that a gcc-ism that Al hates? If everyone is OK with that construction, I will use it. -- Chuck Lever
Re: [PATCH v3 09/11] SUNRPC: Introduce xprt_commit_rqst()
Hi Anna- > On Dec 16, 2015, at 8:48 AM, Anna Schumaker wrote: > > Hi Chuck, > > Sorry for the last minute comment. > > On 12/14/2015 04:19 PM, Chuck Lever wrote: >> I'm about to add code in the RPC/RDMA reply handler between the >> xprt_lookup_rqst() and xprt_complete_rqst() call site that needs >> to execute outside of spinlock critical sections. >> >> Add a hook to remove an rpc_rqst from the pending list once >> the transport knows its going to invoke xprt_complete_rqst(). >> >> Signed-off-by: Chuck Lever >> --- >> include/linux/sunrpc/xprt.h|1 + >> net/sunrpc/xprt.c | 14 ++ >> net/sunrpc/xprtrdma/rpc_rdma.c |4 >> 3 files changed, 19 insertions(+) >> >> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h >> index 69ef5b3..ab6c3a5 100644 >> --- a/include/linux/sunrpc/xprt.h >> +++ b/include/linux/sunrpc/xprt.h >> @@ -366,6 +366,7 @@ void >> xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action); >> void xprt_write_space(struct rpc_xprt *xprt); >> void xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task >> *task, int result); >> struct rpc_rqst *xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid); >> +voidxprt_commit_rqst(struct rpc_task *task); >> void xprt_complete_rqst(struct rpc_task *task, int copied); >> void xprt_release_rqst_cong(struct rpc_task *task); >> void xprt_disconnect_done(struct rpc_xprt *xprt); >> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c >> index 2e98f4a..a5be4ab 100644 >> --- a/net/sunrpc/xprt.c >> +++ b/net/sunrpc/xprt.c >> @@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task) >> } >> >> /** >> + * xprt_commit_rqst - remove rqst from pending list early >> + * @task: RPC request to remove >> + * >> + * Caller holds transport lock. 
>> + */ >> +void xprt_commit_rqst(struct rpc_task *task) >> +{ >> +struct rpc_rqst *req = task->tk_rqstp; >> + >> +list_del_init(&req->rq_list); >> +} >> +EXPORT_SYMBOL_GPL(xprt_commit_rqst); > > Can you move this function into the xprtrdma code, since it's not called > outside of there? I think that's a layering violation, and the idea is to allow other transports to use this API eventually. But I'll include this change in the next version of the series. > Thanks, > Anna > >> + >> +/** >> * xprt_complete_rqst - called when reply processing is complete >> * @task: RPC request that recently completed >> * @copied: actual number of bytes received from the transport >> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c >> index c10d969..0bc8c39 100644 >> --- a/net/sunrpc/xprtrdma/rpc_rdma.c >> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c >> @@ -804,6 +804,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep) >> if (req->rl_reply) >> goto out_duplicate; >> >> +xprt_commit_rqst(rqst->rq_task); >> +spin_unlock_bh(&xprt->transport_lock); >> + >> dprintk("RPC: %s: reply 0x%p completes request 0x%p\n" >> " RPC request 0x%p xid 0x%08x\n", >> __func__, rep, req, rqst, >> @@ -894,6 +897,7 @@ badheader: >> else if (credits > r_xprt->rx_buf.rb_max_requests) >> credits = r_xprt->rx_buf.rb_max_requests; >> >> +spin_lock_bh(&xprt->transport_lock); >> cwnd = xprt->cwnd; >> xprt->cwnd = credits << RPC_CWNDSHIFT; >> if (xprt->cwnd > cwnd) >> > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Chuck Lever -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
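[Editor's sketch] The body of xprt_commit_rqst() is just a list_del_init() under the transport lock. A minimal user-space re-implementation of the list primitives involved (modeled loosely on the kernel's <linux/list.h>, not the real header) shows why list_del_init() is the right call here: the removed rqst remains a valid empty list head, so later list_empty() checks or repeated deletions stay safe.

```c
#include <assert.h>

/* Doubly-linked circular list, kernel style: an empty head points at
 * itself. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h->prev = h;
}

/* Insert n right after head. */
static void list_add(struct list_head *n, struct list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static int list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Unlink n and re-point it at itself, so the entry can be tested or
 * deleted again without corrupting the list. */
static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	INIT_LIST_HEAD(n);
}
```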
Re: [PATCH v3 04/11] xprtrdma: Move struct ib_send_wr off the stack
On Wed, Dec 16, 2015 at 10:06:33AM -0500, Chuck Lever wrote: > > Would it make sense to unionize these as they are guaranteed not to > > execute together? Some people don't like this sort of savings. > > I dislike unions because they make the code that uses > them less readable. I can define macros to help that, > but sigh! OK. Shouldn't be an issue with transparent unions these days: union { struct ib_reg_wr fr_regwr; struct ib_send_wr fr_invwr; };
Re: [PATCH v3 06/11] xprtrdma: Add ro_unmap_sync method for FRWR
> On Dec 16, 2015, at 8:57 AM, Sagi Grimberg wrote: > > >> +static void >> +__frwr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg, >> + int rc) >> +{ >> +struct ib_device *device = r_xprt->rx_ia.ri_device; >> +struct rpcrdma_mw *mw = seg->rl_mw; >> +int nsegs = seg->mr_nsegs; >> + >> +seg->rl_mw = NULL; >> + >> +while (nsegs--) >> +rpcrdma_unmap_one(device, seg++); > > Chuck, shouldn't this be replaced with ib_dma_unmap_sg? Looks like this was left over from before the conversion to use ib_dma_unmap_sg. I'll have a look. > Sorry for the late comment (Didn't find enough time to properly > review this...) -- Chuck Lever
Re: [PATCH v3 04/11] xprtrdma: Move struct ib_send_wr off the stack
> On Dec 16, 2015, at 9:00 AM, Sagi Grimberg wrote: > > >> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h >> b/net/sunrpc/xprtrdma/xprt_rdma.h >> index 4197191..e60d817 100644 >> --- a/net/sunrpc/xprtrdma/xprt_rdma.h >> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h >> @@ -206,6 +206,8 @@ struct rpcrdma_frmr { >> enum rpcrdma_frmr_state fr_state; >> struct work_struct fr_work; >> struct rpcrdma_xprt *fr_xprt; >> +struct ib_reg_wr fr_regwr; >> +struct ib_send_wr fr_invwr; > > Would it make sense to unionize these as they are guaranteed not to > execute together? Some people don't like this sort of savings. I dislike unions because they make the code that uses them less readable. I can define macros to help that, but sigh! OK. -- Chuck Lever
Re: [PATCH for-next V2 5/5] IB/mlx5: Mmap the HCA's core clock register to user-space
enum mlx5_ib_mmap_cmd { MLX5_IB_MMAP_REGULAR_PAGE = 0, - MLX5_IB_MMAP_GET_CONTIGUOUS_PAGES = 1, /* always last */ + MLX5_IB_MMAP_GET_CONTIGUOUS_PAGES = 1, + /* 5 is chosen in order to be compatible with old versions of libmlx5 */ + MLX5_IB_MMAP_CORE_CLOCK = 5, }; Overall the patches look good so I'd suggest not to apply atop of the contig pages patchset from Yishai which obviously involves some debate. Although if this bit is the only conflict then perhaps doug can take care of it...
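[Editor's sketch] The ABI point being made in the hunk above can be shown with a stand-in enum (not the real mlx5 uAPI header): explicit initializers pin each mmap command number, so the value that old libmlx5 binaries already hard-code (5) stays stable and 2..4 remain reserved rather than being silently reassigned.

```c
#include <assert.h>

/* Stand-in for the uAPI enum under discussion.  Explicit values keep
 * the kernel's command numbers in lockstep with legacy userspace. */
enum fake_mlx5_ib_mmap_cmd {
	FAKE_MLX5_IB_MMAP_REGULAR_PAGE         = 0,
	FAKE_MLX5_IB_MMAP_GET_CONTIGUOUS_PAGES = 1,
	/* 2..4: reserved for values old libmlx5 may already use */
	FAKE_MLX5_IB_MMAP_CORE_CLOCK           = 5,
};
```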
Re: [PATCH for-next V2 3/5] IB/mlx5: Add support for hca_core_clock and timestamp_mask
On Wed, Dec 16, 2015 at 4:43 PM, Sagi Grimberg wrote: > >> Reporting the hca_core_clock (in kHZ) and the timestamp_mask in >> query_device extended verb. timestamp_mask is used by users in order >> to know what is the valid range of the raw timestamps, while >> hca_core_clock reports the clock frequency that is used for >> timestamps. > > > Hi Matan, > > Shouldn't this patch come last? > Not necessarily. In order to support completion timestamping (that's what defined in this query_device patch), we only need create_cq_ex in mlx5_ib. The down stream patches adds support for reading the HCA core clock (via query_values). One could have completion timestamping support without having ibv_query_values support. Thanks for taking a look. > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next V2 3/5] IB/mlx5: Add support for hca_core_clock and timestamp_mask
Reporting the hca_core_clock (in kHZ) and the timestamp_mask in query_device extended verb. timestamp_mask is used by users in order to know what is the valid range of the raw timestamps, while hca_core_clock reports the clock frequency that is used for timestamps. Hi Matan, Shouldn't this patch come last?
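[Editor's sketch] One thing the two reported fields let a consumer derive is how long raw timestamps run before wrapping. The helper below is hypothetical, not part of the patchset; it just combines timestamp_mask and hca_core_clock as described above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space helper: given the clock frequency reported
 * in hca_core_clock (kHz) and the valid-bits mask reported in
 * timestamp_mask, compute how many whole seconds elapse before a raw
 * timestamp wraps.  Assumes mask + 1 does not overflow u64. */
static uint64_t fake_timestamp_wrap_seconds(uint64_t hca_core_clock_khz,
					    uint64_t timestamp_mask)
{
	uint64_t ticks_per_sec = hca_core_clock_khz * 1000;

	return (timestamp_mask + 1) / ticks_per_sec;
}
```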
Re: [PATCH 37/37] IB/rdmavt: Add support for new memory registration API
This patch exists to provide parity for what is in qib. Should we not have it? If not, why do we have: commit 38071a461f0a ("IB/qib: Support the new memory registration API") That was done by me because I saw this in qib and assumed that it was supported. Now that I found out that it isn't, I'd say it should be removed altogether shouldn't it? That doesn't mean it can't be added to rdmavt as a future enhancement though if there is a need. Well, given that we're trying to consolidate on post send registration interface it's kind of a must I'd say. Are you asking because soft-roce will need it? I was asking in general, but in specific soft-roce as a consumer will need to support that yes. I think it makes sense to revisit when soft-roce comes in, I agree. since qib/hfi do not need IB_WR_LOCAL_INV. Can you explain? Does qib/hfi have a magic way to invalidate memory regions? -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH for-next V2 2/5] IB/core: Add ib_is_udata_cleared
On 15/12/2015 20:30, Matan Barak wrote: > Extending core and vendor verb commands require us to check that the > unknown part of the user's given command is all zeros. > Adding ib_is_udata_cleared in order to do so. > > Signed-off-by: Matan Barak Reviewed-by: Haggai Eran
RE: [PATCH for-next V2 00/11] Add RoCE v2 support
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma- > ow...@vger.kernel.org] On Behalf Of Doug Ledford > In particular, Liran piped up with this comment: > > "Also, I don't want to do any route resolution on the Rx path. A UD QP > completion just reports the details of the packet it received. > > Conceptually, an incoming packet may not even match an SGID index at all. > Maybe, responses should be sent from another device. This should not be > decided that the point that a packet was received." > > The part that bothers me about this is that this statement makes sense when > just thinking about the spec, as you say. However, once you consider > namespaces, security implications make this statement spec compliant, but > still unacceptable. The spec itself is silent on namespaces. But, you guys > wanted, and you got, namespace support. > Since that's beyond spec, and carries security requirements, I think it's > fair to > say that from now on, the Linux kernel RDMA stack can no longer *just* be > spec compliant. There are additional concerns that must always be > addressed with new changes, and those are the namespace constraint > preservation concerns. > Hi Doug, Currently, there is no namespace support for RoCE, so the RoCEv2 patches have *nothing* to do with this. That said, the RoCE specification does not contradict or inhibit any future implementation for namespaces. The CMA will get the from ib_wc and resolve to a netdev (or sgid_index->netdev, whatever) and process the request accordingly. We can have endless theoretical discussions on features that are not even implemented yet (e.g., RoCE namespace support) each time we add a minor straightforward, *spec-compliant* change that *all* RoCE vendors adhere to. If someone wishes to introduce a new concept, API refactoring proposal, or similar for community review, please do so with a different RFC. This is hindering progress of the whole RDMA stack development! 
For example, the posted SoftRoCE patches are waiting just for this. The RoCEv2 patches have been posted upstream for review for months (!) now. I simply cannot understand why this is lagging for so long; let's start to get the wheels rolling. --Liran
Re: [PATCH 37/37] IB/rdmavt: Add support for new memory registration API
On Wed, Dec 16, 2015 at 03:21:02PM +0200, Sagi Grimberg wrote: This question is not directly related to this patch, but given that this is a copy-paste from the qib driver I'll go ahead and take it anyway. How does qib (and rvt now) do memory key invalidation? I didn't see any reference to IB_WR_LOCAL_INV anywhere in the qib driver... What am I missing? ping? In short, it doesn't look like qib or hfi1 support this. Oh, I'm surprised to learn that. At least I see that qib is not exposing IB_DEVICE_MEM_MGT_EXTENSIONS. But whats the point in doing something with a IB_WR_REG_MR at all? Given that this is not supported anyway, why does this patch exist? This patch exists to provide parity for what is in qib. Should we not have it? If not, why do we have: commit 38071a461f0a ("IB/qib: Support the new memory registration API") That doesn't mean it can't be added to rdmavt as a future enhancement though if there is a need. Well, given that we're trying to consolidate on post send registration interface it's kind of a must I'd say. Are you asking because soft-roce will need it? I was asking in general, but in specific soft-roce as a consumer will need to support that yes. I think it makes sense to revisit when soft-roce comes in, since qib/hfi do not need IB_WR_LOCAL_INV. -Denny -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: device attr cleanup
On 12/16/2015 3:40 PM, Sagi Grimberg wrote: I really don't have a strong preference on either of the approaches. I just want to see this included one way or the other. sure, agree, I will send my patches tomorrow
Re: [PATCH v3 04/11] xprtrdma: Move struct ib_send_wr off the stack
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h index 4197191..e60d817 100644 --- a/net/sunrpc/xprtrdma/xprt_rdma.h +++ b/net/sunrpc/xprtrdma/xprt_rdma.h @@ -206,6 +206,8 @@ struct rpcrdma_frmr { enum rpcrdma_frmr_state fr_state; struct work_struct fr_work; struct rpcrdma_xprt *fr_xprt; + struct ib_reg_wr fr_regwr; + struct ib_send_wr fr_invwr; Would it make sense to unionize these as they are guaranteed not to execute together? Some people don't like this sort of savings.
Re: [PATCH v3 06/11] xprtrdma: Add ro_unmap_sync method for FRWR
+static void +__frwr_dma_unmap(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg, +int rc) +{ + struct ib_device *device = r_xprt->rx_ia.ri_device; + struct rpcrdma_mw *mw = seg->rl_mw; + int nsegs = seg->mr_nsegs; + + seg->rl_mw = NULL; + + while (nsegs--) + rpcrdma_unmap_one(device, seg++); Chuck, shouldn't this be replaced with ib_dma_unmap_sg? Sorry for the late comment (Didn't find enough time to properly review this...)
Re: [PATCH v3 09/11] SUNRPC: Introduce xprt_commit_rqst()
Hi Chuck, Sorry for the last minute comment. On 12/14/2015 04:19 PM, Chuck Lever wrote: > I'm about to add code in the RPC/RDMA reply handler between the > xprt_lookup_rqst() and xprt_complete_rqst() call site that needs > to execute outside of spinlock critical sections. > > Add a hook to remove an rpc_rqst from the pending list once > the transport knows its going to invoke xprt_complete_rqst(). > > Signed-off-by: Chuck Lever > --- > include/linux/sunrpc/xprt.h|1 + > net/sunrpc/xprt.c | 14 ++ > net/sunrpc/xprtrdma/rpc_rdma.c |4 > 3 files changed, 19 insertions(+) > > diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h > index 69ef5b3..ab6c3a5 100644 > --- a/include/linux/sunrpc/xprt.h > +++ b/include/linux/sunrpc/xprt.h > @@ -366,6 +366,7 @@ void > xprt_wait_for_buffer_space(struct rpc_task *task, rpc_action action); > void xprt_write_space(struct rpc_xprt *xprt); > void xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task > *task, int result); > struct rpc_rqst *xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid); > +void xprt_commit_rqst(struct rpc_task *task); > void xprt_complete_rqst(struct rpc_task *task, int copied); > void xprt_release_rqst_cong(struct rpc_task *task); > void xprt_disconnect_done(struct rpc_xprt *xprt); > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c > index 2e98f4a..a5be4ab 100644 > --- a/net/sunrpc/xprt.c > +++ b/net/sunrpc/xprt.c > @@ -837,6 +837,20 @@ static void xprt_update_rtt(struct rpc_task *task) > } > > /** > + * xprt_commit_rqst - remove rqst from pending list early > + * @task: RPC request to remove > + * > + * Caller holds transport lock. > + */ > +void xprt_commit_rqst(struct rpc_task *task) > +{ > + struct rpc_rqst *req = task->tk_rqstp; > + > + list_del_init(&req->rq_list); > +} > +EXPORT_SYMBOL_GPL(xprt_commit_rqst); Can you move this function into the xprtrdma code, since it's not called outside of there? 
Thanks, Anna > + > +/** > * xprt_complete_rqst - called when reply processing is complete > * @task: RPC request that recently completed > * @copied: actual number of bytes received from the transport > diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c > index c10d969..0bc8c39 100644 > --- a/net/sunrpc/xprtrdma/rpc_rdma.c > +++ b/net/sunrpc/xprtrdma/rpc_rdma.c > @@ -804,6 +804,9 @@ rpcrdma_reply_handler(struct rpcrdma_rep *rep) > if (req->rl_reply) > goto out_duplicate; > > + xprt_commit_rqst(rqst->rq_task); > + spin_unlock_bh(&xprt->transport_lock); > + > dprintk("RPC: %s: reply 0x%p completes request 0x%p\n" > " RPC request 0x%p xid 0x%08x\n", > __func__, rep, req, rqst, > @@ -894,6 +897,7 @@ badheader: > else if (credits > r_xprt->rx_buf.rb_max_requests) > credits = r_xprt->rx_buf.rb_max_requests; > > + spin_lock_bh(&xprt->transport_lock); > cwnd = xprt->cwnd; > xprt->cwnd = credits << RPC_CWNDSHIFT; > if (xprt->cwnd > cwnd) > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: device attr cleanup
Hi Doug, Let's stop beating, both horses and people. I do understand that 1. you don't like the removal of the attr 2. you do like the removal of all the query calls I am proposing to take the path of a patch that does exactly #2 while avoiding #1. I really don't have a strong preference on either of the approaches. I just want to see this included one way or the other.
Re: [PATCH 37/37] IB/rdmavt: Add support for new memory registration API
This question is not directly related to this patch, but given that this is a copy-paste from the qib driver I'll go ahead and take it anyway. How does qib (and rvt now) do memory key invalidation? I didn't see any reference to IB_WR_LOCAL_INV anywhere in the qib driver... What am I missing? ping? In short, it doesn't look like qib or hfi1 support this. Oh, I'm surprised to learn that. At least I see that qib is not exposing IB_DEVICE_MEM_MGT_EXTENSIONS. But whats the point in doing something with a IB_WR_REG_MR at all? Given that this is not supported anyway, why does this patch exist? That doesn't mean it can't be added to rdmavt as a future enhancement though if there is a need. Well, given that we're trying to consolidate on post send registration interface it's kind of a must I'd say. Are you asking because soft-roce will need it? I was asking in general, but in specific soft-roce as a consumer will need to support that yes. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 00/11] NFS/RDMA client patches for 4.5
Hi Chuck, iozone passed on ocrdma device. Link bounce fails to recover iozone traffic, however failure is not related to this patch series. I am in processes of finding out the patch which broke it. Tested-By: Devesh Sharma On Wed, Dec 16, 2015 at 1:07 AM, Anna Schumaker wrote: > Thanks, Chuck! > > Everything looks okay to me, so I'll apply these patches and send them to > Trond before the holiday. > > On 12/14/2015 04:17 PM, Chuck Lever wrote: >> For 4.5, I'd like to address the send queue accounting and >> invalidation/unmap ordering issues Jason brought up a couple of >> months ago. >> >> In preparation for Doug's final topic branch, Anna, I've rebased >> these on Christoph's ib_device_attr branch, but there were no merge >> conflicts or other changes needed. Could you begin preparing these >> for linux-next and other final testing and review? > > No merge conflicts is nice, and we might not need to worry about ordering the > pull request. > > Thanks, > Anna > >> >> Also available in the "nfs-rdma-for-4.5" topic branch of this git repo: >> >> git://git.linux-nfs.org/projects/cel/cel-2.6.git >> >> Or for browsing: >> >> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfs-rdma-for-4.5 >> >> >> Changes since v2: >> - Rebased on Christoph's ib_device_attr branch >> >> >> Changes since v1: >> >> - Rebased on v4.4-rc3 >> - Receive buffer safety margin patch dropped >> - Backchannel pr_err and pr_info converted to dprintk >> - Backchannel spin locks converted to work queue-safe locks >> - Fixed premature release of backchannel request buffer >> - NFSv4.1 callbacks tested with for-4.5 server >> >> --- >> >> Chuck Lever (11): >> xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock) >> xprtrdma: xprt_rdma_free() must not release backchannel reqs >> xprtrdma: Disable RPC/RDMA backchannel debugging messages >> xprtrdma: Move struct ib_send_wr off the stack >> xprtrdma: Introduce ro_unmap_sync method >> xprtrdma: Add ro_unmap_sync method for FRWR >> 
xprtrdma: Add ro_unmap_sync method for FMR >> xprtrdma: Add ro_unmap_sync method for all-physical registration >> SUNRPC: Introduce xprt_commit_rqst() >> xprtrdma: Invalidate in the RPC reply handler >> xprtrdma: Revert commit e7104a2a9606 ('xprtrdma: Cap req_cqinit'). >> >> >> include/linux/sunrpc/xprt.h|1 >> net/sunrpc/xprt.c | 14 +++ >> net/sunrpc/xprtrdma/backchannel.c | 22 ++--- >> net/sunrpc/xprtrdma/fmr_ops.c | 64 + >> net/sunrpc/xprtrdma/frwr_ops.c | 175 >> +++- >> net/sunrpc/xprtrdma/physical_ops.c | 13 +++ >> net/sunrpc/xprtrdma/rpc_rdma.c | 14 +++ >> net/sunrpc/xprtrdma/transport.c|3 + >> net/sunrpc/xprtrdma/verbs.c| 13 +-- >> net/sunrpc/xprtrdma/xprt_rdma.h| 12 +- >> 10 files changed, 283 insertions(+), 48 deletions(-) >> >> -- >> Chuck Lever >> > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 00/11] NFS/RDMA server patches for v4.5
iozone passed on an ocrdma device. Link bounce fails to recover iozone traffic;
however, that failure is not related to this patch series. I am in the process
of finding the patch that broke it.

Tested-By: Devesh Sharma

On Tue, Dec 15, 2015 at 3:00 AM, Chuck Lever wrote:
> Here are patches to support server-side bi-directional RPC/RDMA
> operation (to enable NFSv4.1 on RPC/RDMA transports). Thanks to
> all who reviewed v1, v2, and v3. This version has some significant
> changes since the previous one.
>
> In preparation for Doug's final topic branch, Bruce, I've rebased
> these on Christoph's ib_device_attr branch. There were some merge
> conflicts which I've fixed and tested. These are ready for your
> review.
>
> Also available in the "nfsd-rdma-for-4.5" topic branch of this git repo:
>
> git://git.linux-nfs.org/projects/cel/cel-2.6.git
>
> Or for browsing:
>
> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-for-4.5
>
> Changes since v3:
> - Rebased on Christoph's ib_device_attr branch
> - Backchannel patches have been squashed together
> - Memory allocation overhaul to prevent blocking allocation
>   when sending backchannel calls
>
> Changes since v2:
> - Rebased on v4.4-rc4
> - Backchannel code in new source file to address dprintk issues
> - svc_rdma_get_context() now uses a pre-allocated cache
> - Dropped svc_rdma_send clean up
>
> Changes since v1:
> - Rebased on v4.4-rc3
> - Removed the use of CONFIG_SUNRPC_BACKCHANNEL
> - Fixed computation of forward and backward max_requests
> - Updated some comments and patch descriptions
> - pr_err and pr_info converted to dprintk
> - Simplified svc_rdma_get_context()
> - Dropped patch removing access_flags field
> - NFSv4.1 callbacks tested with for-4.5 client
>
> ---
>
> Chuck Lever (11):
>       svcrdma: Do not send XDR roundup bytes for a write chunk
>       svcrdma: Clean up rdma_create_xprt()
>       svcrdma: Clean up process_context()
>       svcrdma: Improve allocation of struct svc_rdma_op_ctxt
>       svcrdma: Improve allocation of struct svc_rdma_req_map
>       svcrdma: Remove unused req_map and ctxt kmem_caches
>       svcrdma: Add gfp flags to svc_rdma_post_recv()
>       svcrdma: Remove last two __GFP_NOFAIL call sites
>       svcrdma: Make map_xdr non-static
>       svcrdma: Define maximum number of backchannel requests
>       svcrdma: Add class for RDMA backwards direction transport
>
>  include/linux/sunrpc/svc_rdma.h            |  37 ++-
>  net/sunrpc/xprt.c                          |   1
>  net/sunrpc/xprtrdma/Makefile               |   2
>  net/sunrpc/xprtrdma/svc_rdma.c             |  41 ---
>  net/sunrpc/xprtrdma/svc_rdma_backchannel.c | 371
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c    |  52
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c      |  34 ++-
>  net/sunrpc/xprtrdma/svc_rdma_transport.c   | 284 -
>  net/sunrpc/xprtrdma/transport.c            |  30 +-
>  net/sunrpc/xprtrdma/xprt_rdma.h            |  20 +-
>  10 files changed, 730 insertions(+), 142 deletions(-)
>  create mode 100644 net/sunrpc/xprtrdma/svc_rdma_backchannel.c
>
> --
> Signature

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH for-next V2 00/11] Add RoCE v2 support
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
>
> > Since you and Jason did not reach a consensus, I have to dig in and
> > see if these patches make it possible to break namespace confinement,
> > either accidentally or with intentionally tricky behavior. That's
> > going to take me some time.
>
> Everything to do with parsing a wc and constructing an AH is wrong in this
> series, and the fixes require the API changes I mentioned (add a 'wc to gid
> index' API call, add a 'route to AH' API call).
>
> Every time you read 'route validation' - that is an error; the route should
> never just be validated, it is needed information to construct a RoCEv2 AH.
> All the places that roughly hand-parse the RoCEv2 WC should not be open
> coded.
>
> Even if current HW is broken for namespaces we should not enshrine that in
> the kapi.

Currently, namespaces are not supported for RoCE, so for these patches this
is irrelevant. That said, we have everything we need for RoCE namespace
support when we get there.

All of this has nothing to do with "broken" and enshrining anything in the
kapi. That's just bullshit. The crux of the discussion is the meaning of the
API. The design of the RDMA stack is that Verbs are used by core IB services,
such as addressing. For these services, as the specification requires, all
relevant fields must be reported in the CQE, period. All spec-compliant HW
devices follow this. If a ULP wants to create an address handle from a
completion, there are service routines to accomplish that, based on the
reported fields. If it doesn't care, there is no reason to sacrifice
performance.

--Liran
Re: [PATCH RESEND] infiniband:core:Add needed error path in cm_init_av_by_path
On 12/15/2015 06:30 PM, Jason Gunthorpe wrote:
> On Tue, Dec 15, 2015 at 05:38:34PM +0100, Michael Wang wrote:
>> The hop_limit only suggests that the packet is allowed to be
>> routed, not that it has to be, correct?
>
> If the hop limit is >= 2 (?) then the GRH is mandatory. The
> SM will return this information in the PathRecord if it determines a
> GRH is required. The whole stack follows this protocol.
>
> The GRH is optional for in-subnet communications.

Thanks for the explanation :-)

I've rechecked ib_init_ah_from_path() again, and found it still sets
IB_AH_GRH when the GID cache lookup misses, but with:

  grh.sgid_index    = 0
  grh.flow_label    = 0
  grh.hop_limit     = 0
  grh.traffic_class = 0

Not sure if it's just coincidence: since hop_limit is 0, a router will
discard the packet and the GRH won't be used, while transactions inside
the subnet still work. Could this be designed as an optimization for
cases like the SM reassigning the GID?

BTW, cma_sidr_rep_handler() also calls ib_init_ah_from_path() without
checking the return value.

Regards,
Michael Wang