Re: [PATCH V2 for-next 0/9] Peer-Direct support
On Tue, Nov 11, 2014 at 12:44 AM, Or Gerlitz wrote: > [...] can you make it easily reviewable? that is, either send as patches to > this list or much easier put the relevant piece in public git tree (github?) While it does take some effort to obtain the code, there are currently no plans to submit patches or create a public git tree for CCL direct until all of the lower level software is upstream. > Can you point to the missing elements in the chart present @ this LWN post > http://lwn.net/Articles/564795/ The CCL direct code depends on a lower level API called the Symmetric Communications Interface (SCIF). SCIF abstracts the details of communicating over the PCIe bus while providing an API that is symmetric between the host and coprocessor. The MIC drivers referenced by the LWN post do not include the SCIF API. Once SCIF is upstream, patches for CCL direct can be submitted. The point is that there are actual consumers of peer direct that could use it if it were merged upstream. -jerrie -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] DAPL support on s390x platform
On Wed, Nov 5, 2014 at 7:04 AM, Utz Bacher wrote: > (B) Status of patches > 1. kernel code -- the new system call: reviewed, acked and accepted by the > s390x maintainer Martin Schwidefsky (2014/10/13), so we will have that > system call in the s390x kernel. > 2. libibverbs -- define barriers on s390x: Looking for your feedback. We > understand there have been no general objections so far. > 3. libmlx4 -- provide MMIO abstraction: reviewed by the Mellanox > maintainers and we understand they would apply this once you give the go > for the overall set. > Previously, a patch to DAPL to build on s390x has been accepted already > (Arlin Davis, 2014/09/02). > > We gave your concern on MMIO handling on s390x serious consideration from > various angles, but the page fault handler does not appear workable. OTOH, > Mellanox is fine with the MMIO abstraction in libmlx4, and we didn't hear > of significant other concerns. With that, could you please consider the > patch set again to add s390x to the list of supported platforms? Happy to > repost the patches for convenience. If Mellanox is willing to take on the maintenance burden of changing all MMIO access to an inline function, and if you're willing to take on the burden of knowing that every new adapter you support means tracking down and convincing the maintainer of the driver library, then I'm OK with adding the simple barrier patch to libibverbs. Could you please send the latest version of that patch to me? It does seem a little strange to be adding a new system call to simulate kernel bypass, but I guess you do what you gotta do... - R. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 07/10] xprtrdma: Display async errors
On Tue, Nov 11, 2014 at 8:49 PM, Sagi Grimberg wrote: > On 11/11/2014 6:52 PM, Chuck Lever wrote: >> >> >>> On Nov 11, 2014, at 8:30 AM, Sagi Grimberg >>> wrote: >>> On 11/9/2014 3:15 AM, Chuck Lever wrote: An async error upcall is a hard error, and should be reported in the system log. >>> >>> Could be useful to others... Any chance you put this in ib_core for all >>> of us? >> >> >> Eventually. We certainly wouldn't want copies of this array of strings >> to appear many times in the kernel. That would be a waste of space. >> >> I have a similar patch that adds an array for CQ status codes, and >> xprtrdma has a string array already for connection status. Are those >> also interesting? >> > > Yep, also RDMA_CM events. Would certainly help people avoid source > code navigation to understand what is going on... Oh yes, Chuck, good if you can pick this up, AFAIRemeber most of the strings are already in the RDS code (net/rds) - please re-factor them from there into some IB core helpers, thanks alot!! -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 07/10] xprtrdma: Display async errors
On 11/11/2014 6:52 PM, Chuck Lever wrote: On Nov 11, 2014, at 8:30 AM, Sagi Grimberg wrote: On 11/9/2014 3:15 AM, Chuck Lever wrote: An async error upcall is a hard error, and should be reported in the system log. Could be useful to others... Any chance you put this in ib_core for all of us? Eventually. We certainly wouldn't want copies of this array of strings to appear many times in the kernel. That would be a waste of space. I have a similar patch that adds an array for CQ status codes, and xprtrdma has a string array already for connection status. Are those also interesting? Yep, also RDMA_CM events. Would certainly help people avoid source code navigation to understand what is going on... -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 07/10] xprtrdma: Display async errors
> On Nov 11, 2014, at 8:30 AM, Sagi Grimberg wrote: > >> On 11/9/2014 3:15 AM, Chuck Lever wrote: >> An async error upcall is a hard error, and should be reported in >> the system log. >> > > Could be useful to others... Any chance you put this in ib_core for all > of us? Eventually. We certainly wouldn't want copies of this array of strings to appear many times in the kernel. That would be a waste of space. I have a similar patch that adds an array for CQ status codes, and xprtrdma has a string array already for connection status. Are those also interesting?-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 15/17] IB/mlx5: Handle page faults
This patch implement a page fault handler (leaving the pages pinned as of time being). The page fault handler handles initiator and responder page faults for UD/RC transports, for send/receive operations, as well as RDMA read/write initiator support. Signed-off-by: Sagi Grimberg Signed-off-by: Shachar Raindel Signed-off-by: Haggai Eran --- drivers/infiniband/hw/mlx5/odp.c | 408 +++ include/linux/mlx5/qp.h | 7 + 2 files changed, 415 insertions(+) diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c index 63bbdba396f1..bd1dbe5ebc15 100644 --- a/drivers/infiniband/hw/mlx5/odp.c +++ b/drivers/infiniband/hw/mlx5/odp.c @@ -30,6 +30,9 @@ * SOFTWARE. */ +#include +#include + #include "mlx5_ib.h" struct workqueue_struct *mlx5_ib_page_fault_wq; @@ -85,12 +88,417 @@ static void mlx5_ib_page_fault_resume(struct mlx5_ib_qp *qp, qp->mqp.qpn); } +/* + * Handle a single data segment in a page-fault WQE. + * + * Returns number of pages retrieved on success. The caller will continue to + * the next data segment. + * Can return the following error codes: + * -EAGAIN to designate a temporary error. The caller will abort handling the + * page fault and resolve it. + * -EFAULT when there's an error mapping the requested pages. The caller will + * abort the page fault handling and possibly move the QP to an error state. + * On other errors the QP should also be closed with an error. + */ +static int pagefault_single_data_segment(struct mlx5_ib_qp *qp, +struct mlx5_ib_pfault *pfault, +u32 key, u64 io_virt, size_t bcnt, +u32 *bytes_mapped) +{ + struct mlx5_ib_dev *mib_dev = to_mdev(qp->ibqp.pd->device); + int srcu_key; + unsigned int current_seq; + u64 start_idx; + int npages = 0, ret = 0; + struct mlx5_ib_mr *mr; + u64 access_mask = ODP_READ_ALLOWED_BIT; + + srcu_key = srcu_read_lock(&mib_dev->mr_srcu); + mr = mlx5_ib_odp_find_mr_lkey(mib_dev, key); + /* +* If we didn't find the MR, it means the MR was closed while we were +* handling the ODP event. In this case we return -EFAULT so that the +* QP will be closed. +*/ + if (!mr || !mr->ibmr.pd) { + pr_err("Failed to find relevant mr for lkey=0x%06x, probably the MR was destroyed\n", + key); + ret = -EFAULT; + goto srcu_unlock; + } + if (!mr->umem->odp_data) { + pr_debug("skipping non ODP MR (lkey=0x%06x) in page fault handler.\n", +key); + if (bytes_mapped) + *bytes_mapped += + (bcnt - pfault->mpfault.bytes_committed); + goto srcu_unlock; + } + if (mr->ibmr.pd != qp->ibqp.pd) { + pr_err("Page-fault with different PDs for QP and MR.\n"); + ret = -EFAULT; + goto srcu_unlock; + } + + current_seq = ACCESS_ONCE(mr->umem->odp_data->notifiers_seq); + + /* +* Avoid branches - this code will perform correctly +* in all iterations (in iteration 2 and above, +* bytes_committed == 0). +*/ + io_virt += pfault->mpfault.bytes_committed; + bcnt -= pfault->mpfault.bytes_committed; + + start_idx = (io_virt - (mr->mmr.iova & PAGE_MASK)) >> PAGE_SHIFT; + + if (mr->umem->writable) + access_mask |= ODP_WRITE_ALLOWED_BIT; + npages = ib_umem_odp_map_dma_pages(mr->umem, io_virt, bcnt, + access_mask, current_seq); + if (npages < 0) { + ret = npages; + goto srcu_unlock; + } + + if (npages > 0) { + mutex_lock(&mr->umem->odp_data->umem_mutex); + /* +* No need to check whether the MTTs really belong to +* this MR, since ib_umem_odp_map_dma_pages already +* checks this. +*/ + ret = mlx5_ib_update_mtt(mr, start_idx, npages, 0); + mutex_unlock(&mr->umem->odp_data->umem_mutex); + if (ret < 0) { + pr_err("Failed to update mkey page tables\n"); + goto srcu_unlock; + } + + if (bytes_mapped) { + u32 new_mappings = npages * PAGE_SIZE - + (io_virt - round_down(io_virt, PAGE_SIZE)); + *bytes_mapped += min_t(u32, new_mappings, bcnt); + } + } + +srcu_unlock: + srcu_read_unlock(&mib_dev->mr_srcu, srcu_key); + pfault->mpfault.bytes_committed = 0; + return ret ? ret : npages; +} + +/** + * Parse a series of data segments for page fault handling. + * + * @qp
[PATCH v2 17/17] IB/mlx5: Implement on demand paging by adding support for MMU notifiers
* Implement the relevant invalidation functions (zap MTTs as needed) * Implement interlocking (and rollback in the page fault handlers) for cases of a racing notifier and fault. * With this patch we can now enable the capability bits for supporting RC send/receive/RDMA read/RDMA write, and UD send. Signed-off-by: Sagi Grimberg Signed-off-by: Shachar Raindel Signed-off-by: Haggai Eran --- drivers/infiniband/hw/mlx5/main.c| 4 ++ drivers/infiniband/hw/mlx5/mlx5_ib.h | 3 + drivers/infiniband/hw/mlx5/mr.c | 79 +++-- drivers/infiniband/hw/mlx5/odp.c | 128 --- 4 files changed, 198 insertions(+), 16 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c index a801baa79c8e..8a87404e9c76 100644 --- a/drivers/infiniband/hw/mlx5/main.c +++ b/drivers/infiniband/hw/mlx5/main.c @@ -574,6 +574,10 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct ib_device *ibdev, goto out_count; } +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING + context->ibucontext.invalidate_range = &mlx5_ib_invalidate_range; +#endif + INIT_LIST_HEAD(&context->db_page_list); mutex_init(&context->db_page_mutex); diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h index c6ceec3e3d6a..83f22fe297c8 100644 --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h @@ -325,6 +325,7 @@ struct mlx5_ib_mr { struct mlx5_ib_dev *dev; struct mlx5_create_mkey_mbox_out out; struct mlx5_core_sig_ctx*sig; + int live; }; struct mlx5_ib_fast_reg_page_list { @@ -629,6 +630,8 @@ int __init mlx5_ib_odp_init(void); void mlx5_ib_odp_cleanup(void); void mlx5_ib_qp_disable_pagefaults(struct mlx5_ib_qp *qp); void mlx5_ib_qp_enable_pagefaults(struct mlx5_ib_qp *qp); +void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start, + unsigned long end); #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */ static inline int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev) diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c index 9c9e16cca043..a2dd7bfc129b 100644 --- a/drivers/infiniband/hw/mlx5/mr.c +++ b/drivers/infiniband/hw/mlx5/mr.c @@ -37,6 +37,7 @@ #include #include #include +#include #include #include "mlx5_ib.h" @@ -54,6 +55,18 @@ static DEFINE_MUTEX(mlx5_ib_update_mtt_emergency_buffer_mutex); static int clean_mr(struct mlx5_ib_mr *mr); +static int destroy_mkey(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr) +{ + int err = mlx5_core_destroy_mkey(dev->mdev, &mr->mmr); + +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING + /* Wait until all page fault handlers using the mr complete. */ + synchronize_srcu(&dev->mr_srcu); +#endif + + return err; +} + static int order2idx(struct mlx5_ib_dev *dev, int order) { struct mlx5_mr_cache *cache = &dev->cache; @@ -188,7 +201,7 @@ static void remove_keys(struct mlx5_ib_dev *dev, int c, int num) ent->cur--; ent->size--; spin_unlock_irq(&ent->lock); - err = mlx5_core_destroy_mkey(dev->mdev, &mr->mmr); + err = destroy_mkey(dev, mr); if (err) mlx5_ib_warn(dev, "failed destroy mkey\n"); else @@ -479,7 +492,7 @@ static void clean_keys(struct mlx5_ib_dev *dev, int c) ent->cur--; ent->size--; spin_unlock_irq(&ent->lock); - err = mlx5_core_destroy_mkey(dev->mdev, &mr->mmr); + err = destroy_mkey(dev, mr); if (err) mlx5_ib_warn(dev, "failed destroy mkey\n"); else @@ -809,6 +822,8 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct ib_umem *umem, mr->mmr.size = len; mr->mmr.pd = to_mpd(pd)->pdn; + mr->live = 1; + unmap_dma: up(&umrc->sem); dma_unmap_single(ddev, dma, size, DMA_TO_DEVICE); @@ -994,6 +1009,7 @@ static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, u64 virt_addr, goto err_2; } mr->umem = umem; + mr->live = 1; mlx5_vfree(in); mlx5_ib_dbg(dev, "mkey = 0x%x\n", mr->mmr.key); @@ -1071,10 +1087,47 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, mr->ibmr.lkey = mr->mmr.key; mr->ibmr.rkey = mr->mmr.key; +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING + if (umem->odp_data) { + /* +* This barrier prevents the compiler from moving the +* setting of umem->odp_data->private to point to our +* MR, before reg_umr finished, to ensure that the MR +* initialization have finished before starting to +* handle invalidations. +
[PATCH v2 16/17] IB/mlx5: Add support for RDMA read/write responder page faults
Signed-off-by: Shachar Raindel Signed-off-by: Haggai Eran --- drivers/infiniband/hw/mlx5/odp.c | 79 1 file changed, 79 insertions(+) diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c index bd1dbe5ebc15..936a6cd4ecc7 100644 --- a/drivers/infiniband/hw/mlx5/odp.c +++ b/drivers/infiniband/hw/mlx5/odp.c @@ -35,6 +35,8 @@ #include "mlx5_ib.h" +#define MAX_PREFETCH_LEN (4*1024*1024U) + struct workqueue_struct *mlx5_ib_page_fault_wq; #define COPY_ODP_BIT_MLX_TO_IB(reg, ib_caps, field_name, bit_name) do { \ @@ -490,6 +492,80 @@ resolve_page_fault: free_page((unsigned long)buffer); } +static int pages_in_range(u64 address, u32 length) +{ + return (ALIGN(address + length, PAGE_SIZE) - + (address & PAGE_MASK)) >> PAGE_SHIFT; +} + +static void mlx5_ib_mr_rdma_pfault_handler(struct mlx5_ib_qp *qp, + struct mlx5_ib_pfault *pfault) +{ + struct mlx5_pagefault *mpfault = &pfault->mpfault; + u64 address; + u32 length; + u32 prefetch_len = mpfault->bytes_committed; + int prefetch_activated = 0; + u32 rkey = mpfault->rdma.r_key; + int ret; + + /* The RDMA responder handler handles the page fault in two parts. +* First it brings the necessary pages for the current packet +* (and uses the pfault context), and then (after resuming the QP) +* prefetches more pages. The second operation cannot use the pfault +* context and therefore uses the dummy_pfault context allocated on +* the stack */ + struct mlx5_ib_pfault dummy_pfault = {}; + + dummy_pfault.mpfault.bytes_committed = 0; + + mpfault->rdma.rdma_va += mpfault->bytes_committed; + mpfault->rdma.rdma_op_len -= min(mpfault->bytes_committed, +mpfault->rdma.rdma_op_len); + mpfault->bytes_committed = 0; + + address = mpfault->rdma.rdma_va; + length = mpfault->rdma.rdma_op_len; + + /* For some operations, the hardware cannot tell the exact message +* length, and in those cases it reports zero. Use prefetch +* logic. */ + if (length == 0) { + prefetch_activated = 1; + length = mpfault->rdma.packet_size; + prefetch_len = min(MAX_PREFETCH_LEN, prefetch_len); + } + + ret = pagefault_single_data_segment(qp, pfault, rkey, address, length, + NULL); + if (ret == -EAGAIN) { + /* We're racing with an invalidation, don't prefetch */ + prefetch_activated = 0; + } else if (ret < 0 || pages_in_range(address, length) > ret) { + mlx5_ib_page_fault_resume(qp, pfault, 1); + return; + } + + mlx5_ib_page_fault_resume(qp, pfault, 0); + + /* At this point, there might be a new pagefault already arriving in +* the eq, switch to the dummy pagefault for the rest of the +* processing. We're still OK with the objects being alive as the +* work-queue is being fenced. */ + + if (prefetch_activated) { + ret = pagefault_single_data_segment(qp, &dummy_pfault, rkey, + address, + prefetch_len, + NULL); + if (ret < 0) { + pr_warn("Prefetch failed (ret = %d, prefetch_activated = %d) for QPN %d, address: 0x%.16llx, length = 0x%.16x\n", + ret, prefetch_activated, + qp->ibqp.qp_num, address, prefetch_len); + } + } +} + void mlx5_ib_mr_pfault_handler(struct mlx5_ib_qp *qp, struct mlx5_ib_pfault *pfault) { @@ -499,6 +575,9 @@ void mlx5_ib_mr_pfault_handler(struct mlx5_ib_qp *qp, case MLX5_PFAULT_SUBTYPE_WQE: mlx5_ib_mr_wqe_pfault_handler(qp, pfault); break; + case MLX5_PFAULT_SUBTYPE_RDMA: + mlx5_ib_mr_rdma_pfault_handler(qp, pfault); + break; default: pr_warn("Invalid page fault event subtype: 0x%x\n", event_subtype); -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 13/17] IB/mlx5: Add mlx5_ib_update_mtt to update page tables after creation
The new function allows updating the page tables of a memory region after it was created. This can be used to handle page faults and page invalidations. Since mlx5_ib_update_mtt will need to work from within page invalidation, so it must not block on memory allocation. It employs an atomic memory allocation mechanism that is used as a fallback when kmalloc(GFP_ATOMIC) fails. In order to reuse code from mlx5_ib_populate_pas, the patch splits this function and add the needed parameters. Signed-off-by: Haggai Eran Signed-off-by: Shachar Raindel --- drivers/infiniband/hw/mlx5/mem.c | 19 +++-- drivers/infiniband/hw/mlx5/mlx5_ib.h | 5 ++ drivers/infiniband/hw/mlx5/mr.c | 132 ++- include/linux/mlx5/device.h | 1 + 4 files changed, 149 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c index 5f7b30147180..b56e4c5593ee 100644 --- a/drivers/infiniband/hw/mlx5/mem.c +++ b/drivers/infiniband/hw/mlx5/mem.c @@ -140,12 +140,16 @@ static u64 umem_dma_to_mtt(dma_addr_t umem_dma) * dev - mlx5_ib device * umem - umem to use to fill the pages * page_shift - determines the page size used in the resulting array + * offset - offset into the umem to start from, + * only implemented for ODP umems + * num_pages - total number of pages to fill * pas - bus addresses array to fill * access_flags - access flags to set on all present pages. use enum mlx5_ib_mtt_access_flags for this. */ -void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem, - int page_shift, __be64 *pas, int access_flags) +void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem, + int page_shift, size_t offset, size_t num_pages, + __be64 *pas, int access_flags) { unsigned long umem_page_shift = ilog2(umem->page_size); int shift = page_shift - umem_page_shift; @@ -160,13 +164,11 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem, const bool odp = umem->odp_data != NULL; if (odp) { - int num_pages = ib_umem_num_pages(umem); - WARN_ON(shift != 0); WARN_ON(access_flags != (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE)); for (i = 0; i < num_pages; ++i) { - dma_addr_t pa = umem->odp_data->dma_list[i]; + dma_addr_t pa = umem->odp_data->dma_list[offset + i]; pas[i] = cpu_to_be64(umem_dma_to_mtt(pa)); } @@ -194,6 +196,13 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem, } } +void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem, + int page_shift, __be64 *pas, int access_flags) +{ + return __mlx5_ib_populate_pas(dev, umem, page_shift, 0, + ib_umem_num_pages(umem), pas, + access_flags); +} int mlx5_ib_get_buf_offset(u64 addr, int page_shift, u32 *offset) { u64 page_size; diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h index 83c1690e9dd0..6856e27bfb6a 100644 --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h @@ -527,6 +527,8 @@ struct ib_mr *mlx5_ib_get_dma_mr(struct ib_pd *pd, int acc); struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, u64 virt_addr, int access_flags, struct ib_udata *udata); +int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index, + int npages, int zap); int mlx5_ib_dereg_mr(struct ib_mr *ibmr); int mlx5_ib_destroy_mr(struct ib_mr *ibmr); struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd, @@ -558,6 +560,9 @@ int mlx5_ib_init_fmr(struct mlx5_ib_dev *dev); void mlx5_ib_cleanup_fmr(struct mlx5_ib_dev *dev); void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int *count, int *shift, int *ncont, int *order); +void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem, + int page_shift, size_t offset, size_t num_pages, + __be64 *pas, int access_flags); void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem, int page_shift, __be64 *pas, int access_flags); void mlx5_ib_copy_pas(u64 *old, u64 *new, int step, int num); diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c index fcd531e0758b..e9675325af41 100644 --- a/drivers/infiniband/hw/mlx5/mr.c +++ b/drivers/infiniband/hw/mlx5/mr.c @@ -44,9 +44,13 @@ enum { MAX_PENDING_REG_MR = 8, }; -enum { - MLX5_UMR_ALIGN = 2048 -}; +#define MLX5_UMR_ALIGN 2048 +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING +static __b
[PATCH v2 10/17] net/mlx5_core: Add support for page faults events and low level handling
* Add a handler function pointer in the mlx5_core_qp struct for page fault events. Handle page fault events by calling the handler function, if not NULL. * Add on-demand paging capability query command. * Export command for resuming QPs after page faults. * Add various constants related to paging support. Signed-off-by: Sagi Grimberg Signed-off-by: Shachar Raindel Signed-off-by: Haggai Eran --- drivers/infiniband/hw/mlx5/mr.c | 6 +- drivers/infiniband/hw/mlx5/qp.c | 4 +- drivers/net/ethernet/mellanox/mlx5/core/eq.c | 13 ++- drivers/net/ethernet/mellanox/mlx5/core/fw.c | 39 + drivers/net/ethernet/mellanox/mlx5/core/qp.c | 119 +++ include/linux/mlx5/device.h | 58 - include/linux/mlx5/driver.h | 12 +++ include/linux/mlx5/qp.h | 55 + 8 files changed, 299 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c index aee3527030ac..d69db8d7d227 100644 --- a/drivers/infiniband/hw/mlx5/mr.c +++ b/drivers/infiniband/hw/mlx5/mr.c @@ -147,7 +147,7 @@ static int add_keys(struct mlx5_ib_dev *dev, int c, int num) mr->order = ent->order; mr->umred = 1; mr->dev = dev; - in->seg.status = 1 << 6; + in->seg.status = MLX5_MKEY_STATUS_FREE; in->seg.xlt_oct_size = cpu_to_be32((npages + 1) / 2); in->seg.qpn_mkey7_0 = cpu_to_be32(0xff << 8); in->seg.flags = MLX5_ACCESS_MODE_MTT | MLX5_PERM_UMR_EN; @@ -1033,7 +1033,7 @@ struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd, goto err_free; } - in->seg.status = 1 << 6; /* free */ + in->seg.status = MLX5_MKEY_STATUS_FREE; in->seg.xlt_oct_size = cpu_to_be32(ndescs); in->seg.qpn_mkey7_0 = cpu_to_be32(0xff << 8); in->seg.flags_pd = cpu_to_be32(to_mpd(pd)->pdn); @@ -1148,7 +1148,7 @@ struct ib_mr *mlx5_ib_alloc_fast_reg_mr(struct ib_pd *pd, goto err_free; } - in->seg.status = 1 << 6; /* free */ + in->seg.status = MLX5_MKEY_STATUS_FREE; in->seg.xlt_oct_size = cpu_to_be32((max_page_list_len + 1) / 2); in->seg.qpn_mkey7_0 = cpu_to_be32(0xff << 8); in->seg.flags = MLX5_PERM_UMR_EN | MLX5_ACCESS_MODE_MTT; diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c index 455d40779112..d61e4ef73c34 100644 --- a/drivers/infiniband/hw/mlx5/qp.c +++ b/drivers/infiniband/hw/mlx5/qp.c @@ -1982,7 +1982,7 @@ static void set_mkey_segment(struct mlx5_mkey_seg *seg, struct ib_send_wr *wr, { memset(seg, 0, sizeof(*seg)); if (li) { - seg->status = 1 << 6; + seg->status = MLX5_MKEY_STATUS_FREE; return; } @@ -2003,7 +2003,7 @@ static void set_reg_mkey_segment(struct mlx5_mkey_seg *seg, struct ib_send_wr *w memset(seg, 0, sizeof(*seg)); if (wr->send_flags & MLX5_IB_SEND_UMR_UNREG) { - seg->status = 1 << 6; + seg->status = MLX5_MKEY_STATUS_FREE; return; } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c index ad2c96a02a53..44cc16d4eff7 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c @@ -157,6 +157,8 @@ static const char *eqe_type_str(u8 type) return "MLX5_EVENT_TYPE_CMD"; case MLX5_EVENT_TYPE_PAGE_REQUEST: return "MLX5_EVENT_TYPE_PAGE_REQUEST"; + case MLX5_EVENT_TYPE_PAGE_FAULT: + return "MLX5_EVENT_TYPE_PAGE_FAULT"; default: return "Unrecognized event"; } @@ -279,6 +281,11 @@ static int mlx5_eq_int(struct mlx5_core_dev *dev, struct mlx5_eq *eq) } break; +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING + case MLX5_EVENT_TYPE_PAGE_FAULT: + mlx5_eq_pagefault(dev, eqe); + break; +#endif default: mlx5_core_warn(dev, "Unhandled event 0x%x on EQ 0x%x\n", @@ -446,8 +453,12 @@ void mlx5_eq_cleanup(struct mlx5_core_dev *dev) int mlx5_start_eqs(struct mlx5_core_dev *dev) { struct mlx5_eq_table *table = &dev->priv.eq_table; + u32 async_event_mask = MLX5_ASYNC_EVENT_MASK; int err; + if (dev->caps.gen.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG) + async_event_mask |= (1ull << MLX5_EVENT_TYPE_PAGE_FAULT); + err = mlx5_create_map_eq(dev, &table->cmd_eq, MLX5_EQ_VEC_CMD, MLX5_NUM_CMD_EQE, 1ull << MLX5_EVENT_TYPE_CMD, "mlx5_cmd_eq", &dev->priv.uuari.uars[0]); @@ -459,7 +470,7 @@ int mlx5_start_eqs(struct mlx5_core_dev *dev) mlx5_cmd_use_
[PATCH v2 05/17] IB/mlx5: Add function to read WQE from user-space
Add a helper function mlx5_ib_read_user_wqe to read information from user-space owned work queues. The function will be used in a later patch by the page-fault handling code in mlx5_ib. Signed-off-by: Haggai Eran --- drivers/infiniband/hw/mlx5/mlx5_ib.h | 2 + drivers/infiniband/hw/mlx5/qp.c | 73 include/linux/mlx5/qp.h | 3 ++ 3 files changed, 78 insertions(+) diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h index 53d19e6e69a4..14a0311eaa1c 100644 --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h @@ -503,6 +503,8 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr, int mlx5_ib_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr, struct ib_recv_wr **bad_wr); void *mlx5_get_send_wqe(struct mlx5_ib_qp *qp, int n); +int mlx5_ib_read_user_wqe(struct mlx5_ib_qp *qp, int send, int wqe_index, + void *buffer, u32 length); struct ib_cq *mlx5_ib_create_cq(struct ib_device *ibdev, int entries, int vector, struct ib_ucontext *context, struct ib_udata *udata); diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c index 7f362afa1a38..455d40779112 100644 --- a/drivers/infiniband/hw/mlx5/qp.c +++ b/drivers/infiniband/hw/mlx5/qp.c @@ -101,6 +101,79 @@ void *mlx5_get_send_wqe(struct mlx5_ib_qp *qp, int n) return get_wqe(qp, qp->sq.offset + (n << MLX5_IB_SQ_STRIDE)); } +/** + * mlx5_ib_read_user_wqe() - Copy a user-space WQE to kernel space. + * + * @qp: QP to copy from. + * @send: copy from the send queue when non-zero, use the receive queue + * otherwise. + * @wqe_index: index to start copying from. For send work queues, the + * wqe_index is in units of MLX5_SEND_WQE_BB. + * For receive work queue, it is the number of work queue + * element in the queue. + * @buffer: destination buffer. + * @length: maximum number of bytes to copy. + * + * Copies at least a single WQE, but may copy more data. + * + * Return: the number of bytes copied, or an error code. + */ +int mlx5_ib_read_user_wqe(struct mlx5_ib_qp *qp, int send, int wqe_index, + void *buffer, u32 length) +{ + struct ib_device *ibdev = qp->ibqp.device; + struct mlx5_ib_dev *dev = to_mdev(ibdev); + struct mlx5_ib_wq *wq = send ? &qp->sq : &qp->rq; + size_t offset; + size_t wq_end; + struct ib_umem *umem = qp->umem; + u32 first_copy_length; + int wqe_length; + int copied; + int ret; + + if (wq->wqe_cnt == 0) { + mlx5_ib_dbg(dev, "mlx5_ib_read_user_wqe for a QP with wqe_cnt == 0. qp_type: 0x%x\n", + qp->ibqp.qp_type); + return -EINVAL; + } + + offset = wq->offset + ((wqe_index % wq->wqe_cnt) << wq->wqe_shift); + wq_end = wq->offset + (wq->wqe_cnt << wq->wqe_shift); + + if (send && length < sizeof(struct mlx5_wqe_ctrl_seg)) + return -EINVAL; + + if (offset > umem->length || + (send && offset + sizeof(struct mlx5_wqe_ctrl_seg) > umem->length)) + return -EINVAL; + + first_copy_length = min_t(u32, offset + length, wq_end) - offset; + copied = ib_umem_copy_from(umem, offset, buffer, first_copy_length); + if (copied < first_copy_length) + return copied; + + if (send) { + struct mlx5_wqe_ctrl_seg *ctrl = buffer; + int ds = be32_to_cpu(ctrl->qpn_ds) & MLX5_WQE_CTRL_DS_MASK; + + wqe_length = ds * MLX5_WQE_DS_UNITS; + } else { + wqe_length = 1 << wq->wqe_shift; + } + + if (wqe_length <= first_copy_length) + return first_copy_length; + + ret = ib_umem_copy_from(umem, wq->offset, buffer + first_copy_length, + wqe_length - first_copy_length); + if (ret < 0) + return ret; + copied += ret; + + return copied; +} + static void mlx5_ib_qp_event(struct mlx5_core_qp *qp, int type) { struct ib_qp *ibqp = &to_mibqp(qp)->ibqp; diff --git a/include/linux/mlx5/qp.h b/include/linux/mlx5/qp.h index 3fa075daeb1d..67f4b9660b06 100644 --- a/include/linux/mlx5/qp.h +++ b/include/linux/mlx5/qp.h @@ -189,6 +189,9 @@ struct mlx5_wqe_ctrl_seg { __be32 imm; }; +#define MLX5_WQE_CTRL_DS_MASK 0x3f +#define MLX5_WQE_DS_UNITS 16 + struct mlx5_wqe_xrc_seg { __be32 xrc_srqn; u8 rsvd[12]; -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 14/17] IB/mlx5: Page faults handling infrastructure
* Refactor MR registration and cleanup, and fix reg_pages accounting. * Create a work queue to handle page fault events in a kthread context. * Register a fault handler to get events from the core for each QP. The registered fault handler is empty in this patch, and only a later patch implements it. Signed-off-by: Sagi Grimberg Signed-off-by: Shachar Raindel Signed-off-by: Haggai Eran --- drivers/infiniband/hw/mlx5/main.c| 31 +++- drivers/infiniband/hw/mlx5/mlx5_ib.h | 67 +++- drivers/infiniband/hw/mlx5/mr.c | 45 +++ drivers/infiniband/hw/mlx5/odp.c | 145 +++ drivers/infiniband/hw/mlx5/qp.c | 26 ++- include/linux/mlx5/driver.h | 2 +- 6 files changed, 294 insertions(+), 22 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c index e6d775f2446d..a801baa79c8e 100644 --- a/drivers/infiniband/hw/mlx5/main.c +++ b/drivers/infiniband/hw/mlx5/main.c @@ -864,7 +864,7 @@ static ssize_t show_reg_pages(struct device *device, struct mlx5_ib_dev *dev = container_of(device, struct mlx5_ib_dev, ib_dev.dev); - return sprintf(buf, "%d\n", dev->mdev->priv.reg_pages); + return sprintf(buf, "%d\n", atomic_read(&dev->mdev->priv.reg_pages)); } static ssize_t show_hca(struct device *device, struct device_attribute *attr, @@ -1389,16 +1389,19 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev) goto err_eqs; mutex_init(&dev->cap_mask_mutex); - spin_lock_init(&dev->mr_lock); err = create_dev_resources(&dev->devr); if (err) goto err_eqs; - err = ib_register_device(&dev->ib_dev, NULL); + err = mlx5_ib_odp_init_one(dev); if (err) goto err_rsrc; + err = ib_register_device(&dev->ib_dev, NULL); + if (err) + goto err_odp; + err = create_umr_res(dev); if (err) goto err_dev; @@ -1420,6 +1423,9 @@ err_umrc: err_dev: ib_unregister_device(&dev->ib_dev); +err_odp: + mlx5_ib_odp_remove_one(dev); + err_rsrc: destroy_dev_resources(&dev->devr); @@ -1435,8 +1441,10 @@ err_dealloc: static void mlx5_ib_remove(struct mlx5_core_dev *mdev, void *context) { struct mlx5_ib_dev *dev = context; + ib_unregister_device(&dev->ib_dev); destroy_umrc_res(dev); + mlx5_ib_odp_remove_one(dev); destroy_dev_resources(&dev->devr); free_comp_eqs(dev); ib_dealloc_device(&dev->ib_dev); @@ -1450,15 +1458,30 @@ static struct mlx5_interface mlx5_ib_interface = { static int __init mlx5_ib_init(void) { + int err; + if (deprecated_prof_sel != 2) pr_warn("prof_sel is deprecated for mlx5_ib, set it for mlx5_core\n"); - return mlx5_register_interface(&mlx5_ib_interface); + err = mlx5_ib_odp_init(); + if (err) + return err; + + err = mlx5_register_interface(&mlx5_ib_interface); + if (err) + goto clean_odp; + + return err; + +clean_odp: + mlx5_ib_odp_cleanup(); + return err; } static void __exit mlx5_ib_cleanup(void) { mlx5_unregister_interface(&mlx5_ib_interface); + mlx5_ib_odp_cleanup(); } module_init(mlx5_ib_init); diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h index 6856e27bfb6a..c6ceec3e3d6a 100644 --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h @@ -149,6 +149,29 @@ enum { MLX5_QP_EMPTY }; +/* + * Connect-IB can trigger up to four concurrent pagefaults + * per-QP. + */ +enum mlx5_ib_pagefault_context { + MLX5_IB_PAGEFAULT_RESPONDER_READ, + MLX5_IB_PAGEFAULT_REQUESTOR_READ, + MLX5_IB_PAGEFAULT_RESPONDER_WRITE, + MLX5_IB_PAGEFAULT_REQUESTOR_WRITE, + MLX5_IB_PAGEFAULT_CONTEXTS +}; + +static inline enum mlx5_ib_pagefault_context + mlx5_ib_get_pagefault_context(struct mlx5_pagefault *pagefault) +{ + return pagefault->flags & (MLX5_PFAULT_REQUESTOR | MLX5_PFAULT_WRITE); +} + +struct mlx5_ib_pfault { + struct work_struct work; + struct mlx5_pagefault mpfault; +}; + struct mlx5_ib_qp { struct ib_qpibqp; struct mlx5_core_qp mqp; @@ -194,6 +217,21 @@ struct mlx5_ib_qp { /* Store signature errors */ boolsignature_en; + +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING + /* +* A flag that is true for QP's that are in a state that doesn't +* allow page faults, and shouldn't schedule any more faults. +*/ + int disable_page_faults; + /* +* The disable_page_faults_lock protects a QP's disable_page_faults +* field, allowing for a thread to atomically check whether the QP +* allows page faults, and if so schedule a page fault. +
[PATCH v2 01/17] IB/mlx5: Remove per-MR pas and dma pointers
Since UMR code now uses its own context struct on the stack, the pas and dma pointers for the UMR operation that remained in the mlx5_ib_mr struct are not necessary. This patch removes them. Fixes: a74d24168d2d ("IB/mlx5: Refactor UMR to have its own context struct") Signed-off-by: Haggai Eran --- drivers/infiniband/hw/mlx5/mlx5_ib.h | 2 -- drivers/infiniband/hw/mlx5/mr.c | 21 - 2 files changed, 12 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h index 386780f0d1e1..29da55222070 100644 --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h @@ -261,8 +261,6 @@ struct mlx5_ib_mr { struct list_headlist; int order; int umred; - __be64 *pas; - dma_addr_t dma; int npages; struct mlx5_ib_dev *dev; struct mlx5_create_mkey_mbox_out out; diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c index 8ee7cb46e059..610500810f75 100644 --- a/drivers/infiniband/hw/mlx5/mr.c +++ b/drivers/infiniband/hw/mlx5/mr.c @@ -740,6 +740,8 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct ib_umem *umem, struct mlx5_ib_mr *mr; struct ib_sge sg; int size = sizeof(u64) * npages; + __be64 *mr_pas; + dma_addr_t dma; int err = 0; int i; @@ -758,25 +760,26 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct ib_umem *umem, if (!mr) return ERR_PTR(-EAGAIN); - mr->pas = kmalloc(size + MLX5_UMR_ALIGN - 1, GFP_KERNEL); - if (!mr->pas) { + mr_pas = kmalloc(size + MLX5_UMR_ALIGN - 1, GFP_KERNEL); + if (!mr_pas) { err = -ENOMEM; goto free_mr; } mlx5_ib_populate_pas(dev, umem, page_shift, -mr_align(mr->pas, MLX5_UMR_ALIGN), 1); +mr_align(mr_pas, MLX5_UMR_ALIGN), 1); - mr->dma = dma_map_single(ddev, mr_align(mr->pas, MLX5_UMR_ALIGN), size, -DMA_TO_DEVICE); - if (dma_mapping_error(ddev, mr->dma)) { + dma = dma_map_single(ddev, mr_align(mr_pas, MLX5_UMR_ALIGN), size, +DMA_TO_DEVICE); + if (dma_mapping_error(ddev, dma)) { err = -ENOMEM; goto free_pas; } memset(&wr, 0, sizeof(wr)); wr.wr_id = (u64)(unsigned long)&umr_context; - prep_umr_reg_wqe(pd, &wr, &sg, mr->dma, npages, mr->mmr.key, page_shift, virt_addr, len, access_flags); + prep_umr_reg_wqe(pd, &wr, &sg, dma, npages, mr->mmr.key, page_shift, +virt_addr, len, access_flags); mlx5_ib_init_umr_context(&umr_context); down(&umrc->sem); @@ -798,10 +801,10 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct ib_umem *umem, unmap_dma: up(&umrc->sem); - dma_unmap_single(ddev, mr->dma, size, DMA_TO_DEVICE); + dma_unmap_single(ddev, dma, size, DMA_TO_DEVICE); free_pas: - kfree(mr->pas); + kfree(mr_pas); free_mr: if (err) { -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 07/17] IB/core: Add flags for on demand paging support
From: Sagi Grimberg * Add a configuration option for enable on-demand paging support in the infiniband subsystem (CONFIG_INFINIBAND_ON_DEMAND_PAGING). In a later patch, this configuration option will select the MMU_NOTIFIER configuration option to enable mmu notifiers. * Add a flag for on demand paging (ODP) support in the IB device capabilities. * Add a flag to request ODP MR in the access flags to reg_mr. * Fail registrations done with the ODP flag when the low-level driver doesn't support this. * Change the conditions in which an MR will be writable to explicitly specify the access flags. This is to avoid making an MR writable just because it is an ODP MR. * Add a ODP capabilities to the extended query device verb. Signed-off-by: Sagi Grimberg Signed-off-by: Shachar Raindel Signed-off-by: Haggai Eran --- drivers/infiniband/Kconfig | 10 ++ drivers/infiniband/core/umem.c | 8 +--- drivers/infiniband/core/uverbs_cmd.c | 25 + include/rdma/ib_verbs.h | 28 ++-- include/uapi/rdma/ib_user_verbs.h| 16 5 files changed, 82 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 77089399359b..089a2c2af329 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -38,6 +38,16 @@ config INFINIBAND_USER_MEM depends on INFINIBAND_USER_ACCESS != n default y +config INFINIBAND_ON_DEMAND_PAGING + bool "InfiniBand on-demand paging support" + depends on INFINIBAND_USER_MEM + default y + ---help--- + On demand paging support for the InfiniBand subsystem. + Together with driver support this allows registration of + memory regions without pinning their pages, fetching the + pages on demand instead. + config INFINIBAND_ADDR_TRANS bool depends on INFINIBAND diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 77bec75963e7..a140b2d4d94e 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -107,13 +107,15 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, umem->page_size = PAGE_SIZE; umem->pid = get_task_pid(current, PIDTYPE_PID); /* -* We ask for writable memory if any access flags other than -* "remote read" are set. "Local write" and "remote write" +* We ask for writable memory if any of the following +* access flags are set. "Local write" and "remote write" * obviously require write access. "Remote atomic" can do * things like fetch and add, which will modify memory, and * "MW bind" can change permissions by binding a window. */ - umem->writable = !!(access & ~IB_ACCESS_REMOTE_READ); + umem->writable = !!(access & + (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE | +IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_MW_BIND)); /* We assume the memory is from hugetlb until proved otherwise */ umem->hugetlb = 1; diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 74ad0d0de92b..46b60086a4bf 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -953,6 +953,18 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file, goto err_free; } + if (cmd.access_flags & IB_ACCESS_ON_DEMAND) { + struct ib_device_attr attr; + + ret = ib_query_device(pd->device, &attr); + if (ret || !(attr.device_cap_flags & + IB_DEVICE_ON_DEMAND_PAGING)) { + pr_debug("ODP support not available\n"); + ret = -EINVAL; + goto err_put; + } + } + mr = pd->device->reg_user_mr(pd, cmd.start, cmd.length, cmd.hca_va, cmd.access_flags, &udata); if (IS_ERR(mr)) { @@ -3286,6 +3298,19 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file *file, copy_query_dev_fields(file, &resp.base, &attr); resp.comp_mask = 0; +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING + if (cmd.comp_mask & IB_USER_VERBS_EX_QUERY_DEVICE_ODP) { + resp.odp_caps.general_caps = attr.odp_caps.general_caps; + resp.odp_caps.per_transport_caps.rc_odp_caps = + attr.odp_caps.per_transport_caps.rc_odp_caps; + resp.odp_caps.per_transport_caps.uc_odp_caps = + attr.odp_caps.per_transport_caps.uc_odp_caps; + resp.odp_caps.per_transport_caps.ud_odp_caps = + attr.odp_caps.per_transport_caps.ud_odp_caps; + resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP; + } +#endif + err = ib_copy_to_udata(ucore, &res
[PATCH v2 11/17] IB/mlx5: Implement the ODP capability query verb
The patch adds infrastructure to query ODP capabilities in the mlx5 driver. The code will read the capabilities from the device, and enable only those capabilities that both the driver and the device supports. At this point ODP is not supported, so no capability is copied from the device, but the patch exposes the global ODP device capability bit. Signed-off-by: Shachar Raindel Signed-off-by: Haggai Eran --- drivers/infiniband/hw/mlx5/Makefile | 1 + drivers/infiniband/hw/mlx5/main.c| 10 ++ drivers/infiniband/hw/mlx5/mlx5_ib.h | 12 drivers/infiniband/hw/mlx5/odp.c | 60 4 files changed, 83 insertions(+) create mode 100644 drivers/infiniband/hw/mlx5/odp.c diff --git a/drivers/infiniband/hw/mlx5/Makefile b/drivers/infiniband/hw/mlx5/Makefile index 4ea0135af484..27a70159e2ea 100644 --- a/drivers/infiniband/hw/mlx5/Makefile +++ b/drivers/infiniband/hw/mlx5/Makefile @@ -1,3 +1,4 @@ obj-$(CONFIG_MLX5_INFINIBAND) += mlx5_ib.o mlx5_ib-y := main.o cq.o doorbell.o qp.o mem.o srq.o mr.o ah.o mad.o +mlx5_ib-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += odp.o diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c index 1ba6c42e4df8..e6d775f2446d 100644 --- a/drivers/infiniband/hw/mlx5/main.c +++ b/drivers/infiniband/hw/mlx5/main.c @@ -244,6 +244,12 @@ static int mlx5_ib_query_device(struct ib_device *ibdev, props->max_mcast_grp; props->max_map_per_fmr = INT_MAX; /* no limit in ConnectIB */ +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING + if (dev->mdev->caps.gen.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG) + props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING; + props->odp_caps = dev->odp_caps; +#endif + out: kfree(in_mad); kfree(out_mad); @@ -1321,6 +1327,8 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev) (1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) | (1ull << IB_USER_VERBS_CMD_CREATE_XSRQ) | (1ull << IB_USER_VERBS_CMD_OPEN_QP); + dev->ib_dev.uverbs_ex_cmd_mask = + (1ull << IB_USER_VERBS_EX_CMD_QUERY_DEVICE); dev->ib_dev.query_device= mlx5_ib_query_device; dev->ib_dev.query_port = mlx5_ib_query_port; @@ -1366,6 +1374,8 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev) dev->ib_dev.free_fast_reg_page_list = mlx5_ib_free_fast_reg_page_list; dev->ib_dev.check_mr_status = mlx5_ib_check_mr_status; + mlx5_ib_internal_query_odp_caps(dev); + if (mdev->caps.gen.flags & MLX5_DEV_CAP_FLAG_XRC) { dev->ib_dev.alloc_xrcd = mlx5_ib_alloc_xrcd; dev->ib_dev.dealloc_xrcd = mlx5_ib_dealloc_xrcd; diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h index 14a0311eaa1c..cc50fce8cca7 100644 --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h @@ -390,6 +390,9 @@ struct mlx5_ib_dev { struct mlx5_mr_cachecache; struct timer_list delay_timer; int fill_delay; +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING + struct ib_odp_caps odp_caps; +#endif }; static inline struct mlx5_ib_cq *to_mibcq(struct mlx5_core_cq *mcq) @@ -559,6 +562,15 @@ void mlx5_umr_cq_handler(struct ib_cq *cq, void *cq_context); int mlx5_ib_check_mr_status(struct ib_mr *ibmr, u32 check_mask, struct ib_mr_status *mr_status); +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING +int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev); +#else +static inline int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev) +{ + return 0; +} +#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */ + static inline void init_query_mad(struct ib_smp *mad) { mad->base_version = 1; diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c new file mode 100644 index ..66c39ee16aff --- /dev/null +++ b/drivers/infiniband/hw/mlx5/odp.c @@ -0,0 +1,60 @@ +/* + * Copyright (c) 2014 Mellanox Technologies. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + *copyright notice, this list of conditions and the following + *disclaimer. + * + * - Redistributions in binary form must reproduce the above + *copyright notice, this list of conditions and the following + *
[PATCH v2 09/17] IB/core: Implement support for MMU notifiers regarding on demand paging regions
* Add an interval tree implementation for ODP umems. Create an interval tree for each ucontext (including a count of the number of ODP MRs in this context, semaphore, etc.), and register ODP umems in the interval tree. * Add MMU notifiers handling functions, using the interval tree to notify only the relevant umems and underlying MRs. * Register to receive MMU notifier events from the MM subsystem upon ODP MR registration (and unregister accordingly). * Add a completion object to synchronize the destruction of ODP umems. * Add mechanism to abort page faults when there's a concurrent invalidation. The way we synchronize between concurrent invalidations and page faults is by keeping a counter of currently running invalidations, and a sequence number that is incremented whenever an invalidation is caught. The page fault code checks the counter and also verifies that the sequence number hasn't progressed before it updates the umem's page tables. This is similar to what the kvm module does. In order to prevent the case where we register a umem in the middle of an ongoing notifier, we also keep a per ucontext counter of the total number of active mmu notifiers. We only enable new umems when all the running notifiers complete. Signed-off-by: Sagi Grimberg Signed-off-by: Shachar Raindel Signed-off-by: Haggai Eran Signed-off-by: Yuval Dagan --- drivers/infiniband/Kconfig| 1 + drivers/infiniband/core/Makefile | 2 +- drivers/infiniband/core/umem.c| 2 +- drivers/infiniband/core/umem_odp.c| 380 +- drivers/infiniband/core/umem_rbtree.c | 94 + drivers/infiniband/core/uverbs_cmd.c | 17 ++ include/rdma/ib_umem_odp.h| 65 +- include/rdma/ib_verbs.h | 19 ++ 8 files changed, 567 insertions(+), 13 deletions(-) create mode 100644 drivers/infiniband/core/umem_rbtree.c diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index 089a2c2af329..b899531498eb 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -41,6 +41,7 @@ config INFINIBAND_USER_MEM config INFINIBAND_ON_DEMAND_PAGING bool "InfiniBand on-demand paging support" depends on INFINIBAND_USER_MEM + select MMU_NOTIFIER default y ---help--- On demand paging support for the InfiniBand subsystem. diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index c58f7913c560..acf736764445 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -11,7 +11,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o \ ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o netlink.o ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o -ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o +ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o ib_mad-y :=mad.o smi.o agent.o mad_rmpp.o diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index 45d7794c7a2b..2a173ae3522e 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -72,7 +72,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d * ib_umem_get - Pin and DMA map userspace memory. * * If access flags indicate ODP memory, avoid pinning. Instead, stores - * the mm for future page fault handling. + * the mm for future page fault handling in conjunction with MMU notifiers. * * @context: userspace context to pin memory for * @addr: userspace virtual address to start at diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c index b1a6a44439a2..6095872549e7 100644 --- a/drivers/infiniband/core/umem_odp.c +++ b/drivers/infiniband/core/umem_odp.c @@ -41,26 +41,235 @@ #include #include +static void ib_umem_notifier_start_account(struct ib_umem *item) +{ + mutex_lock(&item->odp_data->umem_mutex); + + /* Only update private counters for this umem if it has them. +* Otherwise skip it. All page faults will be delayed for this umem. */ + if (item->odp_data->mn_counters_active) { + int notifiers_count = item->odp_data->notifiers_count++; + + if (notifiers_count == 0) + /* Initialize the completion object for waiting on +* notifiers. Since notifier_count is zero, no one +* should be waiting right now. */ + reinit_completion(&item->odp_data->notifier_completion); + } + mutex_unlock(&item->odp_data->umem_mutex); +} + +static void ib_umem_notifier_end_account(struct ib_umem *item) +{ + mutex_lock(&item->odp_data->umem_mutex); + + /* Only update private counters for this umem if it has them. +* Otherwise skip it. All page faults will be de
[PATCH v2 08/17] IB/core: Add support for on demand paging regions
From: Shachar Raindel * Extend the umem struct to keep the ODP related data. * Allocate and initialize the ODP related information in the umem (page_list, dma_list) and freeing as needed in the end of the run. * Store a reference to the process PID struct in the ucontext. Used to safely obtain the task_struct and the mm during fault handling, without preventing the task destruction if needed. * Add 2 helper functions: ib_umem_odp_map_dma_pages and ib_umem_odp_unmap_dma_pages. These functions get the DMA addresses of specific pages of the umem (and, currently, pin them). * Support for page faults only - IB core will keep the reference on the pages used and call put_page when freeing an ODP umem area. Invalidations support will be added in a later patch. Signed-off-by: Sagi Grimberg Signed-off-by: Shachar Raindel Signed-off-by: Haggai Eran Signed-off-by: Majd Dibbiny --- drivers/infiniband/core/Makefile | 1 + drivers/infiniband/core/umem.c| 24 +++ drivers/infiniband/core/umem_odp.c| 308 ++ drivers/infiniband/core/uverbs_cmd.c | 5 + drivers/infiniband/core/uverbs_main.c | 2 + include/rdma/ib_umem.h| 2 + include/rdma/ib_umem_odp.h| 97 +++ include/rdma/ib_verbs.h | 2 + 8 files changed, 441 insertions(+) create mode 100644 drivers/infiniband/core/umem_odp.c create mode 100644 include/rdma/ib_umem_odp.h diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index ffd0af6734af..c58f7913c560 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -11,6 +11,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) += ib_uverbs.o ib_ucm.o \ ib_core-y := packer.o ud_header.o verbs.o sysfs.o \ device.o fmr_pool.o cache.o netlink.o ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o +ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o ib_mad-y :=mad.o smi.o agent.o mad_rmpp.o diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index a140b2d4d94e..45d7794c7a2b 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -39,6 +39,7 @@ #include #include #include +#include #include "uverbs.h" @@ -69,6 +70,10 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d /** * ib_umem_get - Pin and DMA map userspace memory. + * + * If access flags indicate ODP memory, avoid pinning. Instead, stores + * the mm for future page fault handling. + * * @context: userspace context to pin memory for * @addr: userspace virtual address to start at * @size: length of region to pin @@ -117,6 +122,17 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, (IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_MW_BIND)); + if (access & IB_ACCESS_ON_DEMAND) { + ret = ib_umem_odp_get(context, umem); + if (ret) { + kfree(umem); + return ERR_PTR(ret); + } + return umem; + } + + umem->odp_data = NULL; + /* We assume the memory is from hugetlb until proved otherwise */ umem->hugetlb = 1; @@ -237,6 +253,11 @@ void ib_umem_release(struct ib_umem *umem) struct task_struct *task; unsigned long diff; + if (umem->odp_data) { + ib_umem_odp_release(umem); + return; + } + __ib_umem_release(umem->context->device, umem, 1); task = get_pid_task(umem->pid, PIDTYPE_PID); @@ -285,6 +306,9 @@ int ib_umem_page_count(struct ib_umem *umem) int n; struct scatterlist *sg; + if (umem->odp_data) + return ib_umem_num_pages(umem); + shift = ilog2(umem->page_size); n = 0; diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c new file mode 100644 index ..b1a6a44439a2 --- /dev/null +++ b/drivers/infiniband/core/umem_odp.c @@ -0,0 +1,308 @@ +/* + * Copyright (c) 2014 Mellanox Technologies. All rights reserved. + * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + *copyright notice, this list of conditions and the following + *disclaimer. + * + * - Redistributions in binary form must reproduce the above + *
[PATCH v2 00/17] On demand paging
Hi Roland, Following your comments, we have modified the patch set to eliminate the possibility of a negative notifiers counter, and removed the atomic accesses where possible. The new set have been rebased against upstream, and also contains some minor fixes detailed below. We have isolated 5 patches that we think can be taken as a preliminary patches for the rest of the series. These are the first 5 patches: IB/mlx5: Remove per-MR pas and dma pointers IB/mlx5: Enhance UMR support to allow partial page table update IB/core: Replace ib_umem's offset field with a full address IB/core: Add umem function to read data from user-space IB/mlx5: Add function to read WQE from user-space Regards, Haggai Changes from V1: http://www.spinics.net/lists/linux-rdma/msg20734.html - Rebased against latest upstream (3.18-rc2). - Added patch 1: remove the mr dma and pas fields which are no longer needed. - Replace extended query device patch 1 with Eli Cohen's recent submission from the extended atomic series [1]. - Patch 3: respect umem's page size when calculating offset and start address. - Patch 8: fix error handling in ib_umem_odp_map_dma_pages - Patch 9: - Add a global mmu notifier counter (per ucontext) to prevent the race that existed in v1. - Make accesses to the per-umem notifier counters non-atomic (use ACCESS_ONCE). - Rename ucontext->umem_mutex as ucontext->umem_rwsem to reflect it being a semaphore. - Patch 15: fix error handling in pagefault_single_data_segment - Patch 17: timeout when waiting for an active mmu notifier to complete - Add RC RDMA read support to the patch-set. - Minor fixes. Changes from V0: http://marc.info/?l=linux-rdma&m=139375790322547&w=2 - Rebased against latest upstream / for-next branch. - Removed dependency on patches that were accepted upstream. - Removed pre-patches that were accepted upstream [2]. - Add extended uverb call for querying device (patch 1) and use kernel device attributes to report ODP capabilities through the new uverb entry instead of having a special verb. - Allow upgrading page access permissions during page faults. - Minor fixes to issues that came up during regression testing of the patches. The following set of patches implements on-demand paging (ODP) support in the RDMA stack and in the mlx5_ib Infiniband driver. What is on-demand paging? Applications register memory with an RDMA adapter using system calls, and subsequently post IO operations that refer to the corresponding virtual addresses directly to HW. Until now, this was achieved by pinning the memory during the registration calls. The goal of on demand paging is to avoid pinning the pages of registered memory regions (MRs). This will allow users the same flexibility they get when swapping any other part of their processes address spaces. Instead of requiring the entire MR to fit in physical memory, we can allow the MR to be larger, and only fit the current working set in physical memory. This can make programming with RDMA much simpler. Today, developers that are working with more data than their RAM can hold need either to deregister and reregister memory regions throughout their process's life, or keep a single memory region and copy the data to it. On demand paging will allow these developers to register a single MR at the beginning of their process's life, and let the operating system manage which pages needs to be fetched at a given time. In the future, we might be able to provide a single memory access key for each process that would provide the entire process's address as one large memory region, and the developers wouldn't need to register memory regions at all. How does page faults generally work? With pinned memory regions, the driver would map the virtual addresses to bus addresses, and pass these addresses to the HCA to associate them with the new MR. With ODP, the driver is now allowed to mark some of the pages in the MR as not-present. When the HCA attempts to perform memory access for a communication operation, it notices the page is not present, and raises a page fault event to the driver. In addition, the HCA performs whatever operation is required by the transport protocol to suspend communication until the page fault is resolved. Upon receiving the page fault interrupt, the driver first needs to know on which virtual address the page fault occurred, and on what memory key. When handling send/receive operations, this information is inside the work queue. The driver reads the needed work queue elements, and parses them to gather the address and memory key. For other RDMA operations, the event generated by the HCA only contains the virtual address and rkey, as there are no work queue elements involved. Having the rkey, the driver can find the relevant memory region in its data structures, and calculate the actual pages needed to complete the operation. It then uses get_user_pages to retrieve the needed pages back to the
[PATCH v2 12/17] IB/mlx5: Changes in memory region creation to support on-demand paging
This patch wraps together several changes needed for on-demand paging support in the mlx5_ib_populate_pas function, and when registering memory regions. * Instead of accepting a UMR bit telling the function to enable all access flags, the function now accepts the access flags themselves. * For on-demand paging memory regions, fill the memory tables from the correct list, and enable/disable the access flags per-page according to whether the page is present. * A new bit is set to enable writing of access flags when using the firmware create_mkey command. * Disable contig pages when on-demand paging is enabled. In addition the patch changes the UMR code to use PTR_ALIGN instead of our own macro. Signed-off-by: Haggai Eran --- drivers/infiniband/hw/mlx5/mem.c | 58 ++-- drivers/infiniband/hw/mlx5/mlx5_ib.h | 12 +++- drivers/infiniband/hw/mlx5/mr.c | 33 +++- include/linux/mlx5/device.h | 3 ++ 4 files changed, 88 insertions(+), 18 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c index dae07eae9507..5f7b30147180 100644 --- a/drivers/infiniband/hw/mlx5/mem.c +++ b/drivers/infiniband/hw/mlx5/mem.c @@ -32,6 +32,7 @@ #include #include +#include #include "mlx5_ib.h" /* @umem: umem object to scan @@ -57,6 +58,17 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int *count, int *shift, int entry; unsigned long page_shift = ilog2(umem->page_size); + /* With ODP we must always match OS page size. */ + if (umem->odp_data) { + *count = ib_umem_page_count(umem); + *shift = PAGE_SHIFT; + *ncont = *count; + if (order) + *order = ilog2(roundup_pow_of_two(*count)); + + return; + } + addr = addr >> page_shift; tmp = (unsigned long)addr; m = find_first_bit(&tmp, sizeof(tmp)); @@ -108,8 +120,32 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int *count, int *shift, *count = i; } +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING +static u64 umem_dma_to_mtt(dma_addr_t umem_dma) +{ + u64 mtt_entry = umem_dma & ODP_DMA_ADDR_MASK; + + if (umem_dma & ODP_READ_ALLOWED_BIT) + mtt_entry |= MLX5_IB_MTT_READ; + if (umem_dma & ODP_WRITE_ALLOWED_BIT) + mtt_entry |= MLX5_IB_MTT_WRITE; + + return mtt_entry; +} +#endif + +/* + * Populate the given array with bus addresses from the umem. + * + * dev - mlx5_ib device + * umem - umem to use to fill the pages + * page_shift - determines the page size used in the resulting array + * pas - bus addresses array to fill + * access_flags - access flags to set on all present pages. + use enum mlx5_ib_mtt_access_flags for this. + */ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem, - int page_shift, __be64 *pas, int umr) + int page_shift, __be64 *pas, int access_flags) { unsigned long umem_page_shift = ilog2(umem->page_size); int shift = page_shift - umem_page_shift; @@ -120,6 +156,23 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem, int len; struct scatterlist *sg; int entry; +#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING + const bool odp = umem->odp_data != NULL; + + if (odp) { + int num_pages = ib_umem_num_pages(umem); + + WARN_ON(shift != 0); + WARN_ON(access_flags != (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE)); + + for (i = 0; i < num_pages; ++i) { + dma_addr_t pa = umem->odp_data->dma_list[i]; + + pas[i] = cpu_to_be64(umem_dma_to_mtt(pa)); + } + return; + } +#endif i = 0; for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) { @@ -128,8 +181,7 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem, for (k = 0; k < len; k++) { if (!(i & mask)) { cur = base + (k << umem_page_shift); - if (umr) - cur |= 3; + cur |= access_flags; pas[i >> shift] = cpu_to_be64(cur); mlx5_ib_dbg(dev, "pas[%d] 0x%llx\n", diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h index cc50fce8cca7..83c1690e9dd0 100644 --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h @@ -268,6 +268,13 @@ struct mlx5_ib_xrcd { u32 xrcdn; }; +enum mlx5_ib_mtt_access_flags { + MLX5_IB_MTT_READ = (1 << 0), + MLX5_IB_MTT_WRITE = (1 << 1), +}; + +#define MLX5_IB_MTT_PRESENT (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE) +
[PATCH v2 04/17] IB/core: Add umem function to read data from user-space
In some drivers there's a need to read data from a user space area that was pinned using ib_umem, when running from a different process context. The ib_umem_copy_from function allows reading data from the physical pages pinned in the ib_umem struct. Signed-off-by: Haggai Eran --- drivers/infiniband/core/umem.c | 26 ++ include/rdma/ib_umem.h | 2 ++ 2 files changed, 28 insertions(+) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index e0f883292374..77bec75963e7 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -292,3 +292,29 @@ int ib_umem_page_count(struct ib_umem *umem) return n; } EXPORT_SYMBOL(ib_umem_page_count); + +/* + * Copy from the given ib_umem's pages to the given buffer. + * + * umem - the umem to copy from + * offset - offset to start copying from + * dst - destination buffer + * length - buffer length + * + * Returns the number of copied bytes, or an error code. + */ +int ib_umem_copy_from(struct ib_umem *umem, size_t offset, void *dst, + size_t length) +{ + size_t end = offset + length; + + if (offset > umem->length || end > umem->length || end < offset) { + pr_err("ib_umem_copy_from not in range. offset: %zd umem length: %zd end: %zd\n", + offset, umem->length, end); + return -EINVAL; + } + + return sg_pcopy_to_buffer(umem->sg_head.sgl, umem->nmap, dst, length, + offset + ib_umem_offset(umem)); +} +EXPORT_SYMBOL(ib_umem_copy_from); diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h index 7ed6d4ff58dc..ee897724cbf8 100644 --- a/include/rdma/ib_umem.h +++ b/include/rdma/ib_umem.h @@ -84,6 +84,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, size_t size, int access, int dmasync); void ib_umem_release(struct ib_umem *umem); int ib_umem_page_count(struct ib_umem *umem); +int ib_umem_copy_from(struct ib_umem *umem, size_t start, void *dst, + size_t length); #else /* CONFIG_INFINIBAND_USER_MEM */ -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 03/17] IB/core: Replace ib_umem's offset field with a full address
In order to allow umems that do not pin memory we need the umem to keep track of its region's address. This makes the offset field redundant, and so this patch removes it. Signed-off-by: Haggai Eran --- drivers/infiniband/core/umem.c | 6 +++--- drivers/infiniband/hw/amso1100/c2_provider.c | 2 +- drivers/infiniband/hw/ehca/ehca_mrmw.c | 2 +- drivers/infiniband/hw/ipath/ipath_mr.c | 2 +- drivers/infiniband/hw/nes/nes_verbs.c| 4 ++-- drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 2 +- drivers/infiniband/hw/qib/qib_mr.c | 2 +- include/rdma/ib_umem.h | 25 - 8 files changed, 34 insertions(+), 11 deletions(-) diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c index df0c4f605a21..e0f883292374 100644 --- a/drivers/infiniband/core/umem.c +++ b/drivers/infiniband/core/umem.c @@ -103,7 +103,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, umem->context = context; umem->length= size; - umem->offset= addr & ~PAGE_MASK; + umem->address = addr; umem->page_size = PAGE_SIZE; umem->pid = get_task_pid(current, PIDTYPE_PID); /* @@ -132,7 +132,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr, if (!vma_list) umem->hugetlb = 0; - npages = PAGE_ALIGN(size + umem->offset) >> PAGE_SHIFT; + npages = ib_umem_num_pages(umem); down_write(¤t->mm->mmap_sem); @@ -246,7 +246,7 @@ void ib_umem_release(struct ib_umem *umem) if (!mm) goto out; - diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT; + diff = ib_umem_num_pages(umem); /* * We may be called with the mm's mmap_sem already held. This diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c index 2d5cbf4363e4..bdf3507810cb 100644 --- a/drivers/infiniband/hw/amso1100/c2_provider.c +++ b/drivers/infiniband/hw/amso1100/c2_provider.c @@ -476,7 +476,7 @@ static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, c2mr->umem->page_size, i, length, -c2mr->umem->offset, +ib_umem_offset(c2mr->umem), &kva, c2_convert_access(acc), c2mr); diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c index 3488e8c9fcb4..f914b30999f8 100644 --- a/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -399,7 +399,7 @@ reg_user_mr_fallback: pginfo.num_kpages = num_kpages; pginfo.num_hwpages = num_hwpages; pginfo.u.usr.region = e_mr->umem; - pginfo.next_hwpage = e_mr->umem->offset / hwpage_size; + pginfo.next_hwpage = ib_umem_offset(e_mr->umem) / hwpage_size; pginfo.u.usr.next_sg = pginfo.u.usr.region->sg_head.sgl; ret = ehca_reg_mr(shca, e_mr, (u64 *)virt, length, mr_access_flags, e_pd, &pginfo, &e_mr->ib.ib_mr.lkey, diff --git a/drivers/infiniband/hw/ipath/ipath_mr.c b/drivers/infiniband/hw/ipath/ipath_mr.c index 5e61e9bff697..c7278f6a8217 100644 --- a/drivers/infiniband/hw/ipath/ipath_mr.c +++ b/drivers/infiniband/hw/ipath/ipath_mr.c @@ -214,7 +214,7 @@ struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, mr->mr.user_base = start; mr->mr.iova = virt_addr; mr->mr.length = length; - mr->mr.offset = umem->offset; + mr->mr.offset = ib_umem_offset(umem); mr->mr.access_flags = mr_access_flags; mr->mr.max_segs = n; mr->umem = umem; diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c index fef067c959fc..5192fb61e0be 100644 --- a/drivers/infiniband/hw/nes/nes_verbs.c +++ b/drivers/infiniband/hw/nes/nes_verbs.c @@ -2343,7 +2343,7 @@ static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, (unsigned long int)start, (unsigned long int)virt, (u32)length, region->offset, region->page_size); - skip_pages = ((u32)region->offset) >> 12; + skip_pages = ((u32)ib_umem_offset(region)) >> 12; if (ib_copy_from_udata(&req, udata, sizeof(req))) { ib_umem_release(region); @@ -2408,7 +2408,7 @@ static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, u64 start, u64 length, region_length -= skip_pages << 12; for (page_index = skip_pages; page_index < chunk_pages; page_index++) { skip_pa
[PATCH v2 02/17] IB/mlx5: Enhance UMR support to allow partial page table update
The current UMR interface doesn't allow partial updates to a memory region's page tables. This patch changes the interface to allow that. It also changes the way the UMR operation validates the memory region's state. When set, IB_SEND_UMR_FAIL_IF_FREE will cause the UMR operation to fail if the MKEY is in the free state. When it is unchecked the operation will check that it isn't in the free state. Signed-off-by: Haggai Eran Signed-off-by: Shachar Raindel --- drivers/infiniband/hw/mlx5/mlx5_ib.h | 15 ++ drivers/infiniband/hw/mlx5/mr.c | 23 + drivers/infiniband/hw/mlx5/qp.c | 96 +++- include/linux/mlx5/device.h | 9 4 files changed, 100 insertions(+), 43 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h index 29da55222070..53d19e6e69a4 100644 --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h @@ -111,6 +111,8 @@ struct mlx5_ib_pd { */ #define MLX5_IB_SEND_UMR_UNREG IB_SEND_RESERVED_START +#define MLX5_IB_SEND_UMR_FAIL_IF_FREE (IB_SEND_RESERVED_START << 1) +#define MLX5_IB_SEND_UMR_UPDATE_MTT (IB_SEND_RESERVED_START << 2) #define MLX5_IB_QPT_REG_UMRIB_QPT_RESERVED1 #define MLX5_IB_WR_UMR IB_WR_RESERVED1 @@ -206,6 +208,19 @@ enum mlx5_ib_qp_flags { MLX5_IB_QP_SIGNATURE_HANDLING = 1 << 1, }; +struct mlx5_umr_wr { + union { + u64 virt_addr; + u64 offset; + } target; + struct ib_pd *pd; + unsigned intpage_shift; + unsigned intnpages; + u32 length; + int access_flags; + u32 mkey; +}; + struct mlx5_shared_mr_info { int mr_id; struct ib_umem *umem; diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c index 610500810f75..aee3527030ac 100644 --- a/drivers/infiniband/hw/mlx5/mr.c +++ b/drivers/infiniband/hw/mlx5/mr.c @@ -37,6 +37,7 @@ #include #include #include +#include #include "mlx5_ib.h" enum { @@ -675,6 +676,7 @@ static void prep_umr_reg_wqe(struct ib_pd *pd, struct ib_send_wr *wr, { struct mlx5_ib_dev *dev = to_mdev(pd->device); struct ib_mr *mr = dev->umrc.mr; + struct mlx5_umr_wr *umrwr = (struct mlx5_umr_wr *)&wr->wr.fast_reg; sg->addr = dma; sg->length = ALIGN(sizeof(u64) * n, 64); @@ -689,21 +691,24 @@ static void prep_umr_reg_wqe(struct ib_pd *pd, struct ib_send_wr *wr, wr->num_sge = 0; wr->opcode = MLX5_IB_WR_UMR; - wr->wr.fast_reg.page_list_len = n; - wr->wr.fast_reg.page_shift = page_shift; - wr->wr.fast_reg.rkey = key; - wr->wr.fast_reg.iova_start = virt_addr; - wr->wr.fast_reg.length = len; - wr->wr.fast_reg.access_flags = access_flags; - wr->wr.fast_reg.page_list = (struct ib_fast_reg_page_list *)pd; + + umrwr->npages = n; + umrwr->page_shift = page_shift; + umrwr->mkey = key; + umrwr->target.virt_addr = virt_addr; + umrwr->length = len; + umrwr->access_flags = access_flags; + umrwr->pd = pd; } static void prep_umr_unreg_wqe(struct mlx5_ib_dev *dev, struct ib_send_wr *wr, u32 key) { - wr->send_flags = MLX5_IB_SEND_UMR_UNREG; + struct mlx5_umr_wr *umrwr = (struct mlx5_umr_wr *)&wr->wr.fast_reg; + + wr->send_flags = MLX5_IB_SEND_UMR_UNREG | MLX5_IB_SEND_UMR_FAIL_IF_FREE; wr->opcode = MLX5_IB_WR_UMR; - wr->wr.fast_reg.rkey = key; + umrwr->mkey = key; } void mlx5_umr_cq_handler(struct ib_cq *cq, void *cq_context) diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c index e261a53f9a02..7f362afa1a38 100644 --- a/drivers/infiniband/hw/mlx5/qp.c +++ b/drivers/infiniband/hw/mlx5/qp.c @@ -70,15 +70,6 @@ static const u32 mlx5_ib_opcode[] = { [MLX5_IB_WR_UMR]= MLX5_OPCODE_UMR, }; -struct umr_wr { - u64 virt_addr; - struct ib_pd *pd; - unsigned intpage_shift; - unsigned intnpages; - u32 length; - int access_flags; - u32 mkey; -}; static int is_qp0(enum ib_qp_type qp_type) { @@ -1838,37 +1829,70 @@ static void set_frwr_umr_segment(struct mlx5_wqe_umr_ctrl_seg *umr, umr->mkey_mask = frwr_mkey_mask(); } +static __be64 get_umr_reg_mr_mask(void) +{ + u64 result; + + result = MLX5_MKEY_MASK_LEN | +MLX5_MKEY_MASK_PAGE_SIZE | +MLX5_MKEY_MASK_START_ADDR | +MLX5_MKEY_MASK_PD | +MLX5_M
[PATCH v2 06/17] IB/core: Add support for extended query device caps
From: Eli Cohen Add extensible query device capabilities verb to allow adding new features. ib_uverbs_ex_query_device is added and copy_query_dev_fields is used to copy capability fields to be used by both ib_uverbs_query_device and ib_uverbs_ex_query_device. Signed-off-by: Eli Cohen --- drivers/infiniband/core/uverbs.h | 1 + drivers/infiniband/core/uverbs_cmd.c | 121 ++ drivers/infiniband/core/uverbs_main.c | 3 +- include/rdma/ib_verbs.h | 5 +- include/uapi/rdma/ib_user_verbs.h | 12 +++- 5 files changed, 98 insertions(+), 44 deletions(-) diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h index 643c08a025a5..b716b0815644 100644 --- a/drivers/infiniband/core/uverbs.h +++ b/drivers/infiniband/core/uverbs.h @@ -258,5 +258,6 @@ IB_UVERBS_DECLARE_CMD(close_xrcd); IB_UVERBS_DECLARE_EX_CMD(create_flow); IB_UVERBS_DECLARE_EX_CMD(destroy_flow); +IB_UVERBS_DECLARE_EX_CMD(query_device); #endif /* UVERBS_H */ diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c index 5ba2a86aab6a..74ad0d0de92b 100644 --- a/drivers/infiniband/core/uverbs_cmd.c +++ b/drivers/infiniband/core/uverbs_cmd.c @@ -378,6 +378,52 @@ err: return ret; } +static void copy_query_dev_fields(struct ib_uverbs_file *file, + struct ib_uverbs_query_device_resp *resp, + struct ib_device_attr *attr) +{ + resp->fw_ver= attr->fw_ver; + resp->node_guid = file->device->ib_dev->node_guid; + resp->sys_image_guid= attr->sys_image_guid; + resp->max_mr_size = attr->max_mr_size; + resp->page_size_cap = attr->page_size_cap; + resp->vendor_id = attr->vendor_id; + resp->vendor_part_id= attr->vendor_part_id; + resp->hw_ver= attr->hw_ver; + resp->max_qp= attr->max_qp; + resp->max_qp_wr = attr->max_qp_wr; + resp->device_cap_flags = attr->device_cap_flags; + resp->max_sge = attr->max_sge; + resp->max_sge_rd= attr->max_sge_rd; + resp->max_cq= attr->max_cq; + resp->max_cqe = attr->max_cqe; + resp->max_mr= attr->max_mr; + resp->max_pd= attr->max_pd; + resp->max_qp_rd_atom= attr->max_qp_rd_atom; + resp->max_ee_rd_atom= attr->max_ee_rd_atom; + resp->max_res_rd_atom = attr->max_res_rd_atom; + resp->max_qp_init_rd_atom = attr->max_qp_init_rd_atom; + resp->max_ee_init_rd_atom = attr->max_ee_init_rd_atom; + resp->atomic_cap= attr->atomic_cap; + resp->max_ee= attr->max_ee; + resp->max_rdd = attr->max_rdd; + resp->max_mw= attr->max_mw; + resp->max_raw_ipv6_qp = attr->max_raw_ipv6_qp; + resp->max_raw_ethy_qp = attr->max_raw_ethy_qp; + resp->max_mcast_grp = attr->max_mcast_grp; + resp->max_mcast_qp_attach = attr->max_mcast_qp_attach; + resp->max_total_mcast_qp_attach = attr->max_total_mcast_qp_attach; + resp->max_ah= attr->max_ah; + resp->max_fmr = attr->max_fmr; + resp->max_map_per_fmr = attr->max_map_per_fmr; + resp->max_srq = attr->max_srq; + resp->max_srq_wr= attr->max_srq_wr; + resp->max_srq_sge = attr->max_srq_sge; + resp->max_pkeys = attr->max_pkeys; + resp->local_ca_ack_delay= attr->local_ca_ack_delay; + resp->phys_port_cnt = file->device->ib_dev->phys_port_cnt; +} + ssize_t ib_uverbs_query_device(struct ib_uverbs_file *file, const char __user *buf, int in_len, int out_len) @@ -398,47 +444,7 @@ ssize_t ib_uverbs_query_device(struct ib_uverbs_file *file, return ret; memset(&resp, 0, sizeof resp); - - resp.fw_ver= attr.fw_ver; - resp.node_guid = file->device->ib_dev->node_guid; - resp.sys_image_guid= attr.sys_image_guid; - resp.max_mr_size = attr.max_mr_size; - resp.page_size_cap = attr.page_size_cap; - resp.vendor_id = attr.vendor_id; - resp.vendor_part_id= attr.vendor_part_id; - resp.hw_ver= attr.hw_ver; - resp.max_qp= attr.max_qp; - resp.max_qp_wr = attr.max_qp_wr; - resp.device_cap_flags = attr.device_cap_flags; - resp.max_sge = attr.max_sge; - resp.max_sge_rd= attr.max_sge_rd; - resp.max_cq= attr.max_cq; - resp.max_cqe
Re: [PATCH] infiniband-diags: add rdma-ndd daemon
On 11/10/2014 1:32 PM, Weiny, Ira wrote: > I think changing the default is a worthwhile change. In addition, alternate > admin policies are aided by the > general use of the %h specifier. > > 1) SM's which periodically scan the Node Description always get up to > date hostname info. > 2) Up to date hostname info is provided even if a user space daemons > fails. > Note: OpenSM has an update node description feature for this > condition. > 3) Low level diag tools always get up to date hostname info There is a node description local changes trap which should cause an SM to reread the updated NodeDescription. This is assuming the "SMA" is compliant and issues such trap when ND changes and this occurs after LinkUp. ALso, there is a new optional (SMA) feature (@ 1.3) to set the NodeDescription. Not sure if this helps for this. > Do you believe that " " is an unreasonable default? > > Roland, Hal, do you have any input? I was concerned with having the format be self identifying so that it can be easily distinguishable from other known formats being used in the field. As you wrote, given that RedHat is already using this format, we are already "living" with this. -- Hal -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 07/10] xprtrdma: Display async errors
On 11/9/2014 3:15 AM, Chuck Lever wrote: An async error upcall is a hard error, and should be reported in the system log. Could be useful to others... Any chance you put this in ib_core for all of us? Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: FMR Support in multi-function environment
> In SRIOV, FMR is supported only for the PF, not for VFs (since this > feature requires writing directly to mapped ICM memory). Hi, Thank you so much for pointing to the exact code!! I have a related question. I was trying to figure out the use case for FMR in this environment(where in only PF supports FMR). As per my understanding if an application wants to register huge amounts of memory and wants to avoid the overhead of SW2HW_MPT HCR command, it can do so using the alloc fmr verb. Now in the SRIOV case, the application sits on top of the VF driver and I was curious as to how the VF communicates with the PF driver to register memory/map using FMR. More specifically, I was trying to understand how the following function gets called by the VF driver: int mlx4_map_phys_fmr(struct mlx4_dev *dev, struct mlx4_fmr *fmr, u64 *page_list, int npages, u64 iova, u32 *lkey, u32 *rkey) { u32 key; int i, err; err = mlx4_check_fmr(fmr, page_list, npages, iova); if (err) return err; ++fmr->maps; key = key_to_hw_index(fmr->mr.key); key += dev->caps.num_mpts; *lkey = *rkey = fmr->mr.key = hw_index_to_key(key); *(u8 *) fmr->mpt = MLX4_MPT_STATUS_SW; /* Make sure MPT status is visible before writing MTT entries */ wmb(); dma_sync_single_for_cpu(&dev->pdev->dev, fmr->dma_handle, npages * sizeof(u64), DMA_TO_DEVICE); for (i = 0; i < npages; ++i) fmr->mtts[i] = cpu_to_be64(page_list[i] | MLX4_MTT_FLAG_PRESENT); dma_sync_single_for_device(&dev->pdev->dev, fmr->dma_handle, npages * sizeof(u64), DMA_TO_DEVICE); fmr->mpt->key= cpu_to_be32(key); fmr->mpt->lkey = cpu_to_be32(key); fmr->mpt->length = cpu_to_be64(npages * (1ull << fmr->page_shift)); fmr->mpt->start = cpu_to_be64(iova); /* Make MTT entries are visible before setting MPT status */ wmb(); *(u8 *) fmr->mpt = MLX4_MPT_STATUS_HW; /* Make sure MPT status is visible before consumer can use FMR */ wmb(); return 0; } Because the way i understood, VF can communicate with PF driver by posting VHCR commands which cause an event to be generated on the PF side. I can see _WRAPPER calls to handle those cases. As there doesn't seem to be FMR related VHCR command(virtual/para-virtual command), I was struggling to understand how the flow happens for FMR from application->kernel->VF-driver->PF-driver I would be much grateful, if you can help me understand this. Thanks so much!! Your replies really helped me improve my understanding. Best Regards, Bob On Tue, Nov 11, 2014 at 4:24 PM, Jack Morgenstein wrote: > On Mon, 10 Nov 2014 19:58:46 +0530 > Bob Biloxi wrote: > >> Hi, >> >> Is FMR (Fast Memory Regions) supported in a multi-function mode? > > In SRIOV, FMR is supported only for the PF, not for VFs (since this > feature requires writing directly to mapped ICM memory). > > You can see this in file drivers/infiniband/hw/mlx4/main.c, function > mlx4_ib_add() : > > > if (!mlx4_is_slave(ibdev->dev)) { > ibdev->ib_dev.alloc_fmr = mlx4_ib_fmr_alloc; > ibdev->ib_dev.map_phys_fmr = mlx4_ib_map_phys_fmr; > ibdev->ib_dev.unmap_fmr = mlx4_ib_unmap_fmr; > ibdev->ib_dev.dealloc_fmr = mlx4_ib_fmr_dealloc; > } > > i.e., the fmr functions are not put into the device virtual function > table for slave (= VF) devices. > > -Jack > >> >> If yes, I couldn't find the source code for the same in the mlx4 >> codebase. Can anyone please point me to the right location... >> >> What I was trying to understand is this: >> >> Suppose a VF driver wants to register large amount of memory using >> FMR, will it be able to do so using the mlx4 code. >> >> Or FMR is supported only in dedicated mode? >> >> >> Thanks >> >> Best Regards, >> Bob >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-rdma" >> in the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: FMR Support in multi-function environment
On Mon, 10 Nov 2014 19:58:46 +0530 Bob Biloxi wrote: > Hi, > > Is FMR (Fast Memory Regions) supported in a multi-function mode? In SRIOV, FMR is supported only for the PF, not for VFs (since this feature requires writing directly to mapped ICM memory). You can see this in file drivers/infiniband/hw/mlx4/main.c, function mlx4_ib_add() : if (!mlx4_is_slave(ibdev->dev)) { ibdev->ib_dev.alloc_fmr = mlx4_ib_fmr_alloc; ibdev->ib_dev.map_phys_fmr = mlx4_ib_map_phys_fmr; ibdev->ib_dev.unmap_fmr = mlx4_ib_unmap_fmr; ibdev->ib_dev.dealloc_fmr = mlx4_ib_fmr_dealloc; } i.e., the fmr functions are not put into the device virtual function table for slave (= VF) devices. -Jack > > If yes, I couldn't find the source code for the same in the mlx4 > codebase. Can anyone please point me to the right location... > > What I was trying to understand is this: > > Suppose a VF driver wants to register large amount of memory using > FMR, will it be able to do so using the mlx4 code. > > Or FMR is supported only in dedicated mode? > > > Thanks > > Best Regards, > Bob > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" > in the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
ibv_reg_mr fails to register memory
Hi All, I was trying to implement an application that support rdma using librdmacm. When the process is trying to register the buffer using ibv_reg_mr, it failes after registering some number buffer. We are using Mellanox Technologies MT27500 Family [ConnectX-3] cards. Reconfigured options are : 1) cat /sys/module/mlx4_core/parameters/log_num_mtt 24 2) cat /sys/module/mlx4_core/parameters/log_mtts_per_seg 7 Can anyone point out reason behind ibv_reg_mr fails ? Thanks & Regards Rafi KC -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2 for-next 0/9] Peer-Direct support
On Tue, Oct 28, 2014 at 8:29 PM, Coffman, Jerrie L wrote: > [...] to obtain the source for review. Steps are as follows: > Download and extract the latest OFED-3.12-1 release (rc3 in this case) from > OpenFabrics.org > cd OFED-3.12-1-rc3/SRPMS/ > rpm2cpio compat-rdma-3.12-1.1.g9594cac.src.rpm | cpio -vid > tar xzf compat-rdma-3.12.tgz > cd compat-rdma-3.12 > ofed_scripts/ofed_patch.sh --with-patchdir=tech-preview/xeon-phi/ > The code of interest is located in the IB proxy server located in > drivers/infiniband/ibp/drv directory [...] I bet busy upstream kernel maintainers/reviewers would not go that far... can you make it easily reviewable? that is, either send as patches to this list or much easier put the relevant piece in public git tree (github?) > Possible upstreaming plans for CCL direct are still being discussed. Since > the CCL direct code depends on a low level MIC driver called SCIF, we would > have to wait until the group at Intel that owns that driver gets it upstream > before attempting to upstream CCL direct [..] I do see these three config directives set in my upstream clone, and checking reveals they were added in 3.13 # Intel MIC Bus Driver CONFIG_INTEL_MIC_BUS=m # Intel MIC Host Driver CONFIG_INTEL_MIC_HOST=m # Intel MIC Card Driver CONFIG_INTEL_MIC_CARD=m Can you point to the missing elements in the chart present @ this LWN post http://lwn.net/Articles/564795/ Or. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html