Re: [PATCH V2 for-next 0/9] Peer-Direct support

2014-11-11 Thread Coffman, Jerrie L
On Tue, Nov 11, 2014 at 12:44 AM, Or Gerlitz  wrote:
> [...] can you make it easily reviewable? that is, either send as patches to 
> this list or much easier put the relevant piece in public git tree (github?)

While it does take some effort to obtain the code, there are currently no plans 
to submit patches or create a public git tree for CCL direct until all of the 
lower level software is upstream.

> Can you point to the missing elements in the chart present @ this LWN post 
> http://lwn.net/Articles/564795/

The CCL direct code depends on a lower level API called the Symmetric 
Communications Interface (SCIF).  SCIF abstracts the details of communicating 
over the PCIe bus while providing an API that is symmetric between the host and 
coprocessor.  The MIC drivers referenced by the LWN post do not include the 
SCIF API.  Once SCIF is upstream, patches for CCL direct can be submitted.

The point is that there are actual consumers of peer direct that could use it 
if it were merged upstream.

-jerrie



Re: [PATCH 0/3] DAPL support on s390x platform

2014-11-11 Thread Roland Dreier
On Wed, Nov 5, 2014 at 7:04 AM, Utz Bacher  wrote:
> (B) Status of patches
> 1. kernel code -- the new system call: reviewed, acked and accepted by the
> s390x maintainer Martin Schwidefsky (2014/10/13), so we will have that
> system call in the s390x kernel.
> 2. libibverbs -- define barriers on s390x: Looking for your feedback. We
> understand there have been no general objections so far.
> 3. libmlx4 -- provide MMIO abstraction: reviewed by the Mellanox
> maintainers and we understand they would apply this once you give the go
> for the overall set.
> Previously, a patch to DAPL to build on s390x has been accepted already
> (Arlin Davis, 2014/09/02).
>
>   We gave your concern on MMIO handling on s390x serious consideration from
> various angles, but the page fault handler does not appear workable. OTOH,
> Mellanox is fine with the MMIO abstraction in libmlx4, and we didn't hear
> of significant other concerns. With that, could you please consider the
> patch set again to add s390x to the list of supported platforms? Happy to
> repost the patches for convenience.

If Mellanox is willing to take on the maintenance burden of changing
all MMIO access to an inline function, and if you're willing to take
on the burden of knowing that every new adapter you support means
tracking down and convincing the maintainer of the driver library,
then I'm OK with adding the simple barrier patch to libibverbs.  Could
you please send the latest version of that patch to me?

It does seem a little strange to be adding a new system call to
simulate kernel bypass, but I guess you do what you gotta do...
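
For readers following the thread, the abstraction under discussion is roughly of
the following shape: libmlx4's direct doorbell/BlueFlame stores get funneled
through a single inline helper, which s390x can implement via the new system
call while other architectures keep a plain store. This is only an illustration
of the idea; the helper and syscall-wrapper names below are hypothetical, not
the actual libmlx4/libibverbs patches.

    /* Illustrative sketch only -- names are hypothetical. */
    #include <stdint.h>

    static inline void mlx_mmio_write64(void *dst, uint64_t val)
    {
    #if defined(__s390x__)
            /* s390x cannot issue MMIO stores from user space; route the
             * access through the kernel (the new system call above). */
            s390_mmio_write_syscall(dst, &val, sizeof(val));
    #else
            /* Everywhere else this stays an ordinary store to the mapped page. */
            *(volatile uint64_t *)dst = val;
    #endif
    }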

 - R.


Re: [PATCH v2 07/10] xprtrdma: Display async errors

2014-11-11 Thread Or Gerlitz
On Tue, Nov 11, 2014 at 8:49 PM, Sagi Grimberg  wrote:
> On 11/11/2014 6:52 PM, Chuck Lever wrote:
>>
>>
>>> On Nov 11, 2014, at 8:30 AM, Sagi Grimberg 
>>> wrote:
>>>
 On 11/9/2014 3:15 AM, Chuck Lever wrote:
 An async error upcall is a hard error, and should be reported in
 the system log.

>>>
>>> Could be useful to others... Any chance you put this in ib_core for all
>>> of us?
>>
>>
>> Eventually. We certainly wouldn't want copies of this array of strings
>> to appear many times in the kernel. That would be a waste of space.
>>
>> I have a similar patch that adds an array for CQ status codes, and
>> xprtrdma has a string array already for connection status. Are those
>> also interesting?
>>
>
> Yep, also RDMA_CM events. Would certainly help people avoid source
> code navigation to understand what is going on...

Oh yes, Chuck, good if you can pick this up. As far as I remember, most of the
strings are already in the RDS code (net/rds) - please refactor them
from there into some IB core helpers, thanks a lot!
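
As a rough sketch, a shared helper in ib_core could look something like the
following; the table and function names here are illustrative only, not an
existing kernel API.

    /* Sketch only: map IB async event codes to human-readable strings so
     * every ULP does not need its own copy. Names are illustrative. */
    static const char * const ib_async_event_str[] = {
            [IB_EVENT_CQ_ERR]        = "CQ error",
            [IB_EVENT_QP_FATAL]      = "QP fatal error",
            [IB_EVENT_QP_REQ_ERR]    = "QP request error",
            [IB_EVENT_QP_ACCESS_ERR] = "QP access error",
            [IB_EVENT_PATH_MIG_ERR]  = "path migration error",
            [IB_EVENT_DEVICE_FATAL]  = "device fatal error",
            [IB_EVENT_PORT_ERR]      = "port error",
    };

    static inline const char *ib_async_event_msg(enum ib_event_type ev)
    {
            if (ev < ARRAY_SIZE(ib_async_event_str) && ib_async_event_str[ev])
                    return ib_async_event_str[ev];
            return "unrecognized async event";
    }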


Re: [PATCH v2 07/10] xprtrdma: Display async errors

2014-11-11 Thread Sagi Grimberg

On 11/11/2014 6:52 PM, Chuck Lever wrote:



On Nov 11, 2014, at 8:30 AM, Sagi Grimberg  wrote:


On 11/9/2014 3:15 AM, Chuck Lever wrote:
An async error upcall is a hard error, and should be reported in
the system log.



Could be useful to others... Any chance you put this in ib_core for all
of us?


Eventually. We certainly wouldn't want copies of this array of strings
to appear many times in the kernel. That would be a waste of space.

I have a similar patch that adds an array for CQ status codes, and
xprtrdma has a string array already for connection status. Are those
also interesting?



Yep, also RDMA_CM events. Would certainly help people avoid source
code navigation to understand what is going on...


Re: [PATCH v2 07/10] xprtrdma: Display async errors

2014-11-11 Thread Chuck Lever

> On Nov 11, 2014, at 8:30 AM, Sagi Grimberg  wrote:
> 
>> On 11/9/2014 3:15 AM, Chuck Lever wrote:
>> An async error upcall is a hard error, and should be reported in
>> the system log.
>> 
> 
> Could be useful to others... Any chance you put this in ib_core for all
> of us?

Eventually. We certainly wouldn't want copies of this array of strings
to appear many times in the kernel. That would be a waste of space.

I have a similar patch that adds an array for CQ status codes, and
xprtrdma has a string array already for connection status. Are those
also interesting?


[PATCH v2 15/17] IB/mlx5: Handle page faults

2014-11-11 Thread Haggai Eran
This patch implements a page fault handler (leaving the pages pinned for the
time being). The handler covers initiator and responder page faults for the
UD and RC transports, for send/receive operations, as well as RDMA read/write
on the initiator side.

Signed-off-by: Sagi Grimberg 
Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/odp.c | 408 +++
 include/linux/mlx5/qp.h  |   7 +
 2 files changed, 415 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 63bbdba396f1..bd1dbe5ebc15 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -30,6 +30,9 @@
  * SOFTWARE.
  */
 
+#include 
+#include 
+
 #include "mlx5_ib.h"
 
 struct workqueue_struct *mlx5_ib_page_fault_wq;
@@ -85,12 +88,417 @@ static void mlx5_ib_page_fault_resume(struct mlx5_ib_qp 
*qp,
   qp->mqp.qpn);
 }
 
+/*
+ * Handle a single data segment in a page-fault WQE.
+ *
+ * Returns number of pages retrieved on success. The caller will continue to
+ * the next data segment.
+ * Can return the following error codes:
+ * -EAGAIN to designate a temporary error. The caller will abort handling the
+ *  page fault and resolve it.
+ * -EFAULT when there's an error mapping the requested pages. The caller will
+ *  abort the page fault handling and possibly move the QP to an error state.
+ * On other errors the QP should also be closed with an error.
+ */
+static int pagefault_single_data_segment(struct mlx5_ib_qp *qp,
+struct mlx5_ib_pfault *pfault,
+u32 key, u64 io_virt, size_t bcnt,
+u32 *bytes_mapped)
+{
+   struct mlx5_ib_dev *mib_dev = to_mdev(qp->ibqp.pd->device);
+   int srcu_key;
+   unsigned int current_seq;
+   u64 start_idx;
+   int npages = 0, ret = 0;
+   struct mlx5_ib_mr *mr;
+   u64 access_mask = ODP_READ_ALLOWED_BIT;
+
+   srcu_key = srcu_read_lock(&mib_dev->mr_srcu);
+   mr = mlx5_ib_odp_find_mr_lkey(mib_dev, key);
+   /*
+* If we didn't find the MR, it means the MR was closed while we were
+* handling the ODP event. In this case we return -EFAULT so that the
+* QP will be closed.
+*/
+   if (!mr || !mr->ibmr.pd) {
+   pr_err("Failed to find relevant mr for lkey=0x%06x, probably 
the MR was destroyed\n",
+  key);
+   ret = -EFAULT;
+   goto srcu_unlock;
+   }
+   if (!mr->umem->odp_data) {
+   pr_debug("skipping non ODP MR (lkey=0x%06x) in page fault 
handler.\n",
+key);
+   if (bytes_mapped)
+   *bytes_mapped +=
+   (bcnt - pfault->mpfault.bytes_committed);
+   goto srcu_unlock;
+   }
+   if (mr->ibmr.pd != qp->ibqp.pd) {
+   pr_err("Page-fault with different PDs for QP and MR.\n");
+   ret = -EFAULT;
+   goto srcu_unlock;
+   }
+
+   current_seq = ACCESS_ONCE(mr->umem->odp_data->notifiers_seq);
+
+   /*
+* Avoid branches - this code will perform correctly
+* in all iterations (in iteration 2 and above,
+* bytes_committed == 0).
+*/
+   io_virt += pfault->mpfault.bytes_committed;
+   bcnt -= pfault->mpfault.bytes_committed;
+
+   start_idx = (io_virt - (mr->mmr.iova & PAGE_MASK)) >> PAGE_SHIFT;
+
+   if (mr->umem->writable)
+   access_mask |= ODP_WRITE_ALLOWED_BIT;
+   npages = ib_umem_odp_map_dma_pages(mr->umem, io_virt, bcnt,
+  access_mask, current_seq);
+   if (npages < 0) {
+   ret = npages;
+   goto srcu_unlock;
+   }
+
+   if (npages > 0) {
+   mutex_lock(&mr->umem->odp_data->umem_mutex);
+   /*
+* No need to check whether the MTTs really belong to
+* this MR, since ib_umem_odp_map_dma_pages already
+* checks this.
+*/
+   ret = mlx5_ib_update_mtt(mr, start_idx, npages, 0);
+   mutex_unlock(&mr->umem->odp_data->umem_mutex);
+   if (ret < 0) {
+   pr_err("Failed to update mkey page tables\n");
+   goto srcu_unlock;
+   }
+
+   if (bytes_mapped) {
+   u32 new_mappings = npages * PAGE_SIZE -
+   (io_virt - round_down(io_virt, PAGE_SIZE));
+   *bytes_mapped += min_t(u32, new_mappings, bcnt);
+   }
+   }
+
+srcu_unlock:
+   srcu_read_unlock(&mib_dev->mr_srcu, srcu_key);
+   pfault->mpfault.bytes_committed = 0;
+   return ret ? ret : npages;
+}
+
+/**
+ * Parse a series of data segments for page fault handling.
+ *
+ * @qp

[PATCH v2 17/17] IB/mlx5: Implement on demand paging by adding support for MMU notifiers

2014-11-11 Thread Haggai Eran
* Implement the relevant invalidation functions (zap MTTs as needed)
* Implement interlocking (and rollback in the page fault handlers) for cases of 
a racing notifier and fault.
* With this patch we can now enable the capability bits for supporting RC
  send/receive/RDMA read/RDMA write, and UD send.

Signed-off-by: Sagi Grimberg 
Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/main.c|   4 ++
 drivers/infiniband/hw/mlx5/mlx5_ib.h |   3 +
 drivers/infiniband/hw/mlx5/mr.c  |  79 +++--
 drivers/infiniband/hw/mlx5/odp.c | 128 ---
 4 files changed, 198 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index a801baa79c8e..8a87404e9c76 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -574,6 +574,10 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
goto out_count;
}
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   context->ibucontext.invalidate_range = &mlx5_ib_invalidate_range;
+#endif
+
INIT_LIST_HEAD(&context->db_page_list);
mutex_init(&context->db_page_mutex);
 
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index c6ceec3e3d6a..83f22fe297c8 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -325,6 +325,7 @@ struct mlx5_ib_mr {
struct mlx5_ib_dev *dev;
struct mlx5_create_mkey_mbox_out out;
struct mlx5_core_sig_ctx*sig;
+   int live;
 };
 
 struct mlx5_ib_fast_reg_page_list {
@@ -629,6 +630,8 @@ int __init mlx5_ib_odp_init(void);
 void mlx5_ib_odp_cleanup(void);
 void mlx5_ib_qp_disable_pagefaults(struct mlx5_ib_qp *qp);
 void mlx5_ib_qp_enable_pagefaults(struct mlx5_ib_qp *qp);
+void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
+ unsigned long end);
 
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 static inline int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev)
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 9c9e16cca043..a2dd7bfc129b 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "mlx5_ib.h"
 
@@ -54,6 +55,18 @@ static DEFINE_MUTEX(mlx5_ib_update_mtt_emergency_buffer_mutex);
 
 static int clean_mr(struct mlx5_ib_mr *mr);
 
+static int destroy_mkey(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
+{
+   int err = mlx5_core_destroy_mkey(dev->mdev, &mr->mmr);
+
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   /* Wait until all page fault handlers using the mr complete. */
+   synchronize_srcu(&dev->mr_srcu);
+#endif
+
+   return err;
+}
+
 static int order2idx(struct mlx5_ib_dev *dev, int order)
 {
struct mlx5_mr_cache *cache = &dev->cache;
@@ -188,7 +201,7 @@ static void remove_keys(struct mlx5_ib_dev *dev, int c, int 
num)
ent->cur--;
ent->size--;
spin_unlock_irq(&ent->lock);
-   err = mlx5_core_destroy_mkey(dev->mdev, &mr->mmr);
+   err = destroy_mkey(dev, mr);
if (err)
mlx5_ib_warn(dev, "failed destroy mkey\n");
else
@@ -479,7 +492,7 @@ static void clean_keys(struct mlx5_ib_dev *dev, int c)
ent->cur--;
ent->size--;
spin_unlock_irq(&ent->lock);
-   err = mlx5_core_destroy_mkey(dev->mdev, &mr->mmr);
+   err = destroy_mkey(dev, mr);
if (err)
mlx5_ib_warn(dev, "failed destroy mkey\n");
else
@@ -809,6 +822,8 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct 
ib_umem *umem,
mr->mmr.size = len;
mr->mmr.pd = to_mpd(pd)->pdn;
 
+   mr->live = 1;
+
 unmap_dma:
up(&umrc->sem);
dma_unmap_single(ddev, dma, size, DMA_TO_DEVICE);
@@ -994,6 +1009,7 @@ static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, u64 
virt_addr,
goto err_2;
}
mr->umem = umem;
+   mr->live = 1;
mlx5_vfree(in);
 
mlx5_ib_dbg(dev, "mkey = 0x%x\n", mr->mmr.key);
@@ -1071,10 +1087,47 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 
start, u64 length,
mr->ibmr.lkey = mr->mmr.key;
mr->ibmr.rkey = mr->mmr.key;
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   if (umem->odp_data) {
+   /*
+* This barrier prevents the compiler from moving the
+* setting of umem->odp_data->private to point to our
+* MR, before reg_umr finished, to ensure that the MR
+* initialization have finished before starting to
+* handle invalidations.
+ 

[PATCH v2 16/17] IB/mlx5: Add support for RDMA read/write responder page faults

2014-11-11 Thread Haggai Eran
Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/odp.c | 79 
 1 file changed, 79 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index bd1dbe5ebc15..936a6cd4ecc7 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -35,6 +35,8 @@
 
 #include "mlx5_ib.h"
 
+#define MAX_PREFETCH_LEN (4*1024*1024U)
+
 struct workqueue_struct *mlx5_ib_page_fault_wq;
 
 #define COPY_ODP_BIT_MLX_TO_IB(reg, ib_caps, field_name, bit_name) do {
\
@@ -490,6 +492,80 @@ resolve_page_fault:
free_page((unsigned long)buffer);
 }
 
+static int pages_in_range(u64 address, u32 length)
+{
+   return (ALIGN(address + length, PAGE_SIZE) -
+   (address & PAGE_MASK)) >> PAGE_SHIFT;
+}
+
+static void mlx5_ib_mr_rdma_pfault_handler(struct mlx5_ib_qp *qp,
+  struct mlx5_ib_pfault *pfault)
+{
+   struct mlx5_pagefault *mpfault = &pfault->mpfault;
+   u64 address;
+   u32 length;
+   u32 prefetch_len = mpfault->bytes_committed;
+   int prefetch_activated = 0;
+   u32 rkey = mpfault->rdma.r_key;
+   int ret;
+
+   /* The RDMA responder handler handles the page fault in two parts.
+* First it brings the necessary pages for the current packet
+* (and uses the pfault context), and then (after resuming the QP)
+* prefetches more pages. The second operation cannot use the pfault
+* context and therefore uses the dummy_pfault context allocated on
+* the stack */
+   struct mlx5_ib_pfault dummy_pfault = {};
+
+   dummy_pfault.mpfault.bytes_committed = 0;
+
+   mpfault->rdma.rdma_va += mpfault->bytes_committed;
+   mpfault->rdma.rdma_op_len -= min(mpfault->bytes_committed,
+mpfault->rdma.rdma_op_len);
+   mpfault->bytes_committed = 0;
+
+   address = mpfault->rdma.rdma_va;
+   length  = mpfault->rdma.rdma_op_len;
+
+   /* For some operations, the hardware cannot tell the exact message
+* length, and in those cases it reports zero. Use prefetch
+* logic. */
+   if (length == 0) {
+   prefetch_activated = 1;
+   length = mpfault->rdma.packet_size;
+   prefetch_len = min(MAX_PREFETCH_LEN, prefetch_len);
+   }
+
+   ret = pagefault_single_data_segment(qp, pfault, rkey, address, length,
+   NULL);
+   if (ret == -EAGAIN) {
+   /* We're racing with an invalidation, don't prefetch */
+   prefetch_activated = 0;
+   } else if (ret < 0 || pages_in_range(address, length) > ret) {
+   mlx5_ib_page_fault_resume(qp, pfault, 1);
+   return;
+   }
+
+   mlx5_ib_page_fault_resume(qp, pfault, 0);
+
+   /* At this point, there might be a new pagefault already arriving in
+* the eq, switch to the dummy pagefault for the rest of the
+* processing. We're still OK with the objects being alive as the
+* work-queue is being fenced. */
+
+   if (prefetch_activated) {
+   ret = pagefault_single_data_segment(qp, &dummy_pfault, rkey,
+   address,
+   prefetch_len,
+   NULL);
+   if (ret < 0) {
+   pr_warn("Prefetch failed (ret = %d, prefetch_activated 
= %d) for QPN %d, address: 0x%.16llx, length = 0x%.16x\n",
+   ret, prefetch_activated,
+   qp->ibqp.qp_num, address, prefetch_len);
+   }
+   }
+}
+
 void mlx5_ib_mr_pfault_handler(struct mlx5_ib_qp *qp,
   struct mlx5_ib_pfault *pfault)
 {
@@ -499,6 +575,9 @@ void mlx5_ib_mr_pfault_handler(struct mlx5_ib_qp *qp,
case MLX5_PFAULT_SUBTYPE_WQE:
mlx5_ib_mr_wqe_pfault_handler(qp, pfault);
break;
+   case MLX5_PFAULT_SUBTYPE_RDMA:
+   mlx5_ib_mr_rdma_pfault_handler(qp, pfault);
+   break;
default:
pr_warn("Invalid page fault event subtype: 0x%x\n",
event_subtype);
-- 
1.7.11.2



[PATCH v2 13/17] IB/mlx5: Add mlx5_ib_update_mtt to update page tables after creation

2014-11-11 Thread Haggai Eran
The new function allows updating the page tables of a memory region after it
was created. This can be used to handle page faults and page invalidations.

Since mlx5_ib_update_mtt will need to work from within page invalidation, it
must not block on memory allocation. It therefore falls back to a preallocated
emergency buffer when kmalloc(GFP_ATOMIC) fails.
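
A minimal sketch of that fallback pattern (identifiers here are simplified and
illustrative; the actual patch uses a dedicated emergency buffer and mutex in
mr.c):

    #define MTT_EMERGENCY_BUF_SIZE  PAGE_SIZE

    static u8 mtt_emergency_buf[MTT_EMERGENCY_BUF_SIZE];
    static DEFINE_MUTEX(mtt_emergency_buf_mutex);

    /* Callers are assumed to work in chunks no larger than the emergency
     * buffer (the real patch processes the MTT update in bounded chunks). */
    static void *alloc_mtt_buf(size_t size, bool *used_emergency)
    {
            void *buf = kmalloc(size, GFP_ATOMIC);

            *used_emergency = false;
            if (buf)
                    return buf;

            /* We may be invoked from an invalidation path, so we must not
             * wait for memory reclaim; serialize on the shared buffer. */
            mutex_lock(&mtt_emergency_buf_mutex);
            *used_emergency = true;
            return mtt_emergency_buf;
    }

    static void free_mtt_buf(void *buf, bool used_emergency)
    {
            if (used_emergency)
                    mutex_unlock(&mtt_emergency_buf_mutex);
            else
                    kfree(buf);
    }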

In order to reuse code from mlx5_ib_populate_pas, the patch splits that
function and adds the needed parameters.

Signed-off-by: Haggai Eran 
Signed-off-by: Shachar Raindel 
---
 drivers/infiniband/hw/mlx5/mem.c |  19 +++--
 drivers/infiniband/hw/mlx5/mlx5_ib.h |   5 ++
 drivers/infiniband/hw/mlx5/mr.c  | 132 ++-
 include/linux/mlx5/device.h  |   1 +
 4 files changed, 149 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index 5f7b30147180..b56e4c5593ee 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -140,12 +140,16 @@ static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
  * dev - mlx5_ib device
  * umem - umem to use to fill the pages
  * page_shift - determines the page size used in the resulting array
+ * offset - offset into the umem to start from,
+ *  only implemented for ODP umems
+ * num_pages - total number of pages to fill
  * pas - bus addresses array to fill
  * access_flags - access flags to set on all present pages.
  use enum mlx5_ib_mtt_access_flags for this.
  */
-void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
- int page_shift, __be64 *pas, int access_flags)
+void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
+   int page_shift, size_t offset, size_t num_pages,
+   __be64 *pas, int access_flags)
 {
unsigned long umem_page_shift = ilog2(umem->page_size);
int shift = page_shift - umem_page_shift;
@@ -160,13 +164,11 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct 
ib_umem *umem,
const bool odp = umem->odp_data != NULL;
 
if (odp) {
-   int num_pages = ib_umem_num_pages(umem);
-
WARN_ON(shift != 0);
WARN_ON(access_flags != (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE));
 
for (i = 0; i < num_pages; ++i) {
-   dma_addr_t pa = umem->odp_data->dma_list[i];
+   dma_addr_t pa = umem->odp_data->dma_list[offset + i];
 
pas[i] = cpu_to_be64(umem_dma_to_mtt(pa));
}
@@ -194,6 +196,13 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct 
ib_umem *umem,
}
 }
 
+void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
+ int page_shift, __be64 *pas, int access_flags)
+{
+   return __mlx5_ib_populate_pas(dev, umem, page_shift, 0,
+ ib_umem_num_pages(umem), pas,
+ access_flags);
+}
 int mlx5_ib_get_buf_offset(u64 addr, int page_shift, u32 *offset)
 {
u64 page_size;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 83c1690e9dd0..6856e27bfb6a 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -527,6 +527,8 @@ struct ib_mr *mlx5_ib_get_dma_mr(struct ib_pd *pd, int acc);
 struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
  u64 virt_addr, int access_flags,
  struct ib_udata *udata);
+int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index,
+  int npages, int zap);
 int mlx5_ib_dereg_mr(struct ib_mr *ibmr);
 int mlx5_ib_destroy_mr(struct ib_mr *ibmr);
 struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd,
@@ -558,6 +560,9 @@ int mlx5_ib_init_fmr(struct mlx5_ib_dev *dev);
 void mlx5_ib_cleanup_fmr(struct mlx5_ib_dev *dev);
 void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int *count, int *shift,
int *ncont, int *order);
+void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
+   int page_shift, size_t offset, size_t num_pages,
+   __be64 *pas, int access_flags);
 void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
  int page_shift, __be64 *pas, int access_flags);
 void mlx5_ib_copy_pas(u64 *old, u64 *new, int step, int num);
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index fcd531e0758b..e9675325af41 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -44,9 +44,13 @@ enum {
MAX_PENDING_REG_MR = 8,
 };
 
-enum {
-   MLX5_UMR_ALIGN  = 2048
-};
+#define MLX5_UMR_ALIGN 2048
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+static __b

[PATCH v2 10/17] net/mlx5_core: Add support for page faults events and low level handling

2014-11-11 Thread Haggai Eran
* Add a handler function pointer in the mlx5_core_qp struct for page fault
  events. Handle page fault events by calling the handler function, if not
  NULL.
* Add on-demand paging capability query command.
* Export command for resuming QPs after page faults.
* Add various constants related to paging support.
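
The dispatch in the first bullet boils down to the usual "call the callback if
set" pattern, roughly as below; the field and function names are illustrative,
see the qp.c and eq.c hunks in this patch for the real code.

    /* Simplified sketch of the per-QP page-fault dispatch; names are
     * illustrative, not the exact ones used in the patch. */
    static void dispatch_pagefault(struct mlx5_core_dev *dev,
                                   struct mlx5_core_qp *qp,
                                   struct mlx5_pagefault *pfault)
    {
            if (qp && qp->pfault_handler)
                    qp->pfault_handler(qp, pfault); /* set by mlx5_ib for ODP QPs */
            else
                    mlx5_core_warn(dev, "page fault event on QP without a handler\n");
    }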

Signed-off-by: Sagi Grimberg 
Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/mr.c  |   6 +-
 drivers/infiniband/hw/mlx5/qp.c  |   4 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c |  13 ++-
 drivers/net/ethernet/mellanox/mlx5/core/fw.c |  39 +
 drivers/net/ethernet/mellanox/mlx5/core/qp.c | 119 +++
 include/linux/mlx5/device.h  |  58 -
 include/linux/mlx5/driver.h  |  12 +++
 include/linux/mlx5/qp.h  |  55 +
 8 files changed, 299 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index aee3527030ac..d69db8d7d227 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -147,7 +147,7 @@ static int add_keys(struct mlx5_ib_dev *dev, int c, int num)
mr->order = ent->order;
mr->umred = 1;
mr->dev = dev;
-   in->seg.status = 1 << 6;
+   in->seg.status = MLX5_MKEY_STATUS_FREE;
in->seg.xlt_oct_size = cpu_to_be32((npages + 1) / 2);
in->seg.qpn_mkey7_0 = cpu_to_be32(0xff << 8);
in->seg.flags = MLX5_ACCESS_MODE_MTT | MLX5_PERM_UMR_EN;
@@ -1033,7 +1033,7 @@ struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd,
goto err_free;
}
 
-   in->seg.status = 1 << 6; /* free */
+   in->seg.status = MLX5_MKEY_STATUS_FREE;
in->seg.xlt_oct_size = cpu_to_be32(ndescs);
in->seg.qpn_mkey7_0 = cpu_to_be32(0xff << 8);
in->seg.flags_pd = cpu_to_be32(to_mpd(pd)->pdn);
@@ -1148,7 +1148,7 @@ struct ib_mr *mlx5_ib_alloc_fast_reg_mr(struct ib_pd *pd,
goto err_free;
}
 
-   in->seg.status = 1 << 6; /* free */
+   in->seg.status = MLX5_MKEY_STATUS_FREE;
in->seg.xlt_oct_size = cpu_to_be32((max_page_list_len + 1) / 2);
in->seg.qpn_mkey7_0 = cpu_to_be32(0xff << 8);
in->seg.flags = MLX5_PERM_UMR_EN | MLX5_ACCESS_MODE_MTT;
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 455d40779112..d61e4ef73c34 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -1982,7 +1982,7 @@ static void set_mkey_segment(struct mlx5_mkey_seg *seg, 
struct ib_send_wr *wr,
 {
memset(seg, 0, sizeof(*seg));
if (li) {
-   seg->status = 1 << 6;
+   seg->status = MLX5_MKEY_STATUS_FREE;
return;
}
 
@@ -2003,7 +2003,7 @@ static void set_reg_mkey_segment(struct mlx5_mkey_seg 
*seg, struct ib_send_wr *w
 
memset(seg, 0, sizeof(*seg));
if (wr->send_flags & MLX5_IB_SEND_UMR_UNREG) {
-   seg->status = 1 << 6;
+   seg->status = MLX5_MKEY_STATUS_FREE;
return;
}
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index ad2c96a02a53..44cc16d4eff7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -157,6 +157,8 @@ static const char *eqe_type_str(u8 type)
return "MLX5_EVENT_TYPE_CMD";
case MLX5_EVENT_TYPE_PAGE_REQUEST:
return "MLX5_EVENT_TYPE_PAGE_REQUEST";
+   case MLX5_EVENT_TYPE_PAGE_FAULT:
+   return "MLX5_EVENT_TYPE_PAGE_FAULT";
default:
return "Unrecognized event";
}
@@ -279,6 +281,11 @@ static int mlx5_eq_int(struct mlx5_core_dev *dev, struct 
mlx5_eq *eq)
}
break;
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   case MLX5_EVENT_TYPE_PAGE_FAULT:
+   mlx5_eq_pagefault(dev, eqe);
+   break;
+#endif
 
default:
mlx5_core_warn(dev, "Unhandled event 0x%x on EQ 0x%x\n",
@@ -446,8 +453,12 @@ void mlx5_eq_cleanup(struct mlx5_core_dev *dev)
 int mlx5_start_eqs(struct mlx5_core_dev *dev)
 {
struct mlx5_eq_table *table = &dev->priv.eq_table;
+   u32 async_event_mask = MLX5_ASYNC_EVENT_MASK;
int err;
 
+   if (dev->caps.gen.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG)
+   async_event_mask |= (1ull << MLX5_EVENT_TYPE_PAGE_FAULT);
+
err = mlx5_create_map_eq(dev, &table->cmd_eq, MLX5_EQ_VEC_CMD,
 MLX5_NUM_CMD_EQE, 1ull << MLX5_EVENT_TYPE_CMD,
 "mlx5_cmd_eq", &dev->priv.uuari.uars[0]);
@@ -459,7 +470,7 @@ int mlx5_start_eqs(struct mlx5_core_dev *dev)
mlx5_cmd_use_

[PATCH v2 05/17] IB/mlx5: Add function to read WQE from user-space

2014-11-11 Thread Haggai Eran
Add a helper function mlx5_ib_read_user_wqe to read information from
user-space owned work queues. The function will be used in a later patch by
the page-fault handling code in mlx5_ib.

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  2 +
 drivers/infiniband/hw/mlx5/qp.c  | 73 
 include/linux/mlx5/qp.h  |  3 ++
 3 files changed, 78 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 53d19e6e69a4..14a0311eaa1c 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -503,6 +503,8 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr 
*wr,
 int mlx5_ib_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr,
  struct ib_recv_wr **bad_wr);
 void *mlx5_get_send_wqe(struct mlx5_ib_qp *qp, int n);
+int mlx5_ib_read_user_wqe(struct mlx5_ib_qp *qp, int send, int wqe_index,
+ void *buffer, u32 length);
 struct ib_cq *mlx5_ib_create_cq(struct ib_device *ibdev, int entries,
int vector, struct ib_ucontext *context,
struct ib_udata *udata);
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 7f362afa1a38..455d40779112 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -101,6 +101,79 @@ void *mlx5_get_send_wqe(struct mlx5_ib_qp *qp, int n)
return get_wqe(qp, qp->sq.offset + (n << MLX5_IB_SQ_STRIDE));
 }
 
+/**
+ * mlx5_ib_read_user_wqe() - Copy a user-space WQE to kernel space.
+ *
+ * @qp: QP to copy from.
+ * @send: copy from the send queue when non-zero, use the receive queue
+ *   otherwise.
+ * @wqe_index:  index to start copying from. For send work queues, the
+ * wqe_index is in units of MLX5_SEND_WQE_BB.
+ * For receive work queue, it is the number of work queue
+ * element in the queue.
+ * @buffer: destination buffer.
+ * @length: maximum number of bytes to copy.
+ *
+ * Copies at least a single WQE, but may copy more data.
+ *
+ * Return: the number of bytes copied, or an error code.
+ */
+int mlx5_ib_read_user_wqe(struct mlx5_ib_qp *qp, int send, int wqe_index,
+ void *buffer, u32 length)
+{
+   struct ib_device *ibdev = qp->ibqp.device;
+   struct mlx5_ib_dev *dev = to_mdev(ibdev);
+   struct mlx5_ib_wq *wq = send ? &qp->sq : &qp->rq;
+   size_t offset;
+   size_t wq_end;
+   struct ib_umem *umem = qp->umem;
+   u32 first_copy_length;
+   int wqe_length;
+   int copied;
+   int ret;
+
+   if (wq->wqe_cnt == 0) {
+   mlx5_ib_dbg(dev, "mlx5_ib_read_user_wqe for a QP with wqe_cnt 
== 0. qp_type: 0x%x\n",
+   qp->ibqp.qp_type);
+   return -EINVAL;
+   }
+
+   offset = wq->offset + ((wqe_index % wq->wqe_cnt) << wq->wqe_shift);
+   wq_end = wq->offset + (wq->wqe_cnt << wq->wqe_shift);
+
+   if (send && length < sizeof(struct mlx5_wqe_ctrl_seg))
+   return -EINVAL;
+
+   if (offset > umem->length ||
+   (send && offset + sizeof(struct mlx5_wqe_ctrl_seg) > umem->length))
+   return -EINVAL;
+
+   first_copy_length = min_t(u32, offset + length, wq_end) - offset;
+   copied = ib_umem_copy_from(umem, offset, buffer, first_copy_length);
+   if (copied < first_copy_length)
+   return copied;
+
+   if (send) {
+   struct mlx5_wqe_ctrl_seg *ctrl = buffer;
+   int ds = be32_to_cpu(ctrl->qpn_ds) & MLX5_WQE_CTRL_DS_MASK;
+
+   wqe_length = ds * MLX5_WQE_DS_UNITS;
+   } else {
+   wqe_length = 1 << wq->wqe_shift;
+   }
+
+   if (wqe_length <= first_copy_length)
+   return first_copy_length;
+
+   ret = ib_umem_copy_from(umem, wq->offset, buffer + first_copy_length,
+   wqe_length - first_copy_length);
+   if (ret < 0)
+   return ret;
+   copied += ret;
+
+   return copied;
+}
+
 static void mlx5_ib_qp_event(struct mlx5_core_qp *qp, int type)
 {
struct ib_qp *ibqp = &to_mibqp(qp)->ibqp;
diff --git a/include/linux/mlx5/qp.h b/include/linux/mlx5/qp.h
index 3fa075daeb1d..67f4b9660b06 100644
--- a/include/linux/mlx5/qp.h
+++ b/include/linux/mlx5/qp.h
@@ -189,6 +189,9 @@ struct mlx5_wqe_ctrl_seg {
__be32  imm;
 };
 
+#define MLX5_WQE_CTRL_DS_MASK 0x3f
+#define MLX5_WQE_DS_UNITS 16
+
 struct mlx5_wqe_xrc_seg {
__be32  xrc_srqn;
u8  rsvd[12];
-- 
1.7.11.2



[PATCH v2 14/17] IB/mlx5: Page faults handling infrastructure

2014-11-11 Thread Haggai Eran
* Refactor MR registration and cleanup, and fix reg_pages accounting.
* Create a work queue to handle page fault events in a kthread context.
* Register a fault handler to get events from the core for each QP.

The registered fault handler is empty in this patch, and only a later patch
implements it.
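
The second and third bullets amount to the following flow; this is only a
sketch (the per-QP pagefaults[] array name is an assumption), but struct
mlx5_ib_pfault, mlx5_ib_get_pagefault_context and mlx5_ib_page_fault_wq all
appear in this series.

    /* Sketch: the handler registered with the core only queues work, so the
     * actual fault resolution runs on mlx5_ib_page_fault_wq in a context
     * that may sleep. qp->pagefaults[] is an assumed field name for the
     * per-QP, per-context work items (set up with INIT_WORK at QP creation). */
    static void mlx5_ib_qp_pfault_handler_sketch(struct mlx5_ib_qp *qp,
                                                 struct mlx5_pagefault *mpfault)
    {
            struct mlx5_ib_pfault *pfault =
                    &qp->pagefaults[mlx5_ib_get_pagefault_context(mpfault)];

            pfault->mpfault = *mpfault;
            queue_work(mlx5_ib_page_fault_wq, &pfault->work);
    }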

Signed-off-by: Sagi Grimberg 
Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/main.c|  31 +++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  67 +++-
 drivers/infiniband/hw/mlx5/mr.c  |  45 +++
 drivers/infiniband/hw/mlx5/odp.c | 145 +++
 drivers/infiniband/hw/mlx5/qp.c  |  26 ++-
 include/linux/mlx5/driver.h  |   2 +-
 6 files changed, 294 insertions(+), 22 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index e6d775f2446d..a801baa79c8e 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -864,7 +864,7 @@ static ssize_t show_reg_pages(struct device *device,
struct mlx5_ib_dev *dev =
container_of(device, struct mlx5_ib_dev, ib_dev.dev);
 
-   return sprintf(buf, "%d\n", dev->mdev->priv.reg_pages);
+   return sprintf(buf, "%d\n", atomic_read(&dev->mdev->priv.reg_pages));
 }
 
 static ssize_t show_hca(struct device *device, struct device_attribute *attr,
@@ -1389,16 +1389,19 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
goto err_eqs;
 
mutex_init(&dev->cap_mask_mutex);
-   spin_lock_init(&dev->mr_lock);
 
err = create_dev_resources(&dev->devr);
if (err)
goto err_eqs;
 
-   err = ib_register_device(&dev->ib_dev, NULL);
+   err = mlx5_ib_odp_init_one(dev);
if (err)
goto err_rsrc;
 
+   err = ib_register_device(&dev->ib_dev, NULL);
+   if (err)
+   goto err_odp;
+
err = create_umr_res(dev);
if (err)
goto err_dev;
@@ -1420,6 +1423,9 @@ err_umrc:
 err_dev:
ib_unregister_device(&dev->ib_dev);
 
+err_odp:
+   mlx5_ib_odp_remove_one(dev);
+
 err_rsrc:
destroy_dev_resources(&dev->devr);
 
@@ -1435,8 +1441,10 @@ err_dealloc:
 static void mlx5_ib_remove(struct mlx5_core_dev *mdev, void *context)
 {
struct mlx5_ib_dev *dev = context;
+
ib_unregister_device(&dev->ib_dev);
destroy_umrc_res(dev);
+   mlx5_ib_odp_remove_one(dev);
destroy_dev_resources(&dev->devr);
free_comp_eqs(dev);
ib_dealloc_device(&dev->ib_dev);
@@ -1450,15 +1458,30 @@ static struct mlx5_interface mlx5_ib_interface = {
 
 static int __init mlx5_ib_init(void)
 {
+   int err;
+
if (deprecated_prof_sel != 2)
pr_warn("prof_sel is deprecated for mlx5_ib, set it for 
mlx5_core\n");
 
-   return mlx5_register_interface(&mlx5_ib_interface);
+   err = mlx5_ib_odp_init();
+   if (err)
+   return err;
+
+   err = mlx5_register_interface(&mlx5_ib_interface);
+   if (err)
+   goto clean_odp;
+
+   return err;
+
+clean_odp:
+   mlx5_ib_odp_cleanup();
+   return err;
 }
 
 static void __exit mlx5_ib_cleanup(void)
 {
mlx5_unregister_interface(&mlx5_ib_interface);
+   mlx5_ib_odp_cleanup();
 }
 
 module_init(mlx5_ib_init);
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 6856e27bfb6a..c6ceec3e3d6a 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -149,6 +149,29 @@ enum {
MLX5_QP_EMPTY
 };
 
+/*
+ * Connect-IB can trigger up to four concurrent pagefaults
+ * per-QP.
+ */
+enum mlx5_ib_pagefault_context {
+   MLX5_IB_PAGEFAULT_RESPONDER_READ,
+   MLX5_IB_PAGEFAULT_REQUESTOR_READ,
+   MLX5_IB_PAGEFAULT_RESPONDER_WRITE,
+   MLX5_IB_PAGEFAULT_REQUESTOR_WRITE,
+   MLX5_IB_PAGEFAULT_CONTEXTS
+};
+
+static inline enum mlx5_ib_pagefault_context
+   mlx5_ib_get_pagefault_context(struct mlx5_pagefault *pagefault)
+{
+   return pagefault->flags & (MLX5_PFAULT_REQUESTOR | MLX5_PFAULT_WRITE);
+}
+
+struct mlx5_ib_pfault {
+   struct work_struct  work;
+   struct mlx5_pagefault   mpfault;
+};
+
 struct mlx5_ib_qp {
struct ib_qpibqp;
struct mlx5_core_qp mqp;
@@ -194,6 +217,21 @@ struct mlx5_ib_qp {
 
/* Store signature errors */
boolsignature_en;
+
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   /*
+* A flag that is true for QP's that are in a state that doesn't
+* allow page faults, and shouldn't schedule any more faults.
+*/
+   int disable_page_faults;
+   /*
+* The disable_page_faults_lock protects a QP's disable_page_faults
+* field, allowing for a thread to atomically check whether the QP
+* allows page faults, and if so schedule a page fault.
+   

[PATCH v2 01/17] IB/mlx5: Remove per-MR pas and dma pointers

2014-11-11 Thread Haggai Eran
Since the UMR code now uses its own context struct on the stack, the pas and dma
pointers for the UMR operation that remained in the mlx5_ib_mr struct are not
necessary.  This patch removes them.

Fixes: a74d24168d2d ("IB/mlx5: Refactor UMR to have its own context struct")
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  2 --
 drivers/infiniband/hw/mlx5/mr.c  | 21 -
 2 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 386780f0d1e1..29da55222070 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -261,8 +261,6 @@ struct mlx5_ib_mr {
struct list_headlist;
int order;
int umred;
-   __be64  *pas;
-   dma_addr_t  dma;
int npages;
struct mlx5_ib_dev *dev;
struct mlx5_create_mkey_mbox_out out;
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 8ee7cb46e059..610500810f75 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -740,6 +740,8 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct 
ib_umem *umem,
struct mlx5_ib_mr *mr;
struct ib_sge sg;
int size = sizeof(u64) * npages;
+   __be64 *mr_pas;
+   dma_addr_t dma;
int err = 0;
int i;
 
@@ -758,25 +760,26 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, 
struct ib_umem *umem,
if (!mr)
return ERR_PTR(-EAGAIN);
 
-   mr->pas = kmalloc(size + MLX5_UMR_ALIGN - 1, GFP_KERNEL);
-   if (!mr->pas) {
+   mr_pas = kmalloc(size + MLX5_UMR_ALIGN - 1, GFP_KERNEL);
+   if (!mr_pas) {
err = -ENOMEM;
goto free_mr;
}
 
mlx5_ib_populate_pas(dev, umem, page_shift,
-mr_align(mr->pas, MLX5_UMR_ALIGN), 1);
+mr_align(mr_pas, MLX5_UMR_ALIGN), 1);
 
-   mr->dma = dma_map_single(ddev, mr_align(mr->pas, MLX5_UMR_ALIGN), size,
-DMA_TO_DEVICE);
-   if (dma_mapping_error(ddev, mr->dma)) {
+   dma = dma_map_single(ddev, mr_align(mr_pas, MLX5_UMR_ALIGN), size,
+DMA_TO_DEVICE);
+   if (dma_mapping_error(ddev, dma)) {
err = -ENOMEM;
goto free_pas;
}
 
memset(&wr, 0, sizeof(wr));
wr.wr_id = (u64)(unsigned long)&umr_context;
-   prep_umr_reg_wqe(pd, &wr, &sg, mr->dma, npages, mr->mmr.key, page_shift, virt_addr, len, access_flags);
+   prep_umr_reg_wqe(pd, &wr, &sg, dma, npages, mr->mmr.key, page_shift,
+virt_addr, len, access_flags);
 
mlx5_ib_init_umr_context(&umr_context);
down(&umrc->sem);
@@ -798,10 +801,10 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, 
struct ib_umem *umem,
 
 unmap_dma:
up(&umrc->sem);
-   dma_unmap_single(ddev, mr->dma, size, DMA_TO_DEVICE);
+   dma_unmap_single(ddev, dma, size, DMA_TO_DEVICE);
 
 free_pas:
-   kfree(mr->pas);
+   kfree(mr_pas);
 
 free_mr:
if (err) {
-- 
1.7.11.2



[PATCH v2 07/17] IB/core: Add flags for on demand paging support

2014-11-11 Thread Haggai Eran
From: Sagi Grimberg 

* Add a configuration option to enable on-demand paging support in the
  infiniband subsystem (CONFIG_INFINIBAND_ON_DEMAND_PAGING). In a later patch,
  this configuration option will select the MMU_NOTIFIER configuration option
  to enable mmu notifiers.
* Add a flag for on demand paging (ODP) support in the IB device capabilities.
* Add a flag to request ODP MR in the access flags to reg_mr.
* Fail registrations done with the ODP flag when the low-level driver doesn't
  support this.
* Change the conditions in which an MR will be writable to explicitly
  specify the access flags. This is to avoid making an MR writable just
  because it is an ODP MR.
* Add ODP capabilities to the extended query device verb.
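
From user space, the registration-side flag above would be used roughly as
follows. This is a sketch assuming a libibverbs that exposes the matching
IBV_ACCESS_ON_DEMAND flag; on devices or drivers without ODP support the
registration fails, matching the kernel check added in this patch.

    #include <infiniband/verbs.h>

    /* Sketch: register an on-demand-paging memory region. */
    struct ibv_mr *register_odp_mr(struct ibv_pd *pd, void *buf, size_t len)
    {
            return ibv_reg_mr(pd, buf, len,
                              IBV_ACCESS_LOCAL_WRITE |
                              IBV_ACCESS_REMOTE_WRITE |
                              IBV_ACCESS_ON_DEMAND); /* pages fetched on fault */
    }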

Signed-off-by: Sagi Grimberg 
Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/Kconfig   | 10 ++
 drivers/infiniband/core/umem.c   |  8 +---
 drivers/infiniband/core/uverbs_cmd.c | 25 +
 include/rdma/ib_verbs.h  | 28 ++--
 include/uapi/rdma/ib_user_verbs.h| 16 
 5 files changed, 82 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 77089399359b..089a2c2af329 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -38,6 +38,16 @@ config INFINIBAND_USER_MEM
depends on INFINIBAND_USER_ACCESS != n
default y
 
+config INFINIBAND_ON_DEMAND_PAGING
+   bool "InfiniBand on-demand paging support"
+   depends on INFINIBAND_USER_MEM
+   default y
+   ---help---
+ On demand paging support for the InfiniBand subsystem.
+ Together with driver support this allows registration of
+ memory regions without pinning their pages, fetching the
+ pages on demand instead.
+
 config INFINIBAND_ADDR_TRANS
bool
depends on INFINIBAND
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 77bec75963e7..a140b2d4d94e 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -107,13 +107,15 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
umem->page_size = PAGE_SIZE;
umem->pid   = get_task_pid(current, PIDTYPE_PID);
/*
-* We ask for writable memory if any access flags other than
-* "remote read" are set.  "Local write" and "remote write"
+* We ask for writable memory if any of the following
+* access flags are set.  "Local write" and "remote write"
 * obviously require write access.  "Remote atomic" can do
 * things like fetch and add, which will modify memory, and
 * "MW bind" can change permissions by binding a window.
 */
-   umem->writable  = !!(access & ~IB_ACCESS_REMOTE_READ);
+   umem->writable  = !!(access &
+   (IB_ACCESS_LOCAL_WRITE   | IB_ACCESS_REMOTE_WRITE |
+IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_MW_BIND));
 
/* We assume the memory is from hugetlb until proved otherwise */
umem->hugetlb   = 1;
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 74ad0d0de92b..46b60086a4bf 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -953,6 +953,18 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file,
goto err_free;
}
 
+   if (cmd.access_flags & IB_ACCESS_ON_DEMAND) {
+   struct ib_device_attr attr;
+
+   ret = ib_query_device(pd->device, &attr);
+   if (ret || !(attr.device_cap_flags &
+   IB_DEVICE_ON_DEMAND_PAGING)) {
+   pr_debug("ODP support not available\n");
+   ret = -EINVAL;
+   goto err_put;
+   }
+   }
+
mr = pd->device->reg_user_mr(pd, cmd.start, cmd.length, cmd.hca_va,
 cmd.access_flags, &udata);
if (IS_ERR(mr)) {
@@ -3286,6 +3298,19 @@ int ib_uverbs_ex_query_device(struct ib_uverbs_file 
*file,
copy_query_dev_fields(file, &resp.base, &attr);
resp.comp_mask = 0;
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   if (cmd.comp_mask & IB_USER_VERBS_EX_QUERY_DEVICE_ODP) {
+   resp.odp_caps.general_caps = attr.odp_caps.general_caps;
+   resp.odp_caps.per_transport_caps.rc_odp_caps =
+   attr.odp_caps.per_transport_caps.rc_odp_caps;
+   resp.odp_caps.per_transport_caps.uc_odp_caps =
+   attr.odp_caps.per_transport_caps.uc_odp_caps;
+   resp.odp_caps.per_transport_caps.ud_odp_caps =
+   attr.odp_caps.per_transport_caps.ud_odp_caps;
+   resp.comp_mask |= IB_USER_VERBS_EX_QUERY_DEVICE_ODP;
+   }
+#endif
+
err = ib_copy_to_udata(ucore, &res

[PATCH v2 11/17] IB/mlx5: Implement the ODP capability query verb

2014-11-11 Thread Haggai Eran
The patch adds infrastructure to query ODP capabilities in the
mlx5 driver. The code reads the capabilities from the device and enables
only those capabilities that both the driver and the device support.
At this point ODP is not supported, so no capability is copied from the
device, but the patch exposes the global ODP device capability bit.
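
In other words, the query helper intersects what the firmware reports with
what the driver code is prepared to handle; schematically it does something
like the snippet below (illustrative only - read_hw_odp_caps_*() are
hypothetical placeholders, and the real helper is
mlx5_ib_internal_query_odp_caps in odp.c).

    /* Schematic only: AND the device-reported per-transport ODP bits with
     * the operations this driver version implements. */
    static void query_odp_caps_sketch(struct mlx5_ib_dev *dev)
    {
            u32 driver_rc_caps = 0; /* nothing implemented yet at this point */
            u32 driver_ud_caps = 0;

            dev->odp_caps.per_transport_caps.rc_odp_caps =
                    read_hw_odp_caps_rc(dev) & driver_rc_caps;
            dev->odp_caps.per_transport_caps.ud_odp_caps =
                    read_hw_odp_caps_ud(dev) & driver_ud_caps;
    }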

Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/Makefile  |  1 +
 drivers/infiniband/hw/mlx5/main.c| 10 ++
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 12 
 drivers/infiniband/hw/mlx5/odp.c | 60 
 4 files changed, 83 insertions(+)
 create mode 100644 drivers/infiniband/hw/mlx5/odp.c

diff --git a/drivers/infiniband/hw/mlx5/Makefile b/drivers/infiniband/hw/mlx5/Makefile
index 4ea0135af484..27a70159e2ea 100644
--- a/drivers/infiniband/hw/mlx5/Makefile
+++ b/drivers/infiniband/hw/mlx5/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_MLX5_INFINIBAND)  += mlx5_ib.o
 
 mlx5_ib-y :=   main.o cq.o doorbell.o qp.o mem.o srq.o mr.o ah.o mad.o
+mlx5_ib-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += odp.o
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 1ba6c42e4df8..e6d775f2446d 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -244,6 +244,12 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
   props->max_mcast_grp;
props->max_map_per_fmr = INT_MAX; /* no limit in ConnectIB */
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   if (dev->mdev->caps.gen.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG)
+   props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
+   props->odp_caps = dev->odp_caps;
+#endif
+
 out:
kfree(in_mad);
kfree(out_mad);
@@ -1321,6 +1327,8 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
(1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) |
(1ull << IB_USER_VERBS_CMD_CREATE_XSRQ) |
(1ull << IB_USER_VERBS_CMD_OPEN_QP);
+   dev->ib_dev.uverbs_ex_cmd_mask =
+   (1ull << IB_USER_VERBS_EX_CMD_QUERY_DEVICE);
 
dev->ib_dev.query_device= mlx5_ib_query_device;
dev->ib_dev.query_port  = mlx5_ib_query_port;
@@ -1366,6 +1374,8 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
dev->ib_dev.free_fast_reg_page_list  = mlx5_ib_free_fast_reg_page_list;
dev->ib_dev.check_mr_status = mlx5_ib_check_mr_status;
 
+   mlx5_ib_internal_query_odp_caps(dev);
+
if (mdev->caps.gen.flags & MLX5_DEV_CAP_FLAG_XRC) {
dev->ib_dev.alloc_xrcd = mlx5_ib_alloc_xrcd;
dev->ib_dev.dealloc_xrcd = mlx5_ib_dealloc_xrcd;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 14a0311eaa1c..cc50fce8cca7 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -390,6 +390,9 @@ struct mlx5_ib_dev {
struct mlx5_mr_cachecache;
struct timer_list   delay_timer;
int fill_delay;
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   struct ib_odp_caps  odp_caps;
+#endif
 };
 
 static inline struct mlx5_ib_cq *to_mibcq(struct mlx5_core_cq *mcq)
@@ -559,6 +562,15 @@ void mlx5_umr_cq_handler(struct ib_cq *cq, void 
*cq_context);
 int mlx5_ib_check_mr_status(struct ib_mr *ibmr, u32 check_mask,
struct ib_mr_status *mr_status);
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev);
+#else
+static inline int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev)
+{
+   return 0;
+}
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+
 static inline void init_query_mad(struct ib_smp *mad)
 {
mad->base_version  = 1;
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
new file mode 100644
index ..66c39ee16aff
--- /dev/null
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -0,0 +1,60 @@
+/*
+ * Copyright (c) 2014 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *  

[PATCH v2 09/17] IB/core: Implement support for MMU notifiers regarding on demand paging regions

2014-11-11 Thread Haggai Eran
* Add an interval tree implementation for ODP umems. Create an interval tree
  for each ucontext (including a count of the number of ODP MRs in this
  context, semaphore, etc.), and register ODP umems in the interval tree.
* Add MMU notifiers handling functions, using the interval tree to notify only
  the relevant umems and underlying MRs.
* Register to receive MMU notifier events from the MM subsystem upon ODP MR
  registration (and unregister accordingly).
* Add a completion object to synchronize the destruction of ODP umems.
* Add mechanism to abort page faults when there's a concurrent invalidation.

The way we synchronize between concurrent invalidations and page faults is by
keeping a counter of currently running invalidations, and a sequence number
that is incremented whenever an invalidation is caught. The page fault code
checks the counter and also verifies that the sequence number hasn't
progressed before it updates the umem's page tables. This is similar to what
the kvm module does.

In order to prevent the case where we register a umem in the middle of an
ongoing notifier, we also keep a per ucontext counter of the total number of
active mmu notifiers. We only enable new umems when all the running notifiers
complete.
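
The fault-side check described above looks roughly like this simplified
sketch; the driver-side equivalent is pagefault_single_data_segment later in
the series, and the helper names below follow how that code calls them.

    /* Sketch of the fault/invalidation interlock: sample the sequence
     * number, map the pages, and let the helper refuse the update (-EAGAIN)
     * if an invalidation ran or is running concurrently, so the caller can
     * retry. */
    static int odp_fault_one_range(struct ib_umem *umem, u64 addr, size_t len,
                                   u64 access_mask)
    {
            unsigned int seq = ACCESS_ONCE(umem->odp_data->notifiers_seq);
            int npages = ib_umem_odp_map_dma_pages(umem, addr, len,
                                                   access_mask, seq);

            if (npages < 0)
                    return npages;  /* -EAGAIN: racing invalidation, retry */

            mutex_lock(&umem->odp_data->umem_mutex);
            /* Update the device page tables here (e.g. mlx5_ib_update_mtt);
             * holding umem_mutex keeps a new invalidation from sneaking in
             * between the check and the update. */
            mutex_unlock(&umem->odp_data->umem_mutex);

            return npages;
    }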

Signed-off-by: Sagi Grimberg 
Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
Signed-off-by: Yuval Dagan 
---
 drivers/infiniband/Kconfig|   1 +
 drivers/infiniband/core/Makefile  |   2 +-
 drivers/infiniband/core/umem.c|   2 +-
 drivers/infiniband/core/umem_odp.c| 380 +-
 drivers/infiniband/core/umem_rbtree.c |  94 +
 drivers/infiniband/core/uverbs_cmd.c  |  17 ++
 include/rdma/ib_umem_odp.h|  65 +-
 include/rdma/ib_verbs.h   |  19 ++
 8 files changed, 567 insertions(+), 13 deletions(-)
 create mode 100644 drivers/infiniband/core/umem_rbtree.c

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 089a2c2af329..b899531498eb 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -41,6 +41,7 @@ config INFINIBAND_USER_MEM
 config INFINIBAND_ON_DEMAND_PAGING
bool "InfiniBand on-demand paging support"
depends on INFINIBAND_USER_MEM
+   select MMU_NOTIFIER
default y
---help---
  On demand paging support for the InfiniBand subsystem.
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index c58f7913c560..acf736764445 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -11,7 +11,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=   ib_uverbs.o ib_ucm.o \
 ib_core-y :=   packer.o ud_header.o verbs.o sysfs.o \
device.o fmr_pool.o cache.o netlink.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
-ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o
+ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
 
 ib_mad-y :=mad.o smi.o agent.o mad_rmpp.o
 
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 45d7794c7a2b..2a173ae3522e 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -72,7 +72,7 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
  * ib_umem_get - Pin and DMA map userspace memory.
  *
  * If access flags indicate ODP memory, avoid pinning. Instead, stores
- * the mm for future page fault handling.
+ * the mm for future page fault handling in conjunction with MMU notifiers.
  *
  * @context: userspace context to pin memory for
  * @addr: userspace virtual address to start at
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index b1a6a44439a2..6095872549e7 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -41,26 +41,235 @@
 #include 
 #include 
 
+static void ib_umem_notifier_start_account(struct ib_umem *item)
+{
+   mutex_lock(&item->odp_data->umem_mutex);
+
+   /* Only update private counters for this umem if it has them.
+* Otherwise skip it. All page faults will be delayed for this umem. */
+   if (item->odp_data->mn_counters_active) {
+   int notifiers_count = item->odp_data->notifiers_count++;
+
+   if (notifiers_count == 0)
+   /* Initialize the completion object for waiting on
+* notifiers. Since notifier_count is zero, no one
+* should be waiting right now. */
+   reinit_completion(&item->odp_data->notifier_completion);
+   }
+   mutex_unlock(&item->odp_data->umem_mutex);
+}
+
+static void ib_umem_notifier_end_account(struct ib_umem *item)
+{
+   mutex_lock(&item->odp_data->umem_mutex);
+
+   /* Only update private counters for this umem if it has them.
+* Otherwise skip it. All page faults will be de

[PATCH v2 08/17] IB/core: Add support for on demand paging regions

2014-11-11 Thread Haggai Eran
From: Shachar Raindel 

* Extend the umem struct to keep the ODP related data.
* Allocate and initialize the ODP related information in the umem
  (page_list, dma_list) and freeing as needed in the end of the run.
* Store a reference to the process PID struct in the ucontext. Used to
  safely obtain the task_struct and the mm during fault handling, without
  preventing the task destruction if needed.
* Add 2 helper functions: ib_umem_odp_map_dma_pages and
  ib_umem_odp_unmap_dma_pages. These functions get the DMA addresses of
  specific pages of the umem (and, currently, pin them).
* Support for page faults only - IB core will keep the reference on the pages
  used and call put_page when freeing an ODP umem area. Invalidations support
  will be added in a later patch.
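
As a usage sketch, a low-level driver brackets the lifetime of the
device-visible pages with the two helpers roughly as follows; argument names
here follow how the mlx5 fault handler later in this series calls them, and
the function is illustrative only.

    /* Sketch only. Fault path: bring in, pin (for now) and DMA-map the pages
     * backing [io_virt, io_virt + bcnt). Teardown path: unmap and unpin the
     * same range again. */
    static int odp_map_unmap_sketch(struct ib_umem *umem, u64 io_virt,
                                    size_t bcnt, unsigned int current_seq)
    {
            int npages = ib_umem_odp_map_dma_pages(umem, io_virt, bcnt,
                                                   ODP_READ_ALLOWED_BIT,
                                                   current_seq);
            if (npages < 0)
                    return npages;

            /* ... program the HCA page tables with the returned mappings ... */

            ib_umem_odp_unmap_dma_pages(umem, io_virt, io_virt + bcnt);
            return npages;
    }
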

Signed-off-by: Sagi Grimberg 
Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
Signed-off-by: Majd Dibbiny 
---
 drivers/infiniband/core/Makefile  |   1 +
 drivers/infiniband/core/umem.c|  24 +++
 drivers/infiniband/core/umem_odp.c| 308 ++
 drivers/infiniband/core/uverbs_cmd.c  |   5 +
 drivers/infiniband/core/uverbs_main.c |   2 +
 include/rdma/ib_umem.h|   2 +
 include/rdma/ib_umem_odp.h|  97 +++
 include/rdma/ib_verbs.h   |   2 +
 8 files changed, 441 insertions(+)
 create mode 100644 drivers/infiniband/core/umem_odp.c
 create mode 100644 include/rdma/ib_umem_odp.h

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index ffd0af6734af..c58f7913c560 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=   ib_uverbs.o ib_ucm.o \
 ib_core-y :=   packer.o ud_header.o verbs.o sysfs.o \
device.o fmr_pool.o cache.o netlink.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
+ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o
 
 ib_mad-y :=mad.o smi.o agent.o mad_rmpp.o
 
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a140b2d4d94e..45d7794c7a2b 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "uverbs.h"
 
@@ -69,6 +70,10 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
 
 /**
  * ib_umem_get - Pin and DMA map userspace memory.
+ *
+ * If access flags indicate ODP memory, avoid pinning. Instead, stores
+ * the mm for future page fault handling.
+ *
  * @context: userspace context to pin memory for
  * @addr: userspace virtual address to start at
  * @size: length of region to pin
@@ -117,6 +122,17 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
(IB_ACCESS_LOCAL_WRITE   | IB_ACCESS_REMOTE_WRITE |
 IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_MW_BIND));
 
+   if (access & IB_ACCESS_ON_DEMAND) {
+   ret = ib_umem_odp_get(context, umem);
+   if (ret) {
+   kfree(umem);
+   return ERR_PTR(ret);
+   }
+   return umem;
+   }
+
+   umem->odp_data = NULL;
+
/* We assume the memory is from hugetlb until proved otherwise */
umem->hugetlb   = 1;
 
@@ -237,6 +253,11 @@ void ib_umem_release(struct ib_umem *umem)
struct task_struct *task;
unsigned long diff;
 
+   if (umem->odp_data) {
+   ib_umem_odp_release(umem);
+   return;
+   }
+
__ib_umem_release(umem->context->device, umem, 1);
 
task = get_pid_task(umem->pid, PIDTYPE_PID);
@@ -285,6 +306,9 @@ int ib_umem_page_count(struct ib_umem *umem)
int n;
struct scatterlist *sg;
 
+   if (umem->odp_data)
+   return ib_umem_num_pages(umem);
+
shift = ilog2(umem->page_size);
 
n = 0;
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
new file mode 100644
index ..b1a6a44439a2
--- /dev/null
+++ b/drivers/infiniband/core/umem_odp.c
@@ -0,0 +1,308 @@
+/*
+ * Copyright (c) 2014 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ * 

[PATCH v2 00/17] On demand paging

2014-11-11 Thread Haggai Eran
Hi Roland,

Following your comments, we have modified the patch set to eliminate the
possibility of a negative notifiers counter, and removed the atomic accesses
where possible.  The new set have been rebased against upstream, and also
contains some minor fixes detailed below.

We have isolated 5 patches that we think can be taken as a preliminary patches
for the rest of the series. These are the first 5 patches:
IB/mlx5: Remove per-MR pas and dma pointers
IB/mlx5: Enhance UMR support to allow partial page table update
IB/core: Replace ib_umem's offset field with a full address
IB/core: Add umem function to read data from user-space
IB/mlx5: Add function to read WQE from user-space

Regards,
Haggai

Changes from V1: http://www.spinics.net/lists/linux-rdma/msg20734.html
- Rebased against latest upstream (3.18-rc2).
- Added patch 1: remove the mr dma and pas fields which are no longer needed.
- Replace extended query device patch 1 with Eli Cohen's recent submission from
  the extended atomic series [1].
- Patch 3: respect umem's page size when calculating offset and start address.
- Patch 8: fix error handling in ib_umem_odp_map_dma_pages
- Patch 9:
  - Add a global mmu notifier counter (per ucontext) to prevent the race that 
existed in v1.
  - Make accesses to the per-umem notifier counters non-atomic (use 
ACCESS_ONCE).
  - Rename ucontext->umem_mutex as ucontext->umem_rwsem to reflect it being a 
semaphore.
- Patch 15: fix error handling in pagefault_single_data_segment
- Patch 17: timeout when waiting for an active mmu notifier to complete
- Add RC RDMA read support to the patch-set.
- Minor fixes.

Changes from V0: http://marc.info/?l=linux-rdma&m=139375790322547&w=2

- Rebased against latest upstream / for-next branch.
- Removed dependency on patches that were accepted upstream.
- Removed pre-patches that were accepted upstream [2].
- Add extended uverb call for querying device (patch 1) and use kernel device
  attributes to report ODP capabilities through the new uverb entry instead of
  having a special verb.
- Allow upgrading page access permissions during page faults.
- Minor fixes to issues that came up during regression testing of the patches.

The following set of patches implements on-demand paging (ODP) support
in the RDMA stack and in the mlx5_ib Infiniband driver.

What is on-demand paging?

Applications register memory with an RDMA adapter using system calls,
and subsequently post IO operations that refer to the corresponding
virtual addresses directly to HW. Until now, this was achieved by
pinning the memory during the registration calls. The goal of on demand
paging is to avoid pinning the pages of registered memory regions (MRs).
This will allow users the same flexibility they get when swapping any
other part of their process's address space. Instead of requiring the
entire MR to fit in physical memory, we can allow the MR to be larger,
and only fit the current working set in physical memory.

This can make programming with RDMA much simpler. Today, developers that
are working with more data than their RAM can hold need either to
deregister and reregister memory regions throughout their process's
life, or keep a single memory region and copy the data to it. On demand
paging will allow these developers to register a single MR at the
beginning of their process's life, and let the operating system manage
which pages need to be fetched at a given time. In the future, we might
be able to provide a single memory access key for each process that
would provide the entire process's address space as one large memory region,
and the developers wouldn't need to register memory regions at all.
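
To make the usage model concrete, registration of such a single large MR
would look roughly like the sketch below. This is illustrative only: it
assumes the kernel's IB_ACCESS_ON_DEMAND flag is eventually exposed to
userspace under a name such as IBV_ACCESS_ON_DEMAND, which is not part of
this kernel series.

/* Hedged sketch: register one large, unpinned (ODP) MR up front and keep
 * it for the lifetime of the process.  IBV_ACCESS_ON_DEMAND is an assumed
 * userspace spelling of the kernel's IB_ACCESS_ON_DEMAND flag.
 */
#include <infiniband/verbs.h>

static struct ibv_mr *register_working_set(struct ibv_pd *pd,
					   void *buf, size_t len)
{
	int access = IBV_ACCESS_LOCAL_WRITE |
		     IBV_ACCESS_REMOTE_READ |
		     IBV_ACCESS_REMOTE_WRITE |
		     IBV_ACCESS_ON_DEMAND;	/* pages are faulted in, not pinned */

	return ibv_reg_mr(pd, buf, len, access);	/* NULL on failure */
}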

How do page faults generally work?

With pinned memory regions, the driver would map the virtual addresses
to bus addresses, and pass these addresses to the HCA to associate them
with the new MR. With ODP, the driver is now allowed to mark some of the
pages in the MR as not-present. When the HCA attempts to perform memory
access for a communication operation, it notices the page is not
present, and raises a page fault event to the driver. In addition, the
HCA performs whatever operation is required by the transport protocol to
suspend communication until the page fault is resolved.

Upon receiving the page fault interrupt, the driver first needs to know
on which virtual address the page fault occurred, and on what memory
key. When handling send/receive operations, this information is inside
the work queue. The driver reads the needed work queue elements, and
parses them to gather the address and memory key. For other RDMA
operations, the event generated by the HCA only contains the virtual
address and rkey, as there are no work queue elements involved.

Having the rkey, the driver can find the relevant memory region in its
data structures, and calculate the actual pages needed to complete the
operation. It then uses get_user_pages to retrieve the needed pages back
to the

[PATCH v2 12/17] IB/mlx5: Changes in memory region creation to support on-demand paging

2014-11-11 Thread Haggai Eran
This patch wraps together several changes needed for on-demand paging support
in the mlx5_ib_populate_pas function, and when registering memory regions.

* Instead of accepting a UMR bit telling the function to enable all access
  flags, the function now accepts the access flags themselves.
* For on-demand paging memory regions, fill the memory tables from the
  correct list, and enable/disable the access flags per-page according to
  whether the page is present.
* A new bit is set to enable writing of access flags when using the firmware
  create_mkey command.
* Disable contig pages when on-demand paging is enabled.

In addition the patch changes the UMR code to use PTR_ALIGN instead of our own
macro.

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/mem.c | 58 ++--
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 12 +++-
 drivers/infiniband/hw/mlx5/mr.c  | 33 +++-
 include/linux/mlx5/device.h  |  3 ++
 4 files changed, 88 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index dae07eae9507..5f7b30147180 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -32,6 +32,7 @@
 
 #include 
 #include 
+#include 
 #include "mlx5_ib.h"
 
 /* @umem: umem object to scan
@@ -57,6 +58,17 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int 
*count, int *shift,
int entry;
unsigned long page_shift = ilog2(umem->page_size);
 
+   /* With ODP we must always match OS page size. */
+   if (umem->odp_data) {
+   *count = ib_umem_page_count(umem);
+   *shift = PAGE_SHIFT;
+   *ncont = *count;
+   if (order)
+   *order = ilog2(roundup_pow_of_two(*count));
+
+   return;
+   }
+
addr = addr >> page_shift;
tmp = (unsigned long)addr;
m = find_first_bit(&tmp, sizeof(tmp));
@@ -108,8 +120,32 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, 
int *count, int *shift,
*count = i;
 }
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
+{
+   u64 mtt_entry = umem_dma & ODP_DMA_ADDR_MASK;
+
+   if (umem_dma & ODP_READ_ALLOWED_BIT)
+   mtt_entry |= MLX5_IB_MTT_READ;
+   if (umem_dma & ODP_WRITE_ALLOWED_BIT)
+   mtt_entry |= MLX5_IB_MTT_WRITE;
+
+   return mtt_entry;
+}
+#endif
+
+/*
+ * Populate the given array with bus addresses from the umem.
+ *
+ * dev - mlx5_ib device
+ * umem - umem to use to fill the pages
+ * page_shift - determines the page size used in the resulting array
+ * pas - bus addresses array to fill
+ * access_flags - access flags to set on all present pages.
+ use enum mlx5_ib_mtt_access_flags for this.
+ */
 void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
- int page_shift, __be64 *pas, int umr)
+ int page_shift, __be64 *pas, int access_flags)
 {
unsigned long umem_page_shift = ilog2(umem->page_size);
int shift = page_shift - umem_page_shift;
@@ -120,6 +156,23 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct 
ib_umem *umem,
int len;
struct scatterlist *sg;
int entry;
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   const bool odp = umem->odp_data != NULL;
+
+   if (odp) {
+   int num_pages = ib_umem_num_pages(umem);
+
+   WARN_ON(shift != 0);
+   WARN_ON(access_flags != (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE));
+
+   for (i = 0; i < num_pages; ++i) {
+   dma_addr_t pa = umem->odp_data->dma_list[i];
+
+   pas[i] = cpu_to_be64(umem_dma_to_mtt(pa));
+   }
+   return;
+   }
+#endif
 
i = 0;
for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
@@ -128,8 +181,7 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct 
ib_umem *umem,
for (k = 0; k < len; k++) {
if (!(i & mask)) {
cur = base + (k << umem_page_shift);
-   if (umr)
-   cur |= 3;
+   cur |= access_flags;
 
pas[i >> shift] = cpu_to_be64(cur);
mlx5_ib_dbg(dev, "pas[%d] 0x%llx\n",
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index cc50fce8cca7..83c1690e9dd0 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -268,6 +268,13 @@ struct mlx5_ib_xrcd {
u32 xrcdn;
 };
 
+enum mlx5_ib_mtt_access_flags {
+   MLX5_IB_MTT_READ  = (1 << 0),
+   MLX5_IB_MTT_WRITE = (1 << 1),
+};
+
+#define MLX5_IB_MTT_PRESENT (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE)
+
 

[PATCH v2 04/17] IB/core: Add umem function to read data from user-space

2014-11-11 Thread Haggai Eran
In some drivers there's a need to read data from a user space area that
was pinned using ib_umem, when running from a different process context.

The ib_umem_copy_from function allows reading data from the physical pages
pinned in the ib_umem struct.
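
A possible caller looks like the hedged sketch below; the real consumer, which
reads WQEs from a user-space queue buffer in mlx5, is added later in this
series. The function name and the 64-byte size here are illustrative only.

/* Hedged sketch: copy the first 64 bytes (one basic block) out of a umem
 * that backs a user-space work queue.
 */
static int read_first_wqe(struct ib_umem *qp_umem, void *wqe_buf)
{
	int copied = ib_umem_copy_from(qp_umem, 0, wqe_buf, 64);

	if (copied < 0)
		return copied;		/* offset/length outside the umem */

	return copied == 64 ? 0 : -EFAULT;
}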

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/core/umem.c | 26 ++
 include/rdma/ib_umem.h |  2 ++
 2 files changed, 28 insertions(+)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index e0f883292374..77bec75963e7 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -292,3 +292,29 @@ int ib_umem_page_count(struct ib_umem *umem)
return n;
 }
 EXPORT_SYMBOL(ib_umem_page_count);
+
+/*
+ * Copy from the given ib_umem's pages to the given buffer.
+ *
+ * umem - the umem to copy from
+ * offset - offset to start copying from
+ * dst - destination buffer
+ * length - buffer length
+ *
+ * Returns the number of copied bytes, or an error code.
+ */
+int ib_umem_copy_from(struct ib_umem *umem, size_t offset, void *dst,
+ size_t length)
+{
+   size_t end = offset + length;
+
+   if (offset > umem->length || end > umem->length || end < offset) {
+   pr_err("ib_umem_copy_from not in range. offset: %zd umem 
length: %zd end: %zd\n",
+  offset, umem->length, end);
+   return -EINVAL;
+   }
+
+   return sg_pcopy_to_buffer(umem->sg_head.sgl, umem->nmap, dst, length,
+   offset + ib_umem_offset(umem));
+}
+EXPORT_SYMBOL(ib_umem_copy_from);
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 7ed6d4ff58dc..ee897724cbf8 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -84,6 +84,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
size_t size, int access, int dmasync);
 void ib_umem_release(struct ib_umem *umem);
 int ib_umem_page_count(struct ib_umem *umem);
+int ib_umem_copy_from(struct ib_umem *umem, size_t start, void *dst,
+ size_t length);
 
 #else /* CONFIG_INFINIBAND_USER_MEM */
 
-- 
1.7.11.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 03/17] IB/core: Replace ib_umem's offset field with a full address

2014-11-11 Thread Haggai Eran
In order to allow umems that do not pin memory we need the umem to keep track
of its region's address.

This makes the offset field redundant, and so this patch removes it.
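
The helpers themselves are added to include/rdma/ib_umem.h (that hunk is not
fully shown below). Based on the computations they replace, they presumably
boil down to something like this hedged reconstruction:

/* Hedged reconstruction of the new helpers, inferred from the call-site
 * changes; the authoritative definitions live in include/rdma/ib_umem.h.
 */
static inline int ib_umem_offset(struct ib_umem *umem)
{
	return umem->address & ~PAGE_MASK;	/* offset within the first page */
}

static inline size_t ib_umem_num_pages(struct ib_umem *umem)
{
	return PAGE_ALIGN(umem->length + ib_umem_offset(umem)) >> PAGE_SHIFT;
}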

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/core/umem.c   |  6 +++---
 drivers/infiniband/hw/amso1100/c2_provider.c |  2 +-
 drivers/infiniband/hw/ehca/ehca_mrmw.c   |  2 +-
 drivers/infiniband/hw/ipath/ipath_mr.c   |  2 +-
 drivers/infiniband/hw/nes/nes_verbs.c|  4 ++--
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |  2 +-
 drivers/infiniband/hw/qib/qib_mr.c   |  2 +-
 include/rdma/ib_umem.h   | 25 -
 8 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index df0c4f605a21..e0f883292374 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -103,7 +103,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
 
umem->context   = context;
umem->length= size;
-   umem->offset= addr & ~PAGE_MASK;
+   umem->address   = addr;
umem->page_size = PAGE_SIZE;
umem->pid   = get_task_pid(current, PIDTYPE_PID);
/*
@@ -132,7 +132,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
if (!vma_list)
umem->hugetlb = 0;
 
-   npages = PAGE_ALIGN(size + umem->offset) >> PAGE_SHIFT;
+   npages = ib_umem_num_pages(umem);
 
down_write(¤t->mm->mmap_sem);
 
@@ -246,7 +246,7 @@ void ib_umem_release(struct ib_umem *umem)
if (!mm)
goto out;
 
-   diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT;
+   diff = ib_umem_num_pages(umem);
 
/*
 * We may be called with the mm's mmap_sem already held.  This
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index 2d5cbf4363e4..bdf3507810cb 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -476,7 +476,7 @@ static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, u64 
start, u64 length,
 c2mr->umem->page_size,
 i,
 length,
-c2mr->umem->offset,
+ib_umem_offset(c2mr->umem),
 &kva,
 c2_convert_access(acc),
 c2mr);
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index 3488e8c9fcb4..f914b30999f8 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -399,7 +399,7 @@ reg_user_mr_fallback:
pginfo.num_kpages = num_kpages;
pginfo.num_hwpages = num_hwpages;
pginfo.u.usr.region = e_mr->umem;
-   pginfo.next_hwpage = e_mr->umem->offset / hwpage_size;
+   pginfo.next_hwpage = ib_umem_offset(e_mr->umem) / hwpage_size;
pginfo.u.usr.next_sg = pginfo.u.usr.region->sg_head.sgl;
ret = ehca_reg_mr(shca, e_mr, (u64 *)virt, length, mr_access_flags,
  e_pd, &pginfo, &e_mr->ib.ib_mr.lkey,
diff --git a/drivers/infiniband/hw/ipath/ipath_mr.c b/drivers/infiniband/hw/ipath/ipath_mr.c
index 5e61e9bff697..c7278f6a8217 100644
--- a/drivers/infiniband/hw/ipath/ipath_mr.c
+++ b/drivers/infiniband/hw/ipath/ipath_mr.c
@@ -214,7 +214,7 @@ struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, u64 
start, u64 length,
mr->mr.user_base = start;
mr->mr.iova = virt_addr;
mr->mr.length = length;
-   mr->mr.offset = umem->offset;
+   mr->mr.offset = ib_umem_offset(umem);
mr->mr.access_flags = mr_access_flags;
mr->mr.max_segs = n;
mr->umem = umem;
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index fef067c959fc..5192fb61e0be 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -2343,7 +2343,7 @@ static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, 
u64 start, u64 length,
(unsigned long int)start, (unsigned long int)virt, 
(u32)length,
region->offset, region->page_size);
 
-   skip_pages = ((u32)region->offset) >> 12;
+   skip_pages = ((u32)ib_umem_offset(region)) >> 12;
 
if (ib_copy_from_udata(&req, udata, sizeof(req))) {
ib_umem_release(region);
@@ -2408,7 +2408,7 @@ static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, 
u64 start, u64 length,
region_length -= skip_pages << 12;
for (page_index = skip_pages; page_index < 
chunk_pages; page_index++) {
skip_pa

[PATCH v2 02/17] IB/mlx5: Enhance UMR support to allow partial page table update

2014-11-11 Thread Haggai Eran
The current UMR interface doesn't allow partial updates to a memory region's
page tables. This patch changes the interface to allow that.

It also changes the way the UMR operation validates the memory region's state.
When set, MLX5_IB_SEND_UMR_FAIL_IF_FREE causes the UMR operation to fail if the
MKEY is in the free state. When it is not set, the operation instead fails if
the MKEY is not in the free state.
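
For illustration, a partial page-table update using the new interface would be
built roughly as in the sketch below. This is not a code path from the patch
and the helper name is made up; it only shows how the new send flags and the
mlx5_umr_wr fields from the diff fit together. A real user would also attach
an sge describing the new MTT entries, which is omitted here.

/* Hedged sketch: build a UMR work request that updates part of an existing
 * MR's page tables.
 */
static void prep_partial_mtt_update(struct ib_send_wr *wr, u32 mkey,
				    u64 start_offset, unsigned int npages)
{
	struct mlx5_umr_wr *umrwr = (struct mlx5_umr_wr *)&wr->wr.fast_reg;

	memset(wr, 0, sizeof(*wr));
	wr->opcode = MLX5_IB_WR_UMR;
	wr->send_flags = MLX5_IB_SEND_UMR_UPDATE_MTT |
			 MLX5_IB_SEND_UMR_FAIL_IF_FREE;	/* the MR must not be free */
	umrwr->mkey = mkey;
	umrwr->target.offset = start_offset;	/* first page-table entry to update */
	umrwr->npages = npages;
}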

Signed-off-by: Haggai Eran 
Signed-off-by: Shachar Raindel 
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 15 ++
 drivers/infiniband/hw/mlx5/mr.c  | 23 +
 drivers/infiniband/hw/mlx5/qp.c  | 96 +++-
 include/linux/mlx5/device.h  |  9 
 4 files changed, 100 insertions(+), 43 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 29da55222070..53d19e6e69a4 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -111,6 +111,8 @@ struct mlx5_ib_pd {
  */
 
 #define MLX5_IB_SEND_UMR_UNREG IB_SEND_RESERVED_START
+#define MLX5_IB_SEND_UMR_FAIL_IF_FREE (IB_SEND_RESERVED_START << 1)
+#define MLX5_IB_SEND_UMR_UPDATE_MTT (IB_SEND_RESERVED_START << 2)
 #define MLX5_IB_QPT_REG_UMRIB_QPT_RESERVED1
 #define MLX5_IB_WR_UMR IB_WR_RESERVED1
 
@@ -206,6 +208,19 @@ enum mlx5_ib_qp_flags {
MLX5_IB_QP_SIGNATURE_HANDLING   = 1 << 1,
 };
 
+struct mlx5_umr_wr {
+   union {
+   u64 virt_addr;
+   u64 offset;
+   } target;
+   struct ib_pd   *pd;
+   unsigned intpage_shift;
+   unsigned intnpages;
+   u32 length;
+   int access_flags;
+   u32 mkey;
+};
+
 struct mlx5_shared_mr_info {
int mr_id;
struct ib_umem  *umem;
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 610500810f75..aee3527030ac 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "mlx5_ib.h"
 
 enum {
@@ -675,6 +676,7 @@ static void prep_umr_reg_wqe(struct ib_pd *pd, struct 
ib_send_wr *wr,
 {
struct mlx5_ib_dev *dev = to_mdev(pd->device);
struct ib_mr *mr = dev->umrc.mr;
+   struct mlx5_umr_wr *umrwr = (struct mlx5_umr_wr *)&wr->wr.fast_reg;
 
sg->addr = dma;
sg->length = ALIGN(sizeof(u64) * n, 64);
@@ -689,21 +691,24 @@ static void prep_umr_reg_wqe(struct ib_pd *pd, struct 
ib_send_wr *wr,
wr->num_sge = 0;
 
wr->opcode = MLX5_IB_WR_UMR;
-   wr->wr.fast_reg.page_list_len = n;
-   wr->wr.fast_reg.page_shift = page_shift;
-   wr->wr.fast_reg.rkey = key;
-   wr->wr.fast_reg.iova_start = virt_addr;
-   wr->wr.fast_reg.length = len;
-   wr->wr.fast_reg.access_flags = access_flags;
-   wr->wr.fast_reg.page_list = (struct ib_fast_reg_page_list *)pd;
+
+   umrwr->npages = n;
+   umrwr->page_shift = page_shift;
+   umrwr->mkey = key;
+   umrwr->target.virt_addr = virt_addr;
+   umrwr->length = len;
+   umrwr->access_flags = access_flags;
+   umrwr->pd = pd;
 }
 
 static void prep_umr_unreg_wqe(struct mlx5_ib_dev *dev,
   struct ib_send_wr *wr, u32 key)
 {
-   wr->send_flags = MLX5_IB_SEND_UMR_UNREG;
+   struct mlx5_umr_wr *umrwr = (struct mlx5_umr_wr *)&wr->wr.fast_reg;
+
+   wr->send_flags = MLX5_IB_SEND_UMR_UNREG | MLX5_IB_SEND_UMR_FAIL_IF_FREE;
wr->opcode = MLX5_IB_WR_UMR;
-   wr->wr.fast_reg.rkey = key;
+   umrwr->mkey = key;
 }
 
 void mlx5_umr_cq_handler(struct ib_cq *cq, void *cq_context)
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index e261a53f9a02..7f362afa1a38 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -70,15 +70,6 @@ static const u32 mlx5_ib_opcode[] = {
[MLX5_IB_WR_UMR]= MLX5_OPCODE_UMR,
 };
 
-struct umr_wr {
-   u64 virt_addr;
-   struct ib_pd   *pd;
-   unsigned intpage_shift;
-   unsigned intnpages;
-   u32 length;
-   int access_flags;
-   u32 mkey;
-};
 
 static int is_qp0(enum ib_qp_type qp_type)
 {
@@ -1838,37 +1829,70 @@ static void set_frwr_umr_segment(struct 
mlx5_wqe_umr_ctrl_seg *umr,
umr->mkey_mask = frwr_mkey_mask();
 }
 
+static __be64 get_umr_reg_mr_mask(void)
+{
+   u64 result;
+
+   result = MLX5_MKEY_MASK_LEN |
+MLX5_MKEY_MASK_PAGE_SIZE   |
+MLX5_MKEY_MASK_START_ADDR  |
+MLX5_MKEY_MASK_PD  |
+MLX5_M

[PATCH v2 06/17] IB/core: Add support for extended query device caps

2014-11-11 Thread Haggai Eran
From: Eli Cohen 

Add extensible query device capabilities verb to allow adding new features.
ib_uverbs_ex_query_device is added and copy_query_dev_fields is used to copy
capability fields to be used by both ib_uverbs_query_device and
ib_uverbs_ex_query_device.

Signed-off-by: Eli Cohen 
---
 drivers/infiniband/core/uverbs.h  |   1 +
 drivers/infiniband/core/uverbs_cmd.c  | 121 ++
 drivers/infiniband/core/uverbs_main.c |   3 +-
 include/rdma/ib_verbs.h   |   5 +-
 include/uapi/rdma/ib_user_verbs.h |  12 +++-
 5 files changed, 98 insertions(+), 44 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index 643c08a025a5..b716b0815644 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -258,5 +258,6 @@ IB_UVERBS_DECLARE_CMD(close_xrcd);
 
 IB_UVERBS_DECLARE_EX_CMD(create_flow);
 IB_UVERBS_DECLARE_EX_CMD(destroy_flow);
+IB_UVERBS_DECLARE_EX_CMD(query_device);
 
 #endif /* UVERBS_H */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 5ba2a86aab6a..74ad0d0de92b 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -378,6 +378,52 @@ err:
return ret;
 }
 
+static void copy_query_dev_fields(struct ib_uverbs_file *file,
+ struct ib_uverbs_query_device_resp *resp,
+ struct ib_device_attr *attr)
+{
+   resp->fw_ver= attr->fw_ver;
+   resp->node_guid = file->device->ib_dev->node_guid;
+   resp->sys_image_guid= attr->sys_image_guid;
+   resp->max_mr_size   = attr->max_mr_size;
+   resp->page_size_cap = attr->page_size_cap;
+   resp->vendor_id = attr->vendor_id;
+   resp->vendor_part_id= attr->vendor_part_id;
+   resp->hw_ver= attr->hw_ver;
+   resp->max_qp= attr->max_qp;
+   resp->max_qp_wr = attr->max_qp_wr;
+   resp->device_cap_flags  = attr->device_cap_flags;
+   resp->max_sge   = attr->max_sge;
+   resp->max_sge_rd= attr->max_sge_rd;
+   resp->max_cq= attr->max_cq;
+   resp->max_cqe   = attr->max_cqe;
+   resp->max_mr= attr->max_mr;
+   resp->max_pd= attr->max_pd;
+   resp->max_qp_rd_atom= attr->max_qp_rd_atom;
+   resp->max_ee_rd_atom= attr->max_ee_rd_atom;
+   resp->max_res_rd_atom   = attr->max_res_rd_atom;
+   resp->max_qp_init_rd_atom   = attr->max_qp_init_rd_atom;
+   resp->max_ee_init_rd_atom   = attr->max_ee_init_rd_atom;
+   resp->atomic_cap= attr->atomic_cap;
+   resp->max_ee= attr->max_ee;
+   resp->max_rdd   = attr->max_rdd;
+   resp->max_mw= attr->max_mw;
+   resp->max_raw_ipv6_qp   = attr->max_raw_ipv6_qp;
+   resp->max_raw_ethy_qp   = attr->max_raw_ethy_qp;
+   resp->max_mcast_grp = attr->max_mcast_grp;
+   resp->max_mcast_qp_attach   = attr->max_mcast_qp_attach;
+   resp->max_total_mcast_qp_attach = attr->max_total_mcast_qp_attach;
+   resp->max_ah= attr->max_ah;
+   resp->max_fmr   = attr->max_fmr;
+   resp->max_map_per_fmr   = attr->max_map_per_fmr;
+   resp->max_srq   = attr->max_srq;
+   resp->max_srq_wr= attr->max_srq_wr;
+   resp->max_srq_sge   = attr->max_srq_sge;
+   resp->max_pkeys = attr->max_pkeys;
+   resp->local_ca_ack_delay= attr->local_ca_ack_delay;
+   resp->phys_port_cnt = file->device->ib_dev->phys_port_cnt;
+}
+
 ssize_t ib_uverbs_query_device(struct ib_uverbs_file *file,
   const char __user *buf,
   int in_len, int out_len)
@@ -398,47 +444,7 @@ ssize_t ib_uverbs_query_device(struct ib_uverbs_file *file,
return ret;
 
memset(&resp, 0, sizeof resp);
-
-   resp.fw_ver= attr.fw_ver;
-   resp.node_guid = file->device->ib_dev->node_guid;
-   resp.sys_image_guid= attr.sys_image_guid;
-   resp.max_mr_size   = attr.max_mr_size;
-   resp.page_size_cap = attr.page_size_cap;
-   resp.vendor_id = attr.vendor_id;
-   resp.vendor_part_id= attr.vendor_part_id;
-   resp.hw_ver= attr.hw_ver;
-   resp.max_qp= attr.max_qp;
-   resp.max_qp_wr = attr.max_qp_wr;
-   resp.device_cap_flags  = attr.device_cap_flags;
-   resp.max_sge   = attr.max_sge;
-   resp.max_sge_rd= attr.max_sge_rd;
-   resp.max_cq= attr.max_cq;
-   resp.max_cqe  

Re: [PATCH] infiniband-diags: add rdma-ndd daemon

2014-11-11 Thread Hal Rosenstock
On 11/10/2014 1:32 PM, Weiny, Ira wrote:
> I think changing the default is a worthwhile change.  In addition, alternate 
> admin policies are aided by the 
> general use of the %h specifier.
> 
>   1) SM's which periodically scan the Node Description always get up to 
> date hostname info.
>   2) Up to date hostname info is provided even if a user space daemons 
> fails.
>   Note: OpenSM has an update node description feature for this 
> condition.
>   3) Low level diag tools always get up to date hostname info

There is a node description local-changes trap which should cause an SM
to reread the updated NodeDescription. This assumes the SMA is compliant
and issues such a trap when the ND changes, and that this occurs after LinkUp.

Also, there is a new optional (SMA) feature (@ 1.3) to set the
NodeDescription. Not sure if this helps here.

> Do you believe that " " is an unreasonable default?
> 
> Roland, Hal, do you have any input?

I was concerned with having the format be self identifying so that it
can be easily distinguishable from other known formats being used in the
field.

As you wrote, given that RedHat is already using this format, we are
already "living" with this.

-- Hal
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 07/10] xprtrdma: Display async errors

2014-11-11 Thread Sagi Grimberg

On 11/9/2014 3:15 AM, Chuck Lever wrote:

An async error upcall is a hard error, and should be reported in
the system log.



Could be useful to others... Any chance you put this in ib_core for all
of us?
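
Something along the lines of this completely untested sketch, living in
ib_core and registered per device, is what I have in mind (names are made up):

/* Untested sketch only - a shared async error logger so that each ULP does
 * not need to duplicate it.  Registration/unregistration plumbing omitted.
 */
static void ib_log_async_event(struct ib_event_handler *handler,
			       struct ib_event *event)
{
	pr_err("%s: async error event %d on device %s\n",
	       __func__, event->event, event->device->name);
}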

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FMR Support in multi-function environment

2014-11-11 Thread Bob Biloxi
> In SRIOV, FMR is supported only for the PF, not for VFs (since this
> feature requires writing directly to mapped ICM memory).

Hi,

Thank you so much for pointing to the exact code!!

I have a related question. I was trying to figure out the use case for
FMR in this environment(where in only PF supports FMR).


As per my understanding, if an application wants to register huge
amounts of memory and avoid the overhead of the SW2HW_MPT HCR command,
it can do so using the alloc_fmr verb.
Now in the SRIOV case, the application sits on top of the VF driver,
and I was curious how the VF communicates with the PF driver to
register/map memory using FMR.

More specifically, I was trying to understand how the following
function gets called by the VF driver:


int mlx4_map_phys_fmr(struct mlx4_dev *dev, struct mlx4_fmr *fmr, u64
*page_list,
 int npages, u64 iova, u32 *lkey, u32 *rkey)
{
u32 key;
int i, err;

err = mlx4_check_fmr(fmr, page_list, npages, iova);
if (err)
return err;

++fmr->maps;

key = key_to_hw_index(fmr->mr.key);
key += dev->caps.num_mpts;
*lkey = *rkey = fmr->mr.key = hw_index_to_key(key);

*(u8 *) fmr->mpt = MLX4_MPT_STATUS_SW;

/* Make sure MPT status is visible before writing MTT entries */
wmb();

dma_sync_single_for_cpu(&dev->pdev->dev, fmr->dma_handle,
npages * sizeof(u64), DMA_TO_DEVICE);

for (i = 0; i < npages; ++i)
fmr->mtts[i] = cpu_to_be64(page_list[i] | MLX4_MTT_FLAG_PRESENT);

dma_sync_single_for_device(&dev->pdev->dev, fmr->dma_handle,
  npages * sizeof(u64), DMA_TO_DEVICE);

fmr->mpt->key= cpu_to_be32(key);
fmr->mpt->lkey   = cpu_to_be32(key);
fmr->mpt->length = cpu_to_be64(npages * (1ull << fmr->page_shift));
fmr->mpt->start  = cpu_to_be64(iova);

/* Make MTT entries are visible before setting MPT status */
wmb();

*(u8 *) fmr->mpt = MLX4_MPT_STATUS_HW;

/* Make sure MPT status is visible before consumer can use FMR */
wmb();

return 0;
}

Because the way I understood it, a VF can communicate with the PF driver
by posting VHCR commands, which cause an event to be generated on the PF
side. I can see _WRAPPER calls to handle those cases.

As there doesn't seem to be an FMR-related VHCR command
(virtual/para-virtual command), I was struggling to understand how the
flow happens for FMR from application->kernel->VF-driver->PF-driver.

I would be very grateful if you could help me understand this.


Thanks so much!! Your replies really helped me improve my understanding.



Best Regards,
Bob





On Tue, Nov 11, 2014 at 4:24 PM, Jack Morgenstein
 wrote:
> On Mon, 10 Nov 2014 19:58:46 +0530
> Bob Biloxi  wrote:
>
>> Hi,
>>
>> Is FMR (Fast Memory Regions) supported in a multi-function mode?
>
> In SRIOV, FMR is supported only for the PF, not for VFs (since this
> feature requires writing directly to mapped ICM memory).
>
> You can see this in file drivers/infiniband/hw/mlx4/main.c, function
> mlx4_ib_add() :
>
>
> if (!mlx4_is_slave(ibdev->dev)) {
> ibdev->ib_dev.alloc_fmr = mlx4_ib_fmr_alloc;
> ibdev->ib_dev.map_phys_fmr  = mlx4_ib_map_phys_fmr;
> ibdev->ib_dev.unmap_fmr = mlx4_ib_unmap_fmr;
> ibdev->ib_dev.dealloc_fmr   = mlx4_ib_fmr_dealloc;
> }
>
> i.e., the fmr functions are not put into the device virtual function
> table for slave (= VF) devices.
>
> -Jack
>
>>
>> If yes, I couldn't find the source code for the same in the mlx4
>> codebase. Can anyone please point me to the right location...
>>
>> What I was trying to understand is this:
>>
>> Suppose a VF driver wants to register large amount of memory using
>> FMR, will it be able to do so using the mlx4 code.
>>
>> Or FMR is supported only in dedicated mode?
>>
>>
>> Thanks
>>
>> Best Regards,
>> Bob
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma"
>> in the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FMR Support in multi-function environment

2014-11-11 Thread Jack Morgenstein
On Mon, 10 Nov 2014 19:58:46 +0530
Bob Biloxi  wrote:

> Hi,
> 
> Is FMR (Fast Memory Regions) supported in a multi-function mode?

In SRIOV, FMR is supported only for the PF, not for VFs (since this
feature requires writing directly to mapped ICM memory).

You can see this in file drivers/infiniband/hw/mlx4/main.c, function
mlx4_ib_add() :


if (!mlx4_is_slave(ibdev->dev)) {
ibdev->ib_dev.alloc_fmr = mlx4_ib_fmr_alloc;
ibdev->ib_dev.map_phys_fmr  = mlx4_ib_map_phys_fmr;
ibdev->ib_dev.unmap_fmr = mlx4_ib_unmap_fmr;
ibdev->ib_dev.dealloc_fmr   = mlx4_ib_fmr_dealloc;
}

i.e., the fmr functions are not put into the device virtual function
table for slave (= VF) devices.

-Jack

> 
> If yes, I couldn't find the source code for the same in the mlx4
> codebase. Can anyone please point me to the right location...
> 
> What I was trying to understand is this:
> 
> Suppose a VF driver wants to register large amount of memory using
> FMR, will it be able to do so using the mlx4 code.
> 
> Or FMR is supported only in dedicated mode?
> 
> 
> Thanks
> 
> Best Regards,
> Bob
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma"
> in the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


ibv_reg_mr fails to register memory

2014-11-11 Thread Mohammed Rafi K C
Hi All,

I was trying to implement an application that supports RDMA using
librdmacm. When the process tries to register buffers using
ibv_reg_mr, it fails after registering some number of buffers.

We are using Mellanox Technologies MT27500 Family [ConnectX-3] cards. 

Reconfigured options are:

1) cat /sys/module/mlx4_core/parameters/log_num_mtt
24
2) cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
7

Can anyone point out the reason why ibv_reg_mr fails?
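
My understanding (which may well be wrong) is that these two parameters bound
the total amount of registerable memory roughly as

    max_reg_mem = 2^log_num_mtt * 2^log_mtts_per_seg * PAGE_SIZE
                = 2^24 * 2^7 * 4 KB = 8 TB

which is far more than we try to register, so I don't see what other limit we
are hitting.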

Thanks & Regards
Rafi KC

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 for-next 0/9] Peer-Direct support

2014-11-11 Thread Or Gerlitz
On Tue, Oct 28, 2014 at 8:29 PM, Coffman, Jerrie L
 wrote:

> [...] to obtain the source for review.  Steps are as follows:
> Download and extract the latest OFED-3.12-1 release (rc3 in this case) from 
> OpenFabrics.org
> cd OFED-3.12-1-rc3/SRPMS/
> rpm2cpio compat-rdma-3.12-1.1.g9594cac.src.rpm | cpio -vid
> tar xzf compat-rdma-3.12.tgz
> cd compat-rdma-3.12
> ofed_scripts/ofed_patch.sh --with-patchdir=tech-preview/xeon-phi/
> The code of interest is located in the IB proxy server located in 
> drivers/infiniband/ibp/drv directory [...]

I bet busy upstream kernel maintainers/reviewers would not go that
far... can you make it easily reviewable? that is, either send as
patches to this list or much easier put the relevant piece in public
git tree (github?)

> Possible upstreaming plans for CCL direct are still being discussed.  Since 
> the CCL direct code depends on a low level MIC driver called SCIF, we would 
> have to wait until the group at Intel that owns that driver gets it upstream 
> before attempting to upstream CCL direct [..]

I do see these three config directives set in my upstream clone, and
checking reveals they were
added in 3.13

# Intel MIC Bus Driver
CONFIG_INTEL_MIC_BUS=m
# Intel MIC Host Driver
CONFIG_INTEL_MIC_HOST=m
# Intel MIC Card Driver
CONFIG_INTEL_MIC_CARD=m

Can you point to the missing elements in the chart present @ this LWN
post http://lwn.net/Articles/564795/

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html