Re: Status of ummunot branch?

2013-06-05 Thread Haggai Eran
On 04/06/2013 20:04, Jason Gunthorpe wrote:
 Thus, I assume, on-demand allows pages that are 'absent' in the larger
 page table to generate faults to the CPU?
Yes, that's correct.

 So how does lifetime work here?
 
  - Can you populate the larger page table as soon as registration
happens, relying on mmu notifier and HCA faults to keep it
consistent?
We prefer not to keep the entire page table in sync, since we want to
allow registration of larger portions of the virtual address space, and
much of that memory isn't needed by the HCA.

  - After a fault happens are the faulted pages pinned?
After a page fault happens the faulted pages are mapped in using
get_user_pages, but they are immediately released.

 How does lifetime work here? What happens when the kernel wants to
 evict a page that has currently ongoing RDMA?
If the kernel tries to evict a page that is currently ongoing RDMA, the
driver will update the HCA before the kernel can free the page. If the
RDMA operation is still ongoing, it will trigger a page fault.

 What happens if user space munmaps something while the remote is
 doing RDMA to it?
We want to allow the user to register memory areas that are unmapped. We
only require that the user have some VMA backing the addresses used for
RDMA operations, during the course of these operations. If the user
munmaps something in the middle of an RDMA operation, this will trigger
a page fault, which will in turn close the QP doing the operation with
an error.

  - If I recall the presentation, the fault-in operation was very slow,
what is the cause for this?
Page faults involve stopping the QP, reading the WQE to get the page
ranges needed, bringing the pages to memory using get_user_pages,
updating the HCA's page table (and flushing its caches) and resuming the
QP. With short messages, the commands sent to the device are dominant,
while with larger messages, get_user_pages becomes dominant.

 
 He was very concerned about what the size of the TLB on the HCA,
 and therefore what the actual run-time behavior would be for
 sending around large messages via MPI -- i.e., would RDMA'ing 1GB
 messages now incur this
 HCA-must-reload-its-TLB-and-therefore-incur-RNR-NAKs behavior?

 We have a mechanism to prefetch the pages needed for a large message
 upon the first page fault, which can also help amortizing the cost of
 the page fault for larger messages.
 
 My reaction was that a pre-fault WR is needed to make this performant.
 
 But, I also don't fully understand why we need so many faults from the
 HCA in the first place. If you've properly solved the lifetime issues
 then the initial registration can meaningfully pre-initialize the page
 table in many cases, and computing the physical address of a page
 should not be so expensive.

We have implemented a prefetching verb, but I think that in many cases,
with smart enough prefetching logic in the page fault handler, it won't
be needed.

Haggai
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Status of ummunot branch?

2013-06-05 Thread Haggai Eran
On 04/06/2013 23:13, Jeff Squyres (jsquyres) wrote:
 On Jun 4, 2013, at 4:50 AM, Haggai Eran hagg...@mellanox.com wrote:
 
 Does this mean that an MPI implementation still has to register memory upon 
 usage, and maintain its own registered memory cache?
 Yes. However, since registration doesn't pin memory, you can leave
 registered memory regions in the cache for longer periods, and you can
 register larger memory regions without needing to back them with
 physical memory.
 
 Hmm; I'm confused.  How does this fix the MPI-needs-to-intercept-freed-memory 
 problem?
Well, there is no problem if an application frees registered memory (in
an on-demand paging memory region) and that memory is returned to the
OS. The OS will invalidate these pages, and the HCA will no longer be
able to use them. This means that the registration cache doesn't have to
de-register memory immediately when it is freed.

Haggai
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] RDMA/ocrdma: removed use_cnt for queues.

2013-06-05 Thread Gottumukkala, Naresh
Hi Roland,

Can we get this patch approved from you ? Can you please let us know your 
feedback ?

Thanks,
Naresh.

-Original Message-
From: Naresh 
Sent: Tuesday, May 28, 2013 3:43 PM
To: linux-rdma@vger.kernel.org
Cc: lnx-roce; Naresh
Subject: [PATCH] RDMA/ocrdma: removed use_cnt for queues.

From: Naresh Gottumukkala bgottumukk...@emulex.com

Removed use_cnt. Rely on OFED stack to keep track of the use count.

Signed-off-by: Naresh Gottumukkala bgottumukk...@emulex.com
---
 drivers/infiniband/hw/ocrdma/ocrdma.h   |  4 ---
 drivers/infiniband/hw/ocrdma/ocrdma_hw.c|  1 -
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c | 39 +
 3 files changed, 1 insertion(+), 43 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma.h 
b/drivers/infiniband/hw/ocrdma/ocrdma.h
index 48970af..21d99f6 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma.h
@@ -97,7 +97,6 @@ struct ocrdma_queue_info {
u16 id; /* qid, where to ring the doorbell. */
u16 head, tail;
bool created;
-   atomic_t used;  /* Number of valid elements in the queue */
 };
 
 struct ocrdma_eq {
@@ -198,7 +197,6 @@ struct ocrdma_cq {
struct ocrdma_ucontext *ucontext;
dma_addr_t pa;
u32 len;
-   atomic_t use_cnt;
 
/* head of all qp's sq and rq for which cqes need to be flushed
 * by the software.
@@ -210,7 +208,6 @@ struct ocrdma_pd {
struct ib_pd ibpd;
struct ocrdma_dev *dev;
struct ocrdma_ucontext *uctx;
-   atomic_t use_cnt;
u32 id;
int num_dpp_qp;
u32 dpp_page;
@@ -246,7 +243,6 @@ struct ocrdma_srq {
 
struct ocrdma_qp_hwq_info rq;
struct ocrdma_pd *pd;
-   atomic_t use_cnt;
u32 id;
u64 *rqe_wr_id_tbl;
u32 *idx_bit_fields;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
index 71942af..910b706 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
@@ -128,7 +128,6 @@ static inline struct ocrdma_mqe *ocrdma_get_mqe(struct 
ocrdma_dev *dev)  static inline void ocrdma_mq_inc_head(struct ocrdma_dev *dev) 
 {
dev-mq.sq.head = (dev-mq.sq.head + 1)  (OCRDMA_MQ_LEN - 1);
-   atomic_inc(dev-mq.sq.used);
 }
 
 static inline void *ocrdma_get_mqe_rsp(struct ocrdma_dev *dev) diff --git 
a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c 
b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index b29a424..38c145b 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -398,7 +398,6 @@ struct ib_pd *ocrdma_alloc_pd(struct ib_device *ibdev,
kfree(pd);
return ERR_PTR(status);
}
-   atomic_set(pd-use_cnt, 0);
 
if (udata  context) {
status = ocrdma_copy_pd_uresp(pd, context, udata); @@ -419,12 
+418,6 @@ int ocrdma_dealloc_pd(struct ib_pd *ibpd)
int status;
u64 usr_db;
 
-   if (atomic_read(pd-use_cnt)) {
-   ocrdma_err(%s(%d) pd=0x%x is in use.\n,
-  __func__, dev-id, pd-id);
-   status = -EFAULT;
-   goto dealloc_err;
-   }
status = ocrdma_mbx_dealloc_pd(dev, pd);
if (pd-uctx) {
u64 dpp_db = dev-nic_info.dpp_unmapped_addr + @@ -436,7 +429,6 
@@ int ocrdma_dealloc_pd(struct ib_pd *ibpd)
ocrdma_del_mmap(pd-uctx, usr_db, dev-nic_info.db_page_size);
}
kfree(pd);
-dealloc_err:
return status;
 }
 
@@ -474,7 +466,6 @@ static struct ocrdma_mr *ocrdma_alloc_lkey(struct ib_pd 
*ibpd,
return ERR_PTR(-ENOMEM);
}
mr-pd = pd;
-   atomic_inc(pd-use_cnt);
mr-ibmr.lkey = mr-hwmr.lkey;
if (mr-hwmr.remote_wr || mr-hwmr.remote_rd)
mr-ibmr.rkey = mr-hwmr.lkey;
@@ -664,7 +655,6 @@ struct ib_mr *ocrdma_reg_user_mr(struct ib_pd *ibpd, u64 
start, u64 len,
if (status)
goto mbx_err;
mr-pd = pd;
-   atomic_inc(pd-use_cnt);
mr-ibmr.lkey = mr-hwmr.lkey;
if (mr-hwmr.remote_wr || mr-hwmr.remote_rd)
mr-ibmr.rkey = mr-hwmr.lkey;
@@ -689,7 +679,6 @@ int ocrdma_dereg_mr(struct ib_mr *ib_mr)
if (mr-hwmr.fr_mr == 0)
ocrdma_free_mr_pbl_tbl(dev, mr-hwmr);
 
-   atomic_dec(mr-pd-use_cnt);
/* it could be user registered memory. */
if (mr-umem)
ib_umem_release(mr-umem);
@@ -752,7 +741,6 @@ struct ib_cq *ocrdma_create_cq(struct ib_device *ibdev, int 
entries, int vector,
 
spin_lock_init(cq-cq_lock);
spin_lock_init(cq-comp_handler_lock);
-   atomic_set(cq-use_cnt, 0);
INIT_LIST_HEAD(cq-sq_head);
INIT_LIST_HEAD(cq-rq_head);
cq-dev = dev;
@@ -799,9 +787,6 @@ int ocrdma_destroy_cq(struct ib_cq *ibcq)
struct ocrdma_cq *cq = 

[PATCH] osm_sm_state_mgr.c Don't clear IS_SM bit when changing state to NOT_ACTIVE

2013-06-05 Thread Line Holen
The SM is still operational even though it is in this state. Other
SMs will not know about our presence when IS_SM is cleared and will
therefor not attempt to enable us again.

Signed-off-by: Line Holen line.ho...@oracle.com

---

diff --git a/opensm/osm_sm_state_mgr.c b/opensm/osm_sm_state_mgr.c
index c996ea2..11defdd 100644
--- a/opensm/osm_sm_state_mgr.c
+++ b/opensm/osm_sm_state_mgr.c
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2013 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -330,7 +331,6 @@ ib_api_status_t osm_sm_state_mgr_process(osm_sm_t * sm,
 */
sm-p_subn-sm_state = IB_SMINFO_STATE_NOTACTIVE;
osm_report_sm_state(sm);
-   osm_vendor_set_sm(sm-mad_ctrl.h_bind, FALSE);
break;
case OSM_SM_SIGNAL_HANDOVER:
/*
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Log changes related to event subscription and forwarding

2013-06-05 Thread Line Holen
Signed-off-by: Line Holen line.ho...@oracle.com

---

diff --git a/opensm/osm_inform.c b/opensm/osm_inform.c
index 19bbe72..ef51953 100644
--- a/opensm/osm_inform.c
+++ b/opensm/osm_inform.c
@@ -305,10 +305,12 @@ static ib_api_status_t send_report(IN osm_infr_t * 
p_infr_rec,/* the informinfo
/* HACK: who switches or uses the src and dest GIDs in the grh_info ?? 
*/
 
/* it is better to use LIDs since the GIDs might not be there for SMI 
traps */
-   OSM_LOG(p_log, OSM_LOG_DEBUG, Forwarding Notice Event from LID:%u
-to InformInfo LID:%u TID:0x%X\n,
+   OSM_LOG(p_log, OSM_LOG_VERBOSE, Forwarding Notice Event from LID %u
+to InformInfo LID %u GUID 0x% PRIx64 , TID 0x%X\n,
cl_ntoh16(p_ntc-issuer_lid),
-   cl_ntoh16(p_infr_rec-report_addr.dest_lid), trap_fwd_trans_id);
+   cl_ntoh16(p_infr_rec-report_addr.dest_lid),
+   
cl_ntoh64(p_infr_rec-inform_record.subscriber_gid.unicast.interface_id),
+   trap_fwd_trans_id);
 
/* get the MAD to send */
p_report_madw = osm_mad_pool_get(p_infr_rec-sa-p_mad_pool,
diff --git a/opensm/osm_sa_informinfo.c b/opensm/osm_sa_informinfo.c
index 0b3e1f8..f32b88b 100644
--- a/opensm/osm_sa_informinfo.c
+++ b/opensm/osm_sa_informinfo.c
@@ -3,6 +3,7 @@
  * Copyright (c) 2002-2006 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2009 HNR Consulting. All rights reserved.
+ * Copyright (c) 2013 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -544,6 +545,10 @@ static void infr_rcv_process_set_method(osm_sa_t * sa, IN 
osm_madw_t * p_madw)
goto Exit;
}
 
+   OSM_LOG(sa-p_log, OSM_LOG_VERBOSE,
+   Adding event subscription for port 0x% PRIx64 
\n,
+   
cl_ntoh64(inform_info_rec.inform_record.subscriber_gid.unicast.interface_id));
+
/* Add this new osm_infr_t object to subnet object */
osm_infr_insert_to_db(sa-p_subn, sa-p_log, p_infr);
} else
@@ -561,9 +566,13 @@ static void infr_rcv_process_set_method(osm_sa_t * sa, IN 
osm_madw_t * p_madw)
p_recvd_inform_info-subscribe = 0;
osm_sa_send_error(sa, p_madw, IB_SA_MAD_STATUS_REQ_INVALID);
goto Exit;
-   } else
+   } else {
/* Delete this object from the subnet list of informs */
+   OSM_LOG(sa-p_log, OSM_LOG_VERBOSE,
+   Removing event subscription for port 0x% PRIx64 \n,
+   
cl_ntoh64(inform_info_rec.inform_record.subscriber_gid.unicast.interface_id));
osm_infr_remove_from_db(sa-p_subn, sa-p_log, p_infr);
+   }
 
cl_plock_release(sa-p_lock);
 
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Status of ummunot branch?

2013-06-05 Thread Jeff Squyres (jsquyres)
On Jun 5, 2013, at 12:14 AM, Haggai Eran hagg...@mellanox.com wrote:

 Hmm; I'm confused.  How does this fix the 
 MPI-needs-to-intercept-freed-memory problem?
 Well, there is no problem if an application frees registered memory (in
 an on-demand paging memory region) and that memory is returned to the
 OS. The OS will invalidate these pages, and the HCA will no longer be
 able to use them. This means that the registration cache doesn't have to
 de-register memory immediately when it is freed.


(must... resist... urge... to... throw... furniture...)

This is why features should not be introduced to solve MPI problems without an 
understanding of what the MPI problems are.  :-)  Please go talk to the 
Mellanox MPI team.

Forgive me for being frustrated; memory registration and all the pain that it 
entails was highlighted as ***the #1 problem*** by *5 major MPI 
implementations* at the Sonoma 2009 workshop (see 
https://www.openfabrics.org/resources/document-downloads/presentations/doc_download/301-mpi-update-and-requirements-panel-all-presentations.html,
 starting at slide 7 in the openmpi slide deck).  

Why don't we have something like ummunotify yet?
Why don't we have non-blocking memory registration yet?
...etc.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] libibverbs: A possible solution for allowing arbitrary MTU values.

2013-06-05 Thread Jeff Squyres
Set the IBV_MTU_* enums equal to their values (e.g., IBV_MTU_1024 =
1024), and then pass MTU values around as int's.  Legacy applications
will use the enum values, but newer applications can use any int for
values that do not currently exist in the enum set (e.g., 1500, 9000).

The obvious drawback is that this will break ABI; applications will
need to be recompiled.

(if this approach/patch is acceptable, I will submit a corresponding
patch for the kernel side)

Signed-off-by: Jeff Squyres jsquy...@cisco.com
---
 examples/devinfo.c | 18 +-
 examples/pingpong.c| 12 
 examples/pingpong.h|  1 -
 examples/rc_pingpong.c |  8 
 examples/srq_pingpong.c|  8 
 examples/uc_pingpong.c |  8 
 include/infiniband/verbs.h | 16 
 man/ibv_modify_qp.3|  2 +-
 man/ibv_query_port.3   |  4 ++--
 man/ibv_query_qp.3 |  2 +-
 10 files changed, 33 insertions(+), 46 deletions(-)

diff --git a/examples/devinfo.c b/examples/devinfo.c
index ff078e4..f46deca 100644
--- a/examples/devinfo.c
+++ b/examples/devinfo.c
@@ -111,16 +111,16 @@ static const char *atomic_cap_str(enum ibv_atomic_cap 
atom_cap)
}
 }
 
-static const char *mtu_str(enum ibv_mtu max_mtu)
+static const char *mtu_str(int max_mtu)
 {
-   switch (max_mtu) {
-   case IBV_MTU_256:  return 256;
-   case IBV_MTU_512:  return 512;
-   case IBV_MTU_1024: return 1024;
-   case IBV_MTU_2048: return 2048;
-   case IBV_MTU_4096: return 4096;
-   default:   return invalid MTU;
-   }
+   static char str[16];
+
+   if (max_mtu  0)
+   snprintf(str, sizeof(str), %d, max_mtu);
+   else
+   strncpy(str, invalid MTU, sizeof(str));
+
+   return str;
 }
 
 static const char *width_str(uint8_t width)
diff --git a/examples/pingpong.c b/examples/pingpong.c
index 90732ef..d1c22c9 100644
--- a/examples/pingpong.c
+++ b/examples/pingpong.c
@@ -36,18 +36,6 @@
 #include stdio.h
 #include string.h
 
-enum ibv_mtu pp_mtu_to_enum(int mtu)
-{
-   switch (mtu) {
-   case 256:  return IBV_MTU_256;
-   case 512:  return IBV_MTU_512;
-   case 1024: return IBV_MTU_1024;
-   case 2048: return IBV_MTU_2048;
-   case 4096: return IBV_MTU_4096;
-   default:   return -1;
-   }
-}
-
 uint16_t pp_get_local_lid(struct ibv_context *context, int port)
 {
struct ibv_port_attr attr;
diff --git a/examples/pingpong.h b/examples/pingpong.h
index 9cdc03e..91d217b 100644
--- a/examples/pingpong.h
+++ b/examples/pingpong.h
@@ -35,7 +35,6 @@
 
 #include infiniband/verbs.h
 
-enum ibv_mtu pp_mtu_to_enum(int mtu);
 uint16_t pp_get_local_lid(struct ibv_context *context, int port);
 int pp_get_port_info(struct ibv_context *context, int port,
 struct ibv_port_attr *attr);
diff --git a/examples/rc_pingpong.c b/examples/rc_pingpong.c
index 15494a1..8a5318b 100644
--- a/examples/rc_pingpong.c
+++ b/examples/rc_pingpong.c
@@ -78,7 +78,7 @@ struct pingpong_dest {
 };
 
 static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn,
- enum ibv_mtu mtu, int sl,
+ int mtu, int sl,
  struct pingpong_dest *dest, int sgid_idx)
 {
struct ibv_qp_attr attr = {
@@ -209,7 +209,7 @@ out:
 }
 
 static struct pingpong_dest *pp_server_exch_dest(struct pingpong_context *ctx,
-int ib_port, enum ibv_mtu mtu,
+int ib_port, int mtu,
 int port, int sl,
 const struct pingpong_dest 
*my_dest,
 int sgid_idx)
@@ -547,7 +547,7 @@ int main(int argc, char *argv[])
int  port = 18515;
int  ib_port = 1;
int  size = 4096;
-   enum ibv_mtu mtu = IBV_MTU_1024;
+   int  mtu = 1024;
int  rx_depth = 500;
int  iters = 1000;
int  use_event = 0;
@@ -608,7 +608,7 @@ int main(int argc, char *argv[])
break;
 
case 'm':
-   mtu = pp_mtu_to_enum(strtol(optarg, NULL, 0));
+   mtu = strtol(optarg, NULL, 0);
if (mtu  0) {
usage(argv[0]);
return 1;
diff --git a/examples/srq_pingpong.c b/examples/srq_pingpong.c
index 6e00f8c..f1eb879 100644
--- a/examples/srq_pingpong.c
+++ b/examples/srq_pingpong.c
@@ -81,7 +81,7 @@ struct pingpong_dest {
union ibv_gid gid;
 };
 
-static int pp_connect_ctx(struct pingpong_context *ctx, int port, enum ibv_mtu 
mtu,
+static int pp_connect_ctx(struct pingpong_context *ctx, int 

Re: Status of ummunot branch?

2013-06-05 Thread Haggai Eran
On 05/06/2013 15:45, Jeff Squyres (jsquyres) wrote:
 On Jun 5, 2013, at 12:14 AM, Haggai Eran hagg...@mellanox.com wrote:
 
 Hmm; I'm confused.  How does this fix the 
 MPI-needs-to-intercept-freed-memory problem?
 Well, there is no problem if an application frees registered memory (in
 an on-demand paging memory region) and that memory is returned to the
 OS. The OS will invalidate these pages, and the HCA will no longer be
 able to use them. This means that the registration cache doesn't have to
 de-register memory immediately when it is freed.
 
 
 (must... resist... urge... to... throw... furniture...)
(ducking and taking cover :-) )

 
 This is why features should not be introduced to solve MPI problems without 
 an understanding of what the MPI problems are.  :-)  Please go talk to the 
 Mellanox MPI team.
 
 Forgive me for being frustrated; memory registration and all the pain that it 
 entails was highlighted as ***the #1 problem*** by *5 major MPI 
 implementations* at the Sonoma 2009 workshop (see 
 https://www.openfabrics.org/resources/document-downloads/presentations/doc_download/301-mpi-update-and-requirements-panel-all-presentations.html,
  starting at slide 7 in the openmpi slide deck).  
Perhaps I'm missing something, but I believe ODP deals with the first
two problems in the list (slide 8), even if it doesn't solve them
completely.

You no longer need to do dangerous tricks to catch free, munmap, sbrk.
As I explained above, these operations can work on an ODP MR without
allowing the HCA use the invalidated mappings.

In the future we want to implement an implicit memory region covering
the entire process address space, thus eliminating the need for memory
registration almost completely (you might still want memory
registration, or memory windows, in order to control permissions of
remote operations).

We can also allow fork to work with our implementation. Copy-on-write
will work with ODP regions by invalidating the HCA's page tables before
modifying the pages to be read-only. A page fault from the HCA can then
refill the pages, or even break COW in case of a write.

 Why don't we have something like ummunotify yet?
I think that the problem we are trying to solve is better handled inside
the kernel. If you are going to change the HCA's memory mappings, you'd
have to go through the kernel anyway.

Haggai
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH libmlx4 v6 1/2] libmlx4: Infra-structure changes to support verbs extensions

2013-06-05 Thread Steve Wise

On 6/4/2013 3:01 PM, Steve Wise wrote:

On 6/4/2013 2:46 PM, Hefty, Sean wrote:

+
   #ifdef HAVE_IBV_REGISTER_DRIVER
   static __attribute__((constructor)) void mlx4_register_driver(void)
   {
-ibv_register_driver(mlx4, mlx4_driver_init);
+verbs_register_driver(mlx4, mlx4_driver_init);
+
   }
   #else
Shouldn't ibv_register_driver() need to be called in the lib 
constructor

function if HAVE_IBV_REGISTER_DRIVER is not defined?
?  If HAVE_IBV_REGISTER_DRIVER is not defined, then we can't call 
ibv_register_driver...


I thought HAVE_IBV_REGISTER_DRIVER was something new for deciding if 
the lib should call verbs_register_driver().


We should just remove the HAVE_IBV_... check completely, since with 
this change, libmlx4 requires an updated version of libibverbs.


Ah.  I was thinking it would use the old interface if it was compiled 
against a libibverbs that didn't support the extensions.







So old provider libs will work with the new libibverbs but new provider 
libs will not work with the old libibverbs?   Is there no way around 
this?  That dependency can be painful.




--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] libibverbs: A possible solution for allowing arbitrary MTU values.

2013-06-05 Thread Jason Gunthorpe
On Wed, Jun 05, 2013 at 06:00:23AM -0700, Jeff Squyres wrote:
 Set the IBV_MTU_* enums equal to their values (e.g., IBV_MTU_1024 =
 1024), and then pass MTU values around as int's.  Legacy applications
 will use the enum values, but newer applications can use any int for
 values that do not currently exist in the enum set (e.g., 1500, 9000).
 
 The obvious drawback is that this will break ABI; applications will
 need to be recompiled.

No, this too big of an ABI break, and silent at that..

The IBA values have to continue to be accepted and exported in all
cases so the ABI stays the same, which is what I thought was agreed
on??

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Status of ummunot branch?

2013-06-05 Thread Jeff Squyres (jsquyres)
On Jun 5, 2013, at 6:39 AM, Haggai Eran hagg...@mellanox.com wrote:

 Perhaps I'm missing something, but I believe ODP deals with the first
 two problems in the list (slide 8), even if it doesn't solve them
 completely.

Unfortunately, it does not.  If we could register(0 ... 2^64) and never have to 
worry about registered memory, that might be cool (depending on how that 
actually works) -- more below.

See this blog post that describes the freed registered memory issue:


http://blogs.cisco.com/performance/registered-memory-rma-rdma-and-mpi-implementations/

and consider the following valid user code:

a = malloc(x);// a gets (va=0x100, pa=0x12345) back from malloc
MPI_Send(a, ...); // MPI registers 0x100 for len=x, and saves (0x100,x) in reg 
cache
free(a);
a = malloc(x);// a gets (va=0x100, pa=0x98765) back from malloc
MPI_Send(a, ...); // MPI sees a=0x100 and things that it is already registered
// ...kaboom

In short, MPI has to intercept free/sbrk/whatever so that it can update its 
registration cache.

 In the future we want to implement an implicit memory region covering
 the entire process address space, thus eliminating the need for memory
 registration almost completely (you might still want memory
 registration, or memory windows, in order to control permissions of
 remote operations).

This would be great, as long as it's fast, transparent, and has no subtle 
implementation effects (like causing additional RNR NAKs for pages that are 
still in memory, which, according to your descriptions, it sounds like it 
won't).

 We can also allow fork to work with our implementation. Copy-on-write
 will work with ODP regions by invalidating the HCA's page tables before
 modifying the pages to be read-only. A page fault from the HCA can then
 refill the pages, or even break COW in case of a write.

That would be cool, too.  fork() has been a continuing problem -- solving that 
problem would be wonderful.

If this ODP stuff becomes a new verb, it would be good:

- if these fork-fixing / register-infinite capabilities can be queried at run 
time (maybe on ibv_device_cap_flags?) so that ULPs can know to use this 
functionality
- if driver owners can get a heads up so that they can know to implement it

 Why don't we have something like ummunotify yet?
 I think that the problem we are trying to solve is better handled inside
 the kernel. If you are going to change the HCA's memory mappings, you'd
 have to go through the kernel anyway.

If/when you allow registering all memory, then I think you're right -- the 
MPI-must-intercept-free/sbrk-whatever issue may go away (that's why I started 
this thread asking about register(0 .. 2^64)).  But without that, unless I'm 
missing something, I don't think it solves the MPI-must-catch-free-sbrk-etc. 
issues...?  And therefore, having some kind of ummunotify-like functionality as 
a verb would be a Very Good Thing.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH libmlx4 v6 1/2] libmlx4: Infra-structure changes to support verbs extensions

2013-06-05 Thread Jason Gunthorpe
On Wed, Jun 05, 2013 at 09:54:33AM -0500, Steve Wise wrote:
 Ah.  I was thinking it would use the old interface if it was
 compiled against a libibverbs that didn't support the extensions.

 So old provider libs will work with the new libibverbs but new
 provider libs will not work with the old libibverbs?   Is there no
 way around this?  That dependency can be painful.

providers can use dlopen/dlsym tricks, or perhaps weak symbols to
discover the new libibverbs symbols. Nobody has had an interest in
working on that problem though.

My original thought when putting all this together was that the one
time synchronized update to the extendable interface was manageable.

.. but seeing now that the providers are linking to other new symbols
beyond the init (eg the cmd family) it seems this will be beyond just
a one time thing. :(

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] libibverbs: A possible solution for allowing arbitrary MTU values.

2013-06-05 Thread Jeff Squyres (jsquyres)
On Jun 5, 2013, at 9:46 AM, Jason Gunthorpe jguntho...@obsidianresearch.com 
wrote:

 No, this too big of an ABI break, and silent at that..
 
 The IBA values have to continue to be accepted and exported in all
 cases so the ABI stays the same, which is what I thought was agreed
 on??


Can this go to a libibverbs 2.0, where it would be palatable to have an ABI 
break?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] RDMA/ocrdma: removed use_cnt for queues.

2013-06-05 Thread Roland Dreier
On Wed, Jun 5, 2013 at 1:50 AM, Gottumukkala, Naresh
b.a.l.nraju.gottumukk...@emulex.com wrote:
 Can we get this patch approved from you ? Can you please let us know your 
 feedback ?

Yes, looks fine.  I'll merge it for 3.11.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Status of ummunot branch?

2013-06-05 Thread Jason Gunthorpe
On Wed, Jun 05, 2013 at 04:53:48PM +, Jeff Squyres (jsquyres) wrote:
 On Jun 5, 2013, at 6:39 AM, Haggai Eran hagg...@mellanox.com wrote:
 
  Perhaps I'm missing something, but I believe ODP deals with the first
  two problems in the list (slide 8), even if it doesn't solve them
  completely.
 
 Unfortunately, it does not.  If we could register(0 ... 2^64) and
 never have to worry about registered memory, that might be cool
 (depending on how that actually works) -- more below.
 
 See this blog post that describes the freed registered memory issue:
 
 
 http://blogs.cisco.com/performance/registered-memory-rma-rdma-and-mpi-implementations/
 
 and consider the following valid user code:
 
 a = malloc(x);// a gets (va=0x100, pa=0x12345) back from malloc
 MPI_Send(a, ...); // MPI registers 0x100 for len=x, and saves (0x100,x) in 
 reg cache
 free(a);
 a = malloc(x);// a gets (va=0x100, pa=0x98765) back from malloc
 MPI_Send(a, ...); // MPI sees a=0x100 and things that it is already registered
 // ...kaboom
 
 In short, MPI has to intercept free/sbrk/whatever so that it can
 update its registration cache.

ODP is supposed to completely solve this problem. The HCA's view and
Kernels view of virtual to physical mapping becomes 100% synchronized,
and there is no 'kaboom'. The kernel updates the HCA after the free,
and after the 2nd malloc to 100% match the current virtual memory map
in the process.

MPI still has to register the memory in the first place..

.. and somehow stuff has to be managed to avoid HCA page faults in
   common cases
.. and the feature must be discoverable
.. and and and ..

The biggest issue to me is going to be efficiently prefetching receive
buffers so that RNR acks are avoided in all common cases...

 solves the MPI-must-catch-free-sbrk-etc. issues...?  And therefore,
 having some kind of ummunotify-like functionality as a verb would be
 a Very Good Thing.

AFAIK the ummunotify user space API was nak'd by the core kernel
guys. I got the impression people thought it would be acceptable as a
rdma API, not a general API. So it is waiting on someone to recast the
function within verbs to make progress...

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] libibverbs: A possible solution for allowing arbitrary MTU values.

2013-06-05 Thread Jason Gunthorpe
On Wed, Jun 05, 2013 at 05:01:37PM +, Jeff Squyres (jsquyres) wrote:
 On Jun 5, 2013, at 9:46 AM, Jason Gunthorpe jguntho...@obsidianresearch.com 
 wrote:
 
  No, this too big of an ABI break, and silent at that..
  
  The IBA values have to continue to be accepted and exported in all
  cases so the ABI stays the same, which is what I thought was agreed
  on??
 
 Can this go to a libibverbs 2.0, where it would be palatable to have
 an ABI break?

The concept of a libibverbs 2.0 has been NAK's by pretty much everyone
involved. This is why we are suffering with the complex extension
mechanism.

The mixed approach that was brought up, where values like 1500 were
passed as 1500, and values like 1024 were passed as 3 seemed doable to
me. Did you see a problem with it for your use?

Thoughts:
 - 1024 and 3 both mean 1024, the library must accept both values,
   it should only ever return 3 though.
 - 1500/etc means 1500, the libray can return that.
 - Make a ibv_from/to_mtu inline function to translate from bytes to
   the encoded MTU value.
 - Switch ibv_mtu from a enum to a typedef int ibv_mtu

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] libibverbs: A possible solution for allowing arbitrary MTU values.

2013-06-05 Thread Jeff Squyres (jsquyres)
On Jun 5, 2013, at 10:19 AM, Jason Gunthorpe jguntho...@obsidianresearch.com 
wrote:

 The concept of a libibverbs 2.0 has been NAK's by pretty much everyone
 involved. This is why we are suffering with the complex extension
 mechanism.

Are you saying that libibverbs must always always always be backwards 
compatible, and there will never be an ABI break at any version in the future?

 The mixed approach that was brought up, where values like 1500 were
 passed as 1500, and values like 1024 were passed as 3 seemed doable to
 me. Did you see a problem with it for your use?

It just seems overly complex in terms of implementation.

 Thoughts:
 - 1024 and 3 both mean 1024, the library must accept both values,
   it should only ever return 3 though.

Why?  If the caller can pass in 1024, it seems like 1024 should be able to be 
passed out, too.

 - 1500/etc means 1500, the libray can return that.
 - Make a ibv_from/to_mtu inline function to translate from bytes to
   the encoded MTU value.
 - Switch ibv_mtu from a enum to a typedef int ibv_mtu

That also breaks ABI, doesn't it?

 Jason


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Status of ummunot branch?

2013-06-05 Thread Jeff Squyres (jsquyres)
On Jun 5, 2013, at 10:14 AM, Jason Gunthorpe jguntho...@obsidianresearch.com 
wrote:

 a = malloc(x);// a gets (va=0x100, pa=0x12345) back from malloc
 MPI_Send(a, ...); // MPI registers 0x100 for len=x, and saves (0x100,x) in 
 reg cache
 free(a);
 a = malloc(x);// a gets (va=0x100, pa=0x98765) back from malloc
 MPI_Send(a, ...); // MPI sees a=0x100 and things that it is already 
 registered
 // ...kaboom
 
 ODP is supposed to completely solve this problem. The HCA's view and
 Kernels view of virtual to physical mapping becomes 100% synchronized,
 and there is no 'kaboom'. The kernel updates the HCA after the free,
 and after the 2nd malloc to 100% match the current virtual memory map
 in the process.

Are you saying that the 2nd malloc will magically be registered (with the new 
physical address)?

 AFAIK the ummunotify user space API was nak'd by the core kernel
 guys.

It was NAK'ed by Linus, saying fix your own network stack; this is not needed 
in the general purpose part of the kernel (remember that Roland initially 
developed this as a standalone, non-IB-related kernel module).  

 I got the impression people thought it would be acceptable as a
 rdma API, not a general API. So it is waiting on someone to recast the
 function within verbs to make progress...

'zactly.  Roland has this ummunot branch in his git tree, where he is in the 
middle of incorporating this functionality from the original ummunotify 
standalone kernel module into libibverbs and ibcore.

I started this thread asking the status of that branch.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] libibverbs: A possible solution for allowing arbitrary MTU values.

2013-06-05 Thread Jason Gunthorpe
On Wed, Jun 05, 2013 at 06:02:25PM +, Jeff Squyres (jsquyres) wrote:
 On Jun 5, 2013, at 10:19 AM, Jason Gunthorpe 
 jguntho...@obsidianresearch.com wrote:
 
  The concept of a libibverbs 2.0 has been NAK's by pretty much everyone
  involved. This is why we are suffering with the complex extension
  mechanism.
 
 Are you saying that libibverbs must always always always be
 backwards compatible, and there will never be an ABI break at any
 version in the future?

I won't say never, but this is what people want. Bumping the soname is
seen as too difficult now.

  The mixed approach that was brought up, where values like 1500 were
  passed as 1500, and values like 1024 were passed as 3 seemed doable to
  me. Did you see a problem with it for your use?
 
 It just seems overly complex in terms of implementation.

Right. Preserving the ABI really is complex..

  Thoughts:
  - 1024 and 3 both mean 1024, the library must accept both values,
it should only ever return 3 though.
 
 Why?  If the caller can pass in 1024, it seems like 1024 should be
 able to be passed out, too.

If the caller passes in 1024 then it is probably OK to return 1024,
but you have to keep track of that specially. That seems more complex
than just always returning 3. 3 is guarenteed compatible with all
users.

Old users will test directly against 3.
New users will call ibv_from_mtu which tests against 3 as well.

  - 1500/etc means 1500, the libray can return that.
  - Make a ibv_from/to_mtu inline function to translate from bytes to
the encoded MTU value.
  - Switch ibv_mtu from a enum to a typedef int ibv_mtu
 
 That also breaks ABI, doesn't it?

No, the change from 'enum ibv_mtu' to int is ABI compatible, we have
done those changes in the past. The underlying type for 'enum ibv_mtu'
is well defined by the various ELF ABI documents.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Status of ummunot branch?

2013-06-05 Thread Jason Gunthorpe
On Wed, Jun 05, 2013 at 06:10:11PM +, Jeff Squyres (jsquyres) wrote:
 On Jun 5, 2013, at 10:14 AM, Jason Gunthorpe 
 jguntho...@obsidianresearch.com wrote:
 
  a = malloc(x);// a gets (va=0x100, pa=0x12345) back from malloc
  MPI_Send(a, ...); // MPI registers 0x100 for len=x, and saves (0x100,x) in 
  reg cache
  free(a);
  a = malloc(x);// a gets (va=0x100, pa=0x98765) back from malloc
  MPI_Send(a, ...); // MPI sees a=0x100 and things that it is already 
  registered
  // ...kaboom
  
  ODP is supposed to completely solve this problem. The HCA's view and
  Kernels view of virtual to physical mapping becomes 100% synchronized,
  and there is no 'kaboom'. The kernel updates the HCA after the free,
  and after the 2nd malloc to 100% match the current virtual memory map
  in the process.
 
 Are you saying that the 2nd malloc will magically be registered
 (with the new physical address)?

Yes, that is the whole point.

ODP fundamentally fixes the *bug* where the HCA's view of process
memory can become inconsistent with the kernel's view.

'magically be registered' is the wrong way to think about it - the
registration of VA=0x100 is simply kept, and any change to the
underlying physical mapping of the VA is synchronized with the HCA.

 'zactly.  Roland has this ummunot branch in his git tree, where he
 is in the middle of incorporating this functionality from the
 original ummunotify standalone kernel module into libibverbs and
 ibcore.

Right, this was discussed at the Enterprise Summit a few weeks
ago. I'm sure Roland would welcome patches...

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] libibverbs: A possible solution for allowing arbitrary MTU values.

2013-06-05 Thread Jeff Squyres (jsquyres)
On Jun 5, 2013, at 11:11 AM, Jason Gunthorpe jguntho...@obsidianresearch.com 
wrote:

 I won't say never, but this is what people want. Bumping the soname is
 seen as too difficult now.

Gotcha.  

Ok, so my patch is a non-starter.

 Thoughts:
 - 1024 and 3 both mean 1024, the library must accept both values,
  it should only ever return 3 though.
 
 Why?  If the caller can pass in 1024, it seems like 1024 should be
 able to be passed out, too.
 
 If the caller passes in 1024 then it is probably OK to return 1024,
 but you have to keep track of that specially. That seems more complex
 than just always returning 3. 3 is guarenteed compatible with all
 users.
 
 Old users will test directly against 3.
 New users will call ibv_from_mtu which tests against 3 as well.


Ok.

I'll take a to-do to work up a new patch -- probably not until next week.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Status of ummunot branch?

2013-06-05 Thread Jeff Squyres (jsquyres)
On Jun 5, 2013, at 11:18 AM, Jason Gunthorpe jguntho...@obsidianresearch.com 
wrote:

 Are you saying that the 2nd malloc will magically be registered
 (with the new physical address)?
 
 Yes, that is the whole point.

Interesting.

 ODP fundamentally fixes the *bug* where the HCA's view of process
 memory can become inconsistent with the kernel's view.

Hum.  I was under the impression that with today's code (i.e., not ODP), if you

a = malloc(N);
ibv_reg_mr(..., a, N, ...);
free(a);

(assuming that the memory actually left the process at free)

Then the relevant kernel verbs driver was notified, and would unregister that 
device.  ...but I'm an MPI guy, not a kernel guy -- it seems like you're saying 
that my impression was wrong (which doesn't currently matter because we 
intercept free/sbrk and unregister such memory, anyway).

 'magically be registered' is the wrong way to think about it - the
 registration of VA=0x100 is simply kept, and any change to the
 underlying physical mapping of the VA is synchronized with the HCA.

What happens if you:

a = malloc(N * page_size);
ibv_reg_mr(..., a, N * page_size, ...);
free(a);
// incoming RDMA arrives targeted at buffer a

Or if you:

a = malloc(N * page_size);
ibv_reg_mr(..., a, N * page_size, ...);
free(a);
a = malloc(N / 2 * page_size);
// incoming RDMA arrives targeted at buffer a that is of length (N*page_size)

It does seem quite odd, abstractly speaking, that a registration would survive 
a free/re-malloc (which is arguably a different buffer).

That being said, it still seems like MPI needs a registration cache.  It is 
several good steps forward if we don't need to intercept free/sbrk/whatever, 
but when MPI_Send(buf, ...) is invoked, we still have to check that the entire 
buf is registered.  If ibv_reg_mr(..., 0, 2^64, ...) was supported, that would 
obviate the entire need for registration caches.  That would be wonderful.

 Right, this was discussed at the Enterprise Summit a few weeks
 ago. I'm sure Roland would welcome patches...


That's why I asked at the beginning of this thread.  He didn't provide any 
details about what still needs to be done, though.  :-)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] libibverbs: A possible solution for allowing arbitrary MTU values.

2013-06-05 Thread Hefty, Sean
  The concept of a libibverbs 2.0 has been NAK's by pretty much everyone
  involved. This is why we are suffering with the complex extension
  mechanism.
 
 Are you saying that libibverbs must always always always be backwards
 compatible, and there will never be an ABI break at any version in the future?

I don't think this change is worth breaking the ABI.

But, I have started looking at what a version 2.0 could be.  I have a desire 
to merge the separate libraries (verbs, rdmacm, umad) together; but the 
feedback was that it didn't seem worth it if it simply exported the same APIs.  
So I expanded my scope to unify those APIs, determine the best way to extend 
the verbs cmd APIs (used by the vendor libraries), include things like 
collective operations, support vendor specific calls, etc.  I think you end up 
with a new library, which would need a lot more thought and discussion.

- Sean
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Status of ummunot branch?

2013-06-05 Thread Jason Gunthorpe
On Wed, Jun 05, 2013 at 06:45:13PM +, Jeff Squyres (jsquyres) wrote:

 Hum.  I was under the impression that with today's code (i.e., not ODP), if 
 you
 
 a = malloc(N);
 ibv_reg_mr(..., a, N, ...);
 free(a);
 
 (assuming that the memory actually left the process at free)
 
 Then the relevant kernel verbs driver was notified, and would
 unregister that device.  ...but I'm an MPI guy, not a kernel guy --
 it seems like you're saying that my impression was wrong (which
 doesn't currently matter because we intercept free/sbrk and
 unregister such memory, anyway).

Sadly no, what happens is that once you do ibv_reg_mr that 'HCA
virtual address' is forever tied to the physical memory under the
'process virtual address' *at that moment* forever.

So in the case above, RDMA can continue after the free, and it
continues to hit the same *physical* memory that it always hit, but
due to the free the process has lost access to that memory (the kernel
keeps the physical memory reserved for RDMA purposes until unreg
though).

This is fundamentally why you need to intercept mmap/munmap/sbrk - if
the process's VM mapping is changed through those syscalls then the
HCA's VM and the process VM becomes de-synchronized.

  'magically be registered' is the wrong way to think about it - the
  registration of VA=0x100 is simply kept, and any change to the
  underlying physical mapping of the VA is synchronized with the HCA.
 
 What happens if you:
 
 a = malloc(N * page_size);
 ibv_reg_mr(..., a, N * page_size, ...);
 free(a);
 // incoming RDMA arrives targeted at buffer a

Haggai should comment on this, but my impression/expectation was
you'll get a remote protection fault/

 Or if you:
 
 a = malloc(N * page_size);
 ibv_reg_mr(..., a, N * page_size, ...);
 free(a);
 a = malloc(N / 2 * page_size);
 // incoming RDMA arrives targeted at buffer a that is of length (N*page_size)

again, I expect a remote protection fault.

Noting of course, both of these cases are only true if the underlying
VM is manipulated in a way that makes the pages unmapped (eg
mmap/munmap, not free)

I would also assume that attempts to RDMA write read only pages
protection fault as well.

 It does seem quite odd, abstractly speaking, that a registration
 would survive a free/re-malloc (which is arguably a different
 buffer).

Not at all: the purpose of the registration is to allow access via
RDMA to a portion of the process's address space. The address space
doesn't change, but what it is mapped to can vary.

So - the ODP semantics make much more sense, so much so I'm not sure
we need a ODP flag at all, but that can be discussed when the patches
are proposed...

 That being said, it still seems like MPI needs a registration cache.
 It is several good steps forward if we don't need to intercept
 free/sbrk/whatever, but when MPI_Send(buf, ...) is invoked, we still
 have to check that the entire buf is registered.  If ibv_reg_mr(...,
 0, 2^64, ...) was supported, that would obviate the entire need for
 registration caches.  That would be wonderful.

Yes, except that this shifts around where the registration overhead
ends up. Basically the HCA driver now has the registration cache you
had in MPI, and all the same overheads still exist. No free lunch
here :(

Haggai: A verb to resize a registration would probably be a helpful
step. MPI could maintain one registration that covers the sbrk
region and one registration that covers the heap, much easier than
searching tables and things.

Also bear in mind that all RDMA access protections will be disabled if
you register the entire process VM, the remote(s) can scribble/read
everything..

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] ibsim: Fix PerformanceSet parsing corner case

2013-06-05 Thread Hal Rosenstock
On 2/7/2013 7:45 PM, Albert Chu wrote:
 Parse of attribute did not properly remove whitespace before it.  So
 
 PerformanceSet H-0002c90300325280 PortCounters.SymbolErrorCounter=3
 
 would work but
 
 PerformanceSet H-0002c90300325280  PortCounters.SymbolErrorCounter=3\
 
 would not.
 
 Signed-off-by: Albert Chu ch...@llnl.gov

Thanks. Applied.

-- Hal
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] ibsim: Output error on bad input to PerformanceSet

2013-06-05 Thread Hal Rosenstock
On 2/7/2013 7:45 PM, Albert Chu wrote:
 Signed-off-by: Albert Chu ch...@llnl.gov

Thanks. Applied.

-- Hal
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Status of ummunot branch?

2013-06-05 Thread Jeff Squyres (jsquyres)
On Jun 5, 2013, at 12:05 PM, Jason Gunthorpe jguntho...@obsidianresearch.com 
wrote:

 It does seem quite odd, abstractly speaking, that a registration
 would survive a free/re-malloc (which is arguably a different
 buffer).
 
 Not at all: the purpose of the registration is to allow access via
 RDMA to a portion of the process's address space. The address space
 doesn't change, but what it is mapped to can vary.

I still think it's really weird.  When I do this:

a = malloc(N);
ibv_reg_mr(..., a, N, ...);
free(a);
b = malloc(M);

If b just happens to be partially or wholly registered by some quirk of the 
malloc() system (i.e., some/all of the virtual address space in b happens to 
have been covered by a prior malloc/ibv_reg_mr)... that's just weird.

 If ibv_reg_mr(...,
 0, 2^64, ...) was supported, that would obviate the entire need for
 registration caches.  That would be wonderful.
 
 Yes, except that this shifts around where the registration overhead
 ends up. Basically the HCA driver now has the registration cache you
 had in MPI, and all the same overheads still exist.

There's fewer verbs drivers than applications, right?

 Haggai: A verb to resize a registration would probably be a helpful
 step. MPI could maintain one registration that covers the sbrk
 region and one registration that covers the heap, much easier than
 searching tables and things.

If we still have to register buffers piecemeal, a non-blocking registration 
verb would be quite helpful.

 Also bear in mind that all RDMA access protections will be disabled if
 you register the entire process VM, the remote(s) can scribble/read
 everything..


No problem for MPI/HPC...  :-)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Status of ummunot branch?

2013-06-05 Thread Haggai Eran
On 05/06/2013 22:05, Jason Gunthorpe wrote:
 On Wed, Jun 05, 2013 at 06:45:13PM +, Jeff Squyres (jsquyres) wrote:
 What happens if you:

 a = malloc(N * page_size);
 ibv_reg_mr(..., a, N * page_size, ...);
 free(a);
 // incoming RDMA arrives targeted at buffer a
 
 Haggai should comment on this, but my impression/expectation was
 you'll get a remote protection fault/
 
 Or if you:

 a = malloc(N * page_size);
 ibv_reg_mr(..., a, N * page_size, ...);
 free(a);
 a = malloc(N / 2 * page_size);
 // incoming RDMA arrives targeted at buffer a that is of length (N*page_size)
 
 again, I expect a remote protection fault.
 
 Noting of course, both of these cases are only true if the underlying
 VM is manipulated in a way that makes the pages unmapped (eg
 mmap/munmap, not free)

That's right. If pages are unmapped and a remote operation tries to
access them the QP will be closed with a protection error.

 
 I would also assume that attempts to RDMA write read only pages
 protection fault as well.
Right.

 Haggai: A verb to resize a registration would probably be a helpful
 step. MPI could maintain one registration that covers the sbrk
 region and one registration that covers the heap, much easier than
 searching tables and things.

That's a nice idea. Even without this verb, I think it is possible to
develop a registration cache that covers those regions though. When you
find out you have some part of your region not registered, you can
register a new, larger region that covers everything you need. For new
operations you only use the newer region. Once the previous, smaller
region is not used, you de-register it.

What do you think?

Haggai

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html