I can confirm that neither leak is causing my imb hang. Unless there is another
frag leak somewhere (haven't found one) the lockup was simply due to running
out of registered memory. So, I see no need to push for a 1.4.6 unless a btl
other than ugni hits the bug.
Setting an rcache limit doesn't eliminate the hang. I will continue to
investigate next week.
-Nathan
On Thu, 1 Mar 2012, Jeffrey Squyres wrote:
...or in 1.5.5.
How soon will you be able to tell if it fixes some hangs?
On Mar 1, 2012, at 10:56 AM, Nathan Hjelm wrote:
Found a pretty nasty frag leak (and a minor one) in ob1 (see commit below). If
this fix addresses some hangs we are seeing on infiniband LANL might want a
1.4.6 rolled (or a faster rollout for 1.6.0).
-Nathan
---------- Forwarded message ----------
Date: Thu, 1 Mar 2012 08:53:39 -0700
From: hje...@osl.iu.edu
Reply-To: de...@open-mpi.org
To: s...@open-mpi.org
Subject: [OMPI svn] svn:open-mpi r26077
Author: hjelmn
Date: 2012-03-01 10:53:39 EST (Thu, 01 Mar 2012)
New Revision: 26077
URL: https://svn.open-mpi.org/trac/ompi/changeset/26077
Log:
ob1: fix two fragment leaks
- MAJOR! get src descriptor leaks if mca_bml_base_send fails
- minor. descriptor leaked in mca_pml_send_request_start_copy if the btl
returns OMPI_ERR_RESOURCE_BUSY.
Text files modified:
trunk/ompi/mca/pml/ob1/pml_ob1_sendreq.c | 27 ++++++++++++++++-----------
1 files changed, 16 insertions(+), 11 deletions(-)
Modified: trunk/ompi/mca/pml/ob1/pml_ob1_sendreq.c
==============================================================================
--- trunk/ompi/mca/pml/ob1/pml_ob1_sendreq.c (original)
+++ trunk/ompi/mca/pml/ob1/pml_ob1_sendreq.c 2012-03-01 10:53:39 EST (Thu,
01 Mar 2012)
@@ -1,3 +1,4 @@
+/* -*- Mode: C; c-basic-offset:4 ; indent-tabs-mode:nil -*- */
/*
* Copyright (c) 2004-2005 The Trustees of Indiana University and Indiana
* University Research and Technology
@@ -12,6 +13,8 @@
* Copyright (c) 2008 UT-Battelle, LLC. All rights reserved.
* Copyright (c) 2010 Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2012 NVIDIA Corporation. All rights reserved.
+ * Copyright (c) 2012 Los Alamos National Security, LLC. All rights
+ * reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -546,15 +549,14 @@
}
return OMPI_SUCCESS;
}
- switch(OPAL_SOS_GET_ERROR_CODE(rc)) {
- case OMPI_ERR_RESOURCE_BUSY:
- /* No more resources. Allow the upper level to queue the send */
- rc = OMPI_ERR_OUT_OF_RESOURCE;
- break;
- default:
- mca_bml_base_free(bml_btl, des);
- break;
+
+ if (OMPI_ERR_RESOURCE_BUSY == OPAL_SOS_GET_ERROR_CODE(rc)) {
+ /* No more resources. Allow the upper level to queue the send */
+ rc = OMPI_ERR_OUT_OF_RESOURCE;
}
+
+ mca_bml_base_free (bml_btl, des);
+
return rc;
}
@@ -631,7 +633,7 @@
* operation is achieved.
*/
- mca_btl_base_descriptor_t* des;
+ mca_btl_base_descriptor_t *des, *src = NULL;
mca_btl_base_segment_t* segment;
mca_pml_ob1_hdr_t* hdr;
bool need_local_cb = false;
@@ -640,7 +642,6 @@
bml_btl = sendreq->req_rdma[0].bml_btl;
if((sendreq->req_rdma_cnt == 1) && (bml_btl->btl_flags & (MCA_BTL_FLAGS_GET
| MCA_BTL_FLAGS_CUDA_GET))) {
mca_mpool_base_registration_t* reg = sendreq->req_rdma[0].btl_reg;
- mca_btl_base_descriptor_t* src;
size_t i;
size_t old_position =
sendreq->req_send.req_base.req_convertor.bConverted;
@@ -781,6 +782,10 @@
return OMPI_SUCCESS;
}
mca_bml_base_free(bml_btl, des);
+ if (NULL != src) {
+ mca_bml_base_free (bml_btl, src);
+ }
+
return rc;
}
@@ -1144,7 +1149,7 @@
0,
&frag->rdma_length,
MCA_BTL_DES_FLAGS_BTL_OWNERSHIP |
- MCA_BTL_DES_FLAGS_PUT,
+ MCA_BTL_DES_FLAGS_PUT,
&des );
if( OPAL_UNLIKELY(NULL == des) ) {
_______________________________________________
svn mailing list
s...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel