Re: [OMPI devel] iof / oob issues
BTW, the fix didn't occur over the weekend because of some merging issues.

I also didn't explain the problem well; you may see some clipped output from your program, or the orted may hang while everything is shutting down. This is especially likely to occur for very short applications.

The problem is actually in the oob; the orted gets into a state where it's waiting for some IOF OOB callbacks to occur for messages that were already successfully sent, but the callbacks never occur due to... well, it's a long story. The IOF is basically spinning during the orted shutdown, waiting for pending OOB callbacks that will never occur.

I can explain in more detail if anyone cares, but hopefully Brian will be able to work the fix in within the next few days.

On Jul 13, 2007, at 5:04 PM, Jeff Squyres wrote:

> FYI: there is an issue on the OMPI trunk right now that the tail end
> of output from applications may get clipped.  The fix is coming this
> weekend.  If you care, I'll explain, but I just wanted to give everyone
> a heads up that if you see the tail end of your stdout/stderr not show
> up, it's probably not your fault.  :-)
>
> --
> Jeff Squyres
> Cisco Systems

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] iof / oob issues
Just to further clarify the clarification... ;-)

This condition has existed for the last several months. The root problem dates at least back into the 1.1 series. We chased the problem down to the iof_flush call in the odls when a process terminates in something like Jan or Feb this year, at which point we #if 0'd the iof_flush out of the code pending a resolution (tickets were filed, emails flew, phone calls ensued - it just took a while for people to have time to deal with it).

It is still "on" in 1.2 - it has just been turned "off" in the trunk for months. [Actually, I did turn it back on briefly following r15390. It turned out the timing changed just enough to make it work most of the time with things that called orte_finalize, but it always failed for programs that didn't, so we turned it back off again.]

So the problem of having clipped output has been around for quite some time. Since only Galen ever commented to me about being impacted by it, I gather nobody has really noticed. ;-)

Hopefully, we'll be able to turn it back on again soon.

On 7/18/07 6:02 AM, "Jeff Squyres" wrote:

> BTW, the fix didn't occur over the weekend because of some merging
> issues.
>
> I also didn't explain the problem well; you may see some clipped
> output from your program or the orted may hang while everything is
> shutting down.  This is especially likely to occur for very short
> applications.
>
> The problem is actually in the oob; the orted gets into a state where
> it's waiting for some IOF OOB callbacks to occur for messages that
> were already successfully sent, but the callbacks never occur due
> to... well, it's a long story.  The IOF is basically spinning during
> the orted shutdown waiting for pending OOB callbacks that will never
> occur.
>
> I can explain in more detail if anyone cares, but hopefully Brian
> will be able to work the fix in within the next few days.
>
> On Jul 13, 2007, at 5:04 PM, Jeff Squyres wrote:
>
>> FYI: there is an issue on the OMPI trunk right now that the tail
>> end of output from applications may get clipped.  The fix is coming
>> this weekend.  If you care, I'll explain, but I just wanted to give
>> everyone a heads up that if you see the tail end of your
>> stdout/stderr not show up, it's probably not your fault.  :-)
>>
>> --
>> Jeff Squyres
>> Cisco Systems
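To make the failure mode described above concrete, here is a minimal standalone sketch of the pattern (hypothetical names, not the actual ORTE/IOF code): the flush path spins waiting on a pending-send counter that only the OOB send-completion callback decrements, and in the broken case that callback is never delivered for a message that was already sent.

    /* Sketch only: the counter is decremented solely from the OOB send-
     * completion callback.  If the OOB never delivers that callback for a
     * message already on the wire, the flush loop below spins forever and
     * the orted appears to hang during shutdown. */
    #include <stdio.h>

    static volatile int pending_oob_sends = 1;   /* one send still "in flight" */

    /* In the real code this would be the registered OOB callback; in the
     * broken case it is simply never invoked. */
    static void oob_send_complete(void *cbdata)
    {
        (void) cbdata;
        pending_oob_sends--;
    }

    static void iof_flush_sketch(void)
    {
        while (pending_oob_sends > 0) {
            /* spin the progress engine, waiting for completions */
        }
    }

    int main(void)
    {
        (void) oob_send_complete;    /* never called: that's the bug */
        iof_flush_sketch();          /* never returns */
        printf("flushed\n");
        return 0;
    }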
[OMPI devel] LD_LIBRARY_PATH and process launch on a head node
Hi,

With current trunk LD_LIBRARY_PATH is not set for ranks that are launched on the head node. This worked previously.

--
Gleb.
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
On Wed, Jul 18, 2007 at 04:27:15PM +0300, Gleb Natapov wrote:
> Hi,
>
> With current trunk LD_LIBRARY_PATH is not set for ranks that are
> launched on the head node. This worked previously.
>
Some more info: I use the rsh pls.

elfit1# /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1 env | grep LD_LIBRARY_PATH

gives nothing. The strange thing that I just found is that this one works:

elfit1# /home/glebn/openmpi/bin/mpirun -np 1 -H localhost env | grep LD_LIBRARY_PATH
LD_LIBRARY_PATH=/home/glebn/openmpi/lib

--
Gleb.
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
I believe that was fixed in r15405 - are you at that rev level?

On 7/18/07 7:27 AM, "Gleb Natapov" wrote:

> Hi,
>
> With current trunk LD_LIBRARY_PATH is not set for ranks that are
> launched on the head node. This worked previously.
>
> --
> Gleb.
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
On Wed, Jul 18, 2007 at 07:48:17AM -0600, Ralph H Castain wrote:
> I believe that was fixed in r15405 - are you at that rev level?

I am on the latest revision.

> On 7/18/07 7:27 AM, "Gleb Natapov" wrote:
>
> > Hi,
> >
> > With current trunk LD_LIBRARY_PATH is not set for ranks that are
> > launched on the head node. This worked previously.
> >
> > --
> > Gleb.

--
Gleb.
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
It works for me in both cases, provided I give the fully qualified host name for your first example. In other words, these work:

pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host localhost printenv | grep LD
[pn1180961.lanl.gov:22021] [0.0] test of print_name
OLDPWD=/Users/rhc/openmpi
LD_LIBRARY_PATH=/Users/rhc/openmpi/lib:/Users/rhc/lib:/opt/local/lib:/usr/local/lib:

pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961.lanl.gov printenv | grep LD
[pn1180961.lanl.gov:22012] [0.0] test of print_name
OLDPWD=/Users/rhc/openmpi
LD_LIBRARY_PATH=/Users/rhc/openmpi/lib:/Users/rhc/lib:/opt/local/lib:/usr/local/lib:

But this will lock up:

pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep LD

The reason is that the hostname in this last command doesn't match the hostname I get when I query my interfaces, so mpirun thinks it must be a remote host - and so we sit in ssh until that times out. That could be quick on your machine, but it takes a while for me.

Hope that helps
Ralph

On 7/18/07 7:45 AM, "Gleb Natapov" wrote:

> On Wed, Jul 18, 2007 at 04:27:15PM +0300, Gleb Natapov wrote:
>> Hi,
>>
>> With current trunk LD_LIBRARY_PATH is not set for ranks that are
>> launched on the head node. This worked previously.
>>
> Some more info: I use the rsh pls.
>
> elfit1# /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1 env | grep
> LD_LIBRARY_PATH
>
> gives nothing. The strange thing that I just found is that this one works:
>
> elfit1# /home/glebn/openmpi/bin/mpirun -np 1 -H localhost env | grep
> LD_LIBRARY_PATH
> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
>
> --
> Gleb.
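The symptom lines up with how the launcher decides what is "local": roughly, the target name is compared against the node's own hostname and the names its interfaces resolve to, and anything that does not match is treated as remote and handed to ssh. A standalone sketch of that idea using plain POSIX calls (not the actual ORTE code):

    /* Sketch only: decide whether a -host argument refers to the local node
     * by comparing it against the local hostname and the names resolved for
     * each local IPv4 interface. */
    #include <ifaddrs.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static bool host_is_local(const char *host)
    {
        char myname[256];
        if (0 == gethostname(myname, sizeof(myname)) &&
            0 == strcmp(host, myname)) {
            return true;                      /* matches the node's own hostname */
        }

        struct ifaddrs *ifs = NULL, *ifa;
        bool local = false;
        if (0 != getifaddrs(&ifs)) {
            return false;
        }
        for (ifa = ifs; NULL != ifa; ifa = ifa->ifa_next) {
            char name[NI_MAXHOST];
            if (NULL == ifa->ifa_addr || AF_INET != ifa->ifa_addr->sa_family) {
                continue;
            }
            if (0 == getnameinfo(ifa->ifa_addr, sizeof(struct sockaddr_in),
                                 name, sizeof(name), NULL, 0, 0) &&
                0 == strcmp(host, name)) {
                local = true;                 /* resolves to a local interface */
                break;
            }
        }
        freeifaddrs(ifs);
        return local;                         /* false => would be handed to ssh */
    }

    int main(int argc, char **argv)
    {
        const char *host = (argc > 1) ? argv[1] : "localhost";
        printf("%s is %s\n", host,
               host_is_local(host) ? "local" : "remote (would use ssh)");
        return 0;
    }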
[OMPI devel] optional fortran datatype fixes: 1.2.4?
Rainer --

Did you want to get r14818 and r15137 into 1.2.4?  There's no CMR for them.  Here's your commit messages:

r14818:
- The optional Fortran datatypes may not be available.
  Do not initialize them, if not.
  If initializing them, check for the correct C-equivalent type to copy from...
  Issue a warning, when a type (e.g. REAL*16) is not available to build the type (here COMPLEX*32).
  This fixes issues with ompi and pacx.

  Works with intel-compiler and FCFLAGS="-i8 -r8" on ia32.

r15137:
- Add the missing parts: add MPI_REAL2 to the end of the list of Fortran
  datatypes (mpif-common.h) and the list of registered datatypes: MOOG(REAL2).
  Configure and Compilation with ia32/gcc just finished, naturally without real2.

--
Jeff Squyres
Cisco Systems
[OMPI devel] MPI_BOTTOM fixes: 1.2.4?
Rainer / George --

You guys made some fixes for MPI_BOTTOM et al. recently; did you want them in v1.2.4?  There's no CMR.  I *think* the changes span the following commits:

https://svn.open-mpi.org/trac/ompi/changeset/15129
https://svn.open-mpi.org/trac/ompi/changeset/15030

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] optional fortran datatype fixes: 1.2.4?
Sorry, I should have included links to the commits in question:

https://svn.open-mpi.org/trac/ompi/changeset/14818
https://svn.open-mpi.org/trac/ompi/changeset/15137

On Jul 18, 2007, at 11:46 AM, Jeff Squyres wrote:

> Rainer --
>
> Did you want to get r14818 and r15137 into 1.2.4?  There's no CMR for
> them.  Here's your commit messages:
>
> r14818:
> - The optional Fortran datatypes may not be available.
>   Do not initialize them, if not.
>   If initializing them, check for the correct C-equivalent type to copy from...
>   Issue a warning, when a type (e.g. REAL*16) is not available to build
>   the type (here COMPLEX*32).
>   This fixes issues with ompi and pacx.
>
>   Works with intel-compiler and FCFLAGS="-i8 -r8" on ia32.
>
> r15137:
> - Add the missing parts: add MPI_REAL2 to the end of the list of Fortran
>   datatypes (mpif-common.h) and the list of registered datatypes:
>   MOOG(REAL2).
>   Configure and Compilation with ia32/gcc just finished, naturally
>   without real2.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] optional fortran datatype fixes: 1.2.4?
Hi Jeff,

r14818: yes --- but there have otherwise not been any requests for this patch...
r15137: no, we agreed to put it into 1.3.

Nevertheless, I posted a CMR for r14818; it does apply cleanly in 1.2-svn.

Thanks,
Rainer

On Wednesday 18 July 2007 17:46, Jeff Squyres wrote:
> Rainer --
>
> Did you want to get r14818 and r15137 into 1.2.4?  There's no CMR for
> them.  Here's your commit messages:
>
> r14818:
> - The optional Fortran datatypes may not be available
>   Do not initialize them, if not.
>   If initializing them, check for the correct C-equivalent type
>   to copy from...
>   Issue a warning, when a type (e.g. REAL*16) is not available to
>   build the type (here COMPLEX*32).
>   This fixes issues with ompi and pacx.
>
>   Works with intel-compiler and FCFLAGS="-i8 -r8" on ia32.
>
> r15137:
> - Add the missing parts: add MPI_REAL2 to the end of the list
>   of Fortran datatypes (mpif-common.h) and the list of registered
>   datatypes: MOOG(REAL2).
>   Configure and Compilation with ia32/gcc just finished, naturally
>   without real2.

--
Dipl.-Inf. Rainer Keller        http://www.hlrs.de/people/keller
High Performance Computing      Tel: ++49 (0)711-685 6 5858
Center Stuttgart (HLRS)         Fax: ++49 (0)711-685 6 5832
POSTAL: Nobelstrasse 19         email: kel...@hlrs.de
ACTUAL: Allmandring 30, R.O.030 AIM: rusraink
70550 Stuttgart
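For readers who have not looked at r14818: the gist is to build an optional Fortran type only when the compiler actually provides the underlying kind, and otherwise leave it unset and warn. A rough sketch of that shape (all names here are placeholders, not the actual OMPI symbols):

    /* Placeholder names throughout -- this only illustrates the guard
     * described in the r14818 commit message. */
    #include <stdio.h>

    #define HAVE_FORTRAN_REAL16 0      /* pretend this compiler lacks REAL*16 */

    static void init_optional_type(const char *name, int size_bytes)
    {
        /* stand-in for the real datatype-engine initialization */
        printf("initializing %s (%d bytes)\n", name, size_bytes);
    }

    static void init_optional_complex32(void)
    {
    #if HAVE_FORTRAN_REAL16
        /* REAL*16 exists, so COMPLEX*32 can be built from two of them */
        init_optional_type("MPI_COMPLEX32", 2 * 16);
    #else
        /* No REAL*16: leave MPI_COMPLEX32 uninitialized and say so */
        fprintf(stderr, "warning: REAL*16 not supported by this compiler; "
                        "MPI_COMPLEX32 will not be available\n");
    #endif
    }

    int main(void)
    {
        (void) init_optional_type;     /* used only when REAL*16 exists */
        init_optional_complex32();
        return 0;
    }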
Re: [OMPI devel] MPI_BOTTOM fixes: 1.2.4?
Hi Jeff,

I just checked the mails with Daniel/George from back then. Yes, both would be required, as stated in r15129; they should apply cleanly (except for NEWS).

Thanks,
Rainer

On Wednesday 18 July 2007 17:48, Jeff Squyres wrote:
> Rainer / George --
>
> You guys made some fixes for MPI_BOTTOM et al. recently; did you want
> them in v1.2.4?  There's no CMR.  I *think* the changes span the
> following commits:
>
> https://svn.open-mpi.org/trac/ompi/changeset/15129
> https://svn.open-mpi.org/trac/ompi/changeset/15030

--
Dipl.-Inf. Rainer Keller        http://www.hlrs.de/people/keller
High Performance Computing      Tel: ++49 (0)711-685 6 5858
Center Stuttgart (HLRS)         Fax: ++49 (0)711-685 6 5832
POSTAL: Nobelstrasse 19         email: kel...@hlrs.de
ACTUAL: Allmandring 30, R.O.030 AIM: rusraink
70550 Stuttgart
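As a reminder of what this code path exercises: MPI_BOTTOM is the buffer argument used when a datatype's displacements are absolute addresses. A small self-contained example of that usage (standard MPI, not tied to the specific fix):

    /* Standard MPI_BOTTOM usage: describe variables by their absolute
     * addresses and communicate them "relative to address zero". */
    #include <mpi.h>
    #include <stdio.h>

    static MPI_Datatype make_abs_type(int *i, double *d)
    {
        MPI_Aint     disp[2];
        int          len[2]   = { 1, 1 };
        MPI_Datatype types[2] = { MPI_INT, MPI_DOUBLE };
        MPI_Datatype newtype;

        MPI_Get_address(i, &disp[0]);          /* absolute addresses, */
        MPI_Get_address(d, &disp[1]);          /* no base subtracted  */
        MPI_Type_create_struct(2, len, disp, types, &newtype);
        MPI_Type_commit(&newtype);
        return newtype;
    }

    int main(int argc, char **argv)
    {
        int    a = 42,   c = 0;
        double b = 3.14, d = 0.0;

        MPI_Init(&argc, &argv);

        MPI_Datatype sendtype = make_abs_type(&a, &b);
        MPI_Datatype recvtype = make_abs_type(&c, &d);

        /* The buffer argument is MPI_BOTTOM because the displacements in
         * the datatypes are already absolute addresses. */
        MPI_Sendrecv(MPI_BOTTOM, 1, sendtype, 0, 0,
                     MPI_BOTTOM, 1, recvtype, 0, 0,
                     MPI_COMM_SELF, MPI_STATUS_IGNORE);

        printf("received c=%d d=%g\n", c, d);

        MPI_Type_free(&sendtype);
        MPI_Type_free(&recvtype);
        MPI_Finalize();
        return 0;
    }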
Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node
On Wed, Jul 18, 2007 at 09:08:47AM -0600, Ralph H Castain wrote:
> But this will lock up:
>
> pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep LD
>
> The reason is that the hostname in this last command doesn't match the
> hostname I get when I query my interfaces, so mpirun thinks it must be a
> remote host - and so we sit in ssh until that times out. That could be
> quick on your machine, but it takes a while for me.
>
This is not my case. mpirun resolves the hostname and runs env, but LD_LIBRARY_PATH is not there. If I use the full name, like this:

# /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1.voltaire.com env | grep LD_LIBRARY_PATH
LD_LIBRARY_PATH=/home/glebn/openmpi/lib

everything is OK.

--
Gleb.
Re: [OMPI devel] devel Digest, Vol 802, Issue 1
Good suggestion; increasing the timeout to somewhere around 12 allowed the job to finish. Initial experimentation showed that I could get a factor of 3-4x improvement in performance using even larger timeouts, matching the times for 64 processes and 1/4 the data set. The cluster is presently having scheduler issues; I'll post again if I find anything else interesting.

Thanks-
-Neil

> Date: Tue, 17 Jul 2007 10:14:44 +0300
> From: "Pavel Shamis (Pasha)"
> Subject: Re: [OMPI devel] InfiniBand timeout errors
> To: Open MPI Developers
> Message-ID: <469c6c64.4040...@dev.mellanox.co.il>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hi,
> Try to increase the IB timeout parameter: --mca btl_mvapi_ib_timeout 14
> If 14 does not work, try to increase it a little bit more (16).
>
> Thanks,
> Pasha
>
> Neil Ludban wrote:
> > Hi,
> >
> > I'm getting the errors below when calling MPI_Alltoallv() as part of
> > a matrix transpose operation. It's 100% repeatable when testing with
> > 16M matrix elements divided between 64 processes on 32 dual core nodes.
> > There are never any errors with fewer processes or elements, including
> > the same 32 nodes with only one process per node. If anyone wants
> > any additional information or has suggestions to try, please let me
> > know. Otherwise, I'll have the system rebooted and hope the problem
> > goes away.
> >
> > -Neil
> >
> > [0,1,7][btl_mvapi_component.c:854:mca_btl_mvapi_component_progress]
> > from c065 to: c077 [0,1,3][btl_mvapi_component.c:854:
> > mca_btl_mvapi_component_progress] from c069 error polling HP
> > CQ with status VAPI_RETRY_EXC_ERR status number 12 for Frag :
> > 0x2ab6590200 to: c078 error polling HP CQ with status
> > VAPI_RETRY_EXC_ERR status number 12 for Frag : 0x2ab61f6380
> > --
> > The retry count is a down counter initialized on creation of the QP. Retry
> > count is defined in the InfiniBand Spec 1.2 (12.7.38):
> > The total number of times that the sender wishes the receiver to retry
> > timeout, packet sequence, etc. errors before posting a completion error.
> >
> > Note that two mca parameters are involved here:
> > btl_mvapi_ib_retry_count - The number of times the sender will attempt to
> > retry (defaulted to 7, the maximum value).
> >
> > btl_mvapi_ib_timeout - The local ack timeout parameter (defaulted to 10). The
> > actual timeout value used is calculated as:
> > (4.096 micro-seconds * 2^btl_mvapi_ib_timeout).
> > See InfiniBand Spec 1.2 (12.7.34) for more details.
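To put numbers on the formula quoted at the bottom (4.096 microseconds * 2^btl_mvapi_ib_timeout): the default of 10 works out to roughly 4 ms per retry attempt, 12 to roughly 17 ms, and 14 to roughly 67 ms, each then multiplied by the retry count. A quick sketch of the arithmetic:

    /* Worked example of the local ACK timeout formula from the quoted text:
     * effective timeout = 4.096 microseconds * 2^btl_mvapi_ib_timeout */
    #include <stdio.h>

    int main(void)
    {
        const double base_us = 4.096;                 /* from the IB spec       */
        const int settings[] = { 10, 12, 14, 16 };    /* values discussed above */

        for (int i = 0; i < 4; i++) {
            double timeout_ms = base_us * (double)(1u << settings[i]) / 1000.0;
            printf("btl_mvapi_ib_timeout=%2d -> ~%.1f ms per retry\n",
                   settings[i], timeout_ms);
        }
        return 0;
    }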
Re: [OMPI devel] Fwd: lsf support / farm use models
hi,

first of all, thanks for the info bill! i think i'm really starting to piece things together now.

you are right in that i'm working with a 6.x (6.2 with 6.1 devel libs ;) install here at cadence, without the HPC extensions AFAIK. also, i think that our customers are mostly in the same position -- i assume that the HPC extensions cost extra? or perhaps admins just don't bother to install them.

so, there are at least three cases to consider:

  LSF 7.0 or greater
  LSF 6.x w/ HPC
  LSF 6.x 'base'

i'll try to gather more data, but my feeling is that the market penetration of both HPC and LSF 7.0 is low in our market (EDA vendors and customers). i'd love to just stall until 7.0 is widely available, but perhaps in the mean time it would be nice to have some backward support for LSF 6.0 'base'.

it seems like supporting LSF 6.x w/ HPC might not be too useful, since:

a) it's not clear that the 'built in' "bsub -n N -a openmpi foo" support will work with an MPI-2 dynamic-spawning application like mine (or does it?),

b) i've heard that manually interfacing with the parallel application manager directly is tricky?

c) most importantly, it's not clear that any of our customers have the HPC support, and certainly not all of them, so i need to support LSF 6.0 'base' anyway -- it only needs to work until 7.0 is widely available (< 1 year? i really have no idea ... will Platform end support for 6.x at some particular time? or otherwise push customers to upgrade? perhaps cadence can help there too ...).

under LSF 7.0 it looks like things are okay and that open-mpi will support it in a released version 'soon' (< 6 months?). that's sooner than our customers will have LSF 7.0 anyway, so that's fine.

as for LSF 6.0 'base', there are two workarounds that i see, and a couple of key questions that remain:

1) use bsub -n N, followed by N-1 ls_rtaske() calls (or similar). while ls_rtaske() may not 'force' me to follow the queuing rules, if i only launch on the proper machines, i should be okay, right? i don't think IO and process marshaling (i'm not sure exactly what you mean by that) are a problem, since openmpi/orted handles those issues, i think?

2) use only bsub's of single processes, using some initial wrapper script that bsub's all the jobs (master + N-1 slaves) needed to reach the desired static allocation for openmpi. this seems to be what my internal guy is suggesting is 'required'. integration with openmpi might not be too hard, using suitable trickery. for example, the wrapper script launches some wrapper processes that are basically rexec daemons (a rough sketch of such a daemon appears after the quoted text below). the master waits for them to come up in the ras/lsf component (tcp notify, perhaps via the launcher machine to avoid needing to know the master hostname a priori), and then the pls/lsf component uses the thin rexec daemons to launch orted. seems like a bit of a silly workaround, but it does seem to both keep the queuing system happy and not need ls_rtaske() or similar.

[ Note: (1) will fail if admins disable the ls_rexec() type of functionality, but on a LSF 6.0 'base' system, this would seem to disable all || job launching -- i.e. the shipped mpijob/pvmjob all use lsgrun and such, so they would be disabled -- is there any other way i could start the sub-processes within my allocation in that case? can i just have bsub start N copies of something (maybe orted?)? that seems like it might be hard to integrate with openmpi, though -- in that case, i'd probably just implement option (2) only. ]

Matt.

On 7/17/07, Bill McMillan wrote:
> there appear to be some overlaps between the ls_* and lsb_* functions,
> but they seem basically compatible as far as i can tell. almost all
> the functions have a command line version as well, for example:
> lsb_submit()/bsub

Like openmpi and orte, there are two layers in LSF. The ls_* API's talk to what is/was historically called "LSF Base" and the lsb_* API's talk to what is/was historically called "LSF Batch".

[SNIP]

Regards,
Bill

-
Bill McMillan
Principal Technical Product Manager
Platform Computing
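A very rough illustration of the "thin rexec daemon" mentioned in option (2) above -- purely hypothetical, nothing like this exists in the tree; it only shows how little such a helper would need to do (accept one command line over TCP and exec it, e.g. an orted command line):

    /* Hypothetical "thin rexec daemon": each bsub'd slave job runs this,
     * reports its port back to the master by some out-of-band means, then
     * accepts one command line and exec's it. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int srv = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = 0;                         /* let the kernel pick a port */

        if (srv < 0 || bind(srv, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
            listen(srv, 1) != 0) {
            perror("rexec daemon setup");
            return 1;
        }

        socklen_t len = sizeof(addr);
        getsockname(srv, (struct sockaddr *)&addr, &len);
        /* In the real workaround this port (and the host) would be reported
         * back to the master, e.g. via a connect-back to the launcher. */
        printf("listening on port %d\n", ntohs(addr.sin_port));
        fflush(stdout);

        int conn = accept(srv, NULL, NULL);
        if (conn < 0) { perror("accept"); return 1; }

        char cmdline[4096];
        ssize_t n = read(conn, cmdline, sizeof(cmdline) - 1);
        if (n <= 0) return 1;
        cmdline[n] = '\0';                         /* one command line */

        /* Hand the whole line to the shell; a real implementation would
         * parse argv itself and avoid the shell. */
        execl("/bin/sh", "sh", "-c", cmdline, (char *)NULL);
        perror("execl");
        return 1;
    }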
[OMPI devel] pathscale compilers and TLS
Crud. The Pathscale 3.0 compilers do not support thread-local data. This is what we've been fighting with https://svn.open-mpi.org/trac/ompi/ticket/1025; QLogic just told us last week that their compiler does not support TLS (even though OMPI was not currently using it, glibc does, and something calls abort() deep within pthread_exit(NULL)). If you don't use the TLS glibc, everything works fine, but the TLS glibc is the default on many Linux systems. QLogic is looking into the problem and said they will get back to us (I'm unwilling to do horrid LD_PRELOAD tricks to get the non-TLS glibc, etc.).

I'm guessing that this change will guarantee to make the pathscale 3.0 compilers not work at all.

Is this change just to fix a memory leak? If so, could we add a configure test to see if the compiler is broken w.r.t. TLS? (I know, I know... :-( )

On Jul 18, 2007, at 4:25 PM, brbar...@osl.iu.edu wrote:

Author: brbarret
Date: 2007-07-18 16:25:01 EDT (Wed, 18 Jul 2007)
New Revision: 15494
URL: https://svn.open-mpi.org/trac/ompi/changeset/15494

Log:
Use thread specific data and static buffers for the return type of
opal_net_get_hostname() rather than malloc, because no one was freeing
the buffer and the common use case was for printfs, where calling free
is a pain.

Text files modified:
   trunk/opal/runtime/opal_finalize.c |     3 +
   trunk/opal/runtime/opal_init.c     |     6 +++
   trunk/opal/util/net.c              |    68 +++
   trunk/opal/util/net.h              |    28 +++
   4 files changed, 103 insertions(+), 2 deletions(-)

Modified: trunk/opal/runtime/opal_finalize.c
==============================================================================
--- trunk/opal/runtime/opal_finalize.c (original)
+++ trunk/opal/runtime/opal_finalize.c 2007-07-18 16:25:01 EDT (Wed, 18 Jul 2007)
@@ -25,6 +25,7 @@
 #include "opal/util/output.h"
 #include "opal/util/malloc.h"
 #include "opal/util/if.h"
+#include "opal/util/net.h"
 #include "opal/util/keyval_parse.h"
 #include "opal/memoryhooks/memory.h"
 #include "opal/mca/base/base.h"
@@ -53,6 +54,8 @@
        close when not opened internally */
     opal_iffinalize();
 
+    opal_net_finalize();
+
     /* keyval lex-based parser */
     opal_util_keyval_parse_finalize();

Modified: trunk/opal/runtime/opal_init.c
==============================================================================
--- trunk/opal/runtime/opal_init.c (original)
+++ trunk/opal/runtime/opal_init.c 2007-07-18 16:25:01 EDT (Wed, 18 Jul 2007)
@@ -28,6 +28,7 @@
 #include "opal/memoryhooks/memory.h"
 #include "opal/mca/base/base.h"
 #include "opal/runtime/opal.h"
+#include "opal/util/net.h"
 #include "opal/mca/installdirs/base/base.h"
 #include "opal/mca/memory/base/base.h"
 #include "opal/mca/memcpy/base/base.h"
@@ -165,6 +166,11 @@
         goto return_error;
     }
 
+    if (OPAL_SUCCESS != (ret = opal_net_init())) {
+        error = "opal_net_init";
+        goto return_error;
+    }
+
     /* Setup the parameter system */
     if (OPAL_SUCCESS != (ret = mca_base_param_init())) {
         error = "mca_base_param_init";

Modified: trunk/opal/util/net.c
==============================================================================
--- trunk/opal/util/net.c (original)
+++ trunk/opal/util/net.c 2007-07-18 16:25:01 EDT (Wed, 18 Jul 2007)
@@ -74,9 +74,62 @@
 #include "opal/util/output.h"
 #include "opal/util/strncpy.h"
 #include "opal/constants.h"
+#include "opal/threads/tsd.h"
 
 #ifdef HAVE_STRUCT_SOCKADDR_IN
 
+#if OPAL_WANT_IPV6
+static opal_tsd_key_t hostname_tsd_key;
+
+
+static void
+hostname_cleanup(void *value)
+{
+    opal_output(0, "cleaning up buffer: 0x%lx", value);
+    if (NULL != value) free(value);
+}
+
+
+static char*
+get_hostname_buffer(void)
+{
+    void *buffer;
+    int ret;
+
+    ret = opal_tsd_getspecific(hostname_tsd_key, &buffer);
+    if (OPAL_SUCCESS != ret) return NULL;
+
+    if (NULL == buffer) {
+        opal_output(0, "getting a buffer");
+        buffer = (void*) malloc((NI_MAXHOST + 1) * sizeof(char));
+        ret = opal_tsd_setspecific(hostname_tsd_key, buffer);
+    }
+
+    opal_output(0, "returning buffer: 0x%lx", buffer);
+
+    return (char*) buffer;
+}
+#endif
+
+
+int
+opal_net_init()
+{
+#if OPAL_WANT_IPV6
+    return opal_tsd_key_create(&hostname_tsd_key, hostname_cleanup);
+#else
+    return OPAL_SUCCESS;
+#endif
+}
+
+
+int
+opal_net_finalize()
+{
+    return OPAL_SUCCESS;
+}
+
+
 /* convert a CIDR prefixlen to netmask (in network byte order) */
 uint32_t
 opal_net_prefix2netmask(uint32_t prefixlen)
@@ -225,7 +278,7 @@
 opal_net_get_hostname(struct sockaddr *addr)
 {
 #if OPAL_WANT_IPV6
-    char *name = (char *)malloc((NI_MAXHOST + 1) * sizeof(char));
+    char *name = get_hostname_buffer();
     int error;
     socklen_t addrlen;
 
@@ -297,6 +350,19 @@
 
 #else /* HAVE_STRUCT_SOCKADDR_IN */
 
+int
+opal_net_init()
+{
+    return OPAL_SUCCESS;
+
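On the configure-test idea: a probe along these lines (just a sketch of what such a test program might look like, not something currently in OMPI's configure) would let us disable __thread/TLS usage when the compiler or runtime cannot handle it:

    /* Minimal TLS probe: if this compiles, links (with -lpthread), and runs
     * to completion, the compiler/libc combination supports __thread
     * thread-local storage.  Sketch only -- not an existing configure test. */
    #include <pthread.h>

    static __thread int tls_counter = 0;

    static void *worker(void *arg)
    {
        (void) arg;
        tls_counter++;              /* each thread gets its own copy */
        return 0;
    }

    int main(void)
    {
        pthread_t t;
        if (0 != pthread_create(&t, 0, worker, 0)) return 1;
        if (0 != pthread_join(t, 0)) return 1;
        /* main's copy must be untouched if TLS really works */
        return (0 == tls_counter) ? 0 : 1;
    }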
[OMPI devel] problems with openib finalize
Background: Pasha added a call in the openib BTL finalize function that will only succeed if all registered memory has been released (ibv_dealloc_pd()). Since the test app didn't call MPI_FREE_MEM, there was some memory that was still registered, and therefore the call in finalize failed. We treated this as a fatal error. Last night's MTT runs turned up several apps that exhibited this fatal error. While we're examining this problem, Pasha has removed the call to ibv_dealloc_pd() in the trunk openib BTL finalize.

I examined one of the tests that was failing last night in MTT: onesided/t.f90. This test has an MPI_ALLOC_MEM with no corresponding MPI_FREE_MEM. To investigate this problem, I restored the call to ibv_dealloc_pd() and re-ran the t.f90 test -- the problem still occurs. Good. However, once I got the right MPI_FREE_MEM call in t.f90, the test started passing. I.e., ibv_dealloc_pd(hca->ib_pd) succeeds because all registered memory has been released. Hence, the test itself was faulty.

However, I don't think we should *error* if we fail to ibv_dealloc_pd(hca->ib_pd); it's a user error, but it's not catastrophic unless we're trying to do an HCA restart scenario. Specifically: during a normal MPI_FINALIZE, who cares?

I think we should do the following:

1. If we're not doing an HCA restart/checkpoint and we fail to ibv_dealloc_pd(), just move on (i.e., it's not a warning/error unless we *want* a warning, such as if an MCA parameter btl_openib_warn_if_finalize_fail is enabled, or somesuch; a rough sketch of this logic follows at the end of this message).

2. If we *are* doing an HCA restart/checkpoint and ibv_dealloc_pd() fails, then we have to fail gracefully and notify upper layers that Bad Things happened (I suspect that we need mpool finalize implemented to properly implement checkpointing for RDMA networks).

3. Add a new MCA parameter named mpi_show_mpi_alloc_mem_leaks that, when enabled, shows a warning in ompi_mpi_finalize() if there is still memory allocated by MPI_ALLOC_MEM that was not freed by MPI_FREE_MEM (this MCA parameter will parallel the already-existing mpi_show_handle_leaks MCA param, which displays warnings if the app creates MPI objects but does not free them).

My points:

- leaked MPI_ALLOC_MEM memory should be reported by the MPI layer, not a BTL or mpool
- failing to ibv_dealloc_pd() during MPI_FINALIZE should only trigger a warning if the user wants to see it
- failing to ibv_dealloc_pd() during an HCA restart or checkpoint should gracefully fail upwards

Comments?

--
Jeff Squyres
Cisco Systems
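Here is that rough sketch for point 1 -- illustrative only; the parameter names and exact error-handling policy are placeholders, not a committed design:

    /* Illustrative only: release the protection domain at finalize time and
     * complain only if the user asked for a warning, propagating an error
     * only when a clean teardown is actually required (HCA restart /
     * checkpoint).  The parameter names below are hypothetical. */
    #include <infiniband/verbs.h>
    #include <stdio.h>

    int openib_finalize_pd(struct ibv_pd *pd,
                           int warn_if_finalize_fail,  /* e.g. btl_openib_warn_if_finalize_fail */
                           int need_clean_teardown)    /* HCA restart / checkpoint in progress  */
    {
        int rc = ibv_dealloc_pd(pd);
        if (0 == rc) {
            return 0;                                  /* all registered memory was released */
        }

        if (warn_if_finalize_fail) {
            fprintf(stderr, "openib BTL: ibv_dealloc_pd() failed (likely an "
                            "MPI_ALLOC_MEM without a matching MPI_FREE_MEM)\n");
        }

        /* During a normal MPI_FINALIZE this is harmless; only propagate an
         * error when the upper layers truly need the PD gone. */
        return need_clean_teardown ? rc : 0;
    }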