Re: [OMPI devel] iof / oob issues

2007-07-18 Thread Jeff Squyres
BTW, the fix didn't occur over the weekend because of some merging  
issues.


I also didn't explain the problem well; you may see some clipped  
output from your program or the orted may hang while everything is  
shutting down.  This is especially likely to occur for very short  
applications.


The problem is actually in the oob; the orted gets into a state where  
it's waiting for some IOF OOB callbacks to occur for messages that  
were already successfully sent, but the callbacks never occur due  
to... well, it's a long story.  The IOF is basically spinning during  
the orted shutdown waiting for pending OOB callbacks that will never  
occur.
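
To make the failure mode concrete, here is a minimal, self-contained sketch
of the pattern (purely illustrative C -- not the actual IOF/OOB code; the
names are made up):

#include <stdio.h>

/* Illustrative only.  The IOF waits in a progress loop for OOB
 * send-completion callbacks, but the callbacks for messages that were
 * already sent are never delivered, so the loop never exits. */

static int pending_oob_sends = 2;     /* sends already on the wire */

static void oob_send_complete_cb(void)
{
    pending_oob_sends--;              /* in the buggy state, this never runs */
}

static void iof_shutdown(void)
{
    while (pending_oob_sends > 0) {
        /* poll for progress; nothing will ever deliver the callbacks */
    }
    printf("iof shutdown complete\n");
}

int main(void)
{
    /* Simulate the fixed behavior: the callbacks fire, so shutdown returns.
     * Remove these two calls to reproduce the spin described above. */
    oob_send_complete_cb();
    oob_send_complete_cb();
    iof_shutdown();
    return 0;
}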


I can explain in more detail if anyone cares, but hopefully Brian  
will be able to work the fix in within the next few days.



On Jul 13, 2007, at 5:04 PM, Jeff Squyres wrote:

FYI: there is an issue on the OMPI trunk right now that the tail  
end of output from applications may get clipped.  The fix is coming  
this weekend.  If you care, I'll explain, but I just wanted to give  
everyone a heads up that if you see the tail end of your stdout/stderr
not show up, it's probably not your fault.  :-)


--
Jeff Squyres
Cisco Systems





--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] iof / oob issues

2007-07-18 Thread Ralph H Castain
Just to further clarify the clarification... ;-)

This condition has existed for the last several months. The root problem
dates at least back into the 1.1 series. We chased the problem down to the
iof_flush call in the odls when a process terminates in something like Jan
or Feb this year, at which point we #if 0'd the iof_flush out of the code
pending a resolution (tickets were filed, emails flew, phone calls ensued -
just took awhile for people to have time to deal with it). It is still "on"
in 1.2 - just has been turned "off" in the trunk for months.
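
For anyone not following that code, the workaround is roughly of this shape
(a minimal, hypothetical sketch -- not the actual odls source; the function
names are stand-ins):

#include <stdio.h>

/* Hypothetical stand-ins for the real odls/iof symbols. */
static void iof_flush_stub(int rank)
{
    printf("flushing remaining stdout/stderr for rank %d\n", rank);
}

static void odls_child_terminated(int rank)
{
#if 0   /* disabled: can hang waiting on OOB callbacks that never arrive */
    iof_flush_stub(rank);
#endif
    printf("rank %d cleaned up (the tail of its output may be clipped)\n", rank);
}

int main(void)
{
    odls_child_terminated(0);
    return 0;
}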

[Actually, I did turn it back on briefly following r15390. Turned out the
timing changed just enough to make it work most of the time with things that
called orte_finalize, but it always failed for programs that didn't, so we turned
it back off again.]

So the problem of having clipped output has been around for quite some time.
Since only Galen ever commented to me about being impacted by it, I gather
nobody has really noticed. ;-)

Hopefully, we'll be able to turn it back on again soon.


On 7/18/07 6:02 AM, "Jeff Squyres"  wrote:

> BTW, the fix didn't occur over the weekend because of some merging
> issues.
> 
> I also didn't explain the problem well; you may see some clipped
> output from your program or the orted may hang while everything is
> shutting down.  This is especially likely to occur for very short
> applications.
> 
> The problem is actually in the oob; the orted gets into a state where
> it's waiting for some IOF OOB callbacks to occur for messages that
> were already successfully sent, but the callbacks never occur due
> to... well, it's a long story.  The IOF is basically spinning during
> the orted shutdown waiting for pending OOB callbacks that will never
> occur.
> 
> I can explain in more detail if anyone cares, but hopefully Brian
> will be able to work the fix in within the next few days.
> 
> 
> On Jul 13, 2007, at 5:04 PM, Jeff Squyres wrote:
> 
>> FYI: there is an issue on the OMPI trunk right now that the tail
>> end of output from applications may get clipped.  The fix is coming
>> this weekend.  If you care, I'll explain, but I just wanted to give
>> everyone heads up that if you see the tail end of your stdout/
>> stderr not show up, it's probably not your fault.  :-)
>> 
>> -- 
>> Jeff Squyres
>> Cisco Systems
>> 
>> 
> 




[OMPI devel] LD_LIBRARY_PATH and process launch on a head node

2007-07-18 Thread Gleb Natapov
Hi,

  With current trunk LD_LIBRARY_PATH is not set for ranks that are
launched on the head node. This worked previously.

--
Gleb.


Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node

2007-07-18 Thread Gleb Natapov
On Wed, Jul 18, 2007 at 04:27:15PM +0300, Gleb Natapov wrote:
> Hi,
> 
>   With current trunk LD_LIBRARY_PATH is not set for ranks that are
> launched on the head node. This worked previously.
> 
Some more info: I use the rsh pls.
elfit1# /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1 env | grep LD_LIBRARY_PATH
gives nothing. 

The strange thing that I just found is that this one works
elfit1# /home/glebn/openmpi/bin/mpirun -np 1 -H localhost env | grep LD_LIBRARY_PATH
LD_LIBRARY_PATH=/home/glebn/openmpi/lib

--
Gleb.


Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node

2007-07-18 Thread Ralph H Castain
I believe that was fixed in r15405 - are you at that rev level?


On 7/18/07 7:27 AM, "Gleb Natapov"  wrote:

> Hi,
> 
>   With current trunk LD_LIBRARY_PATH is not set for ranks that are
> launched on the head node. This worked previously.
> 
> --
> Gleb.




Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node

2007-07-18 Thread Gleb Natapov
On Wed, Jul 18, 2007 at 07:48:17AM -0600, Ralph H Castain wrote:
> I believe that was fixed in r15405 - are you at that rev level?
I am on the latest revision.

> 
> 
> On 7/18/07 7:27 AM, "Gleb Natapov"  wrote:
> 
> > Hi,
> > 
> >   With current trunk LD_LIBRARY_PATH is not set for ranks that are
> > launched on the head node. This worked previously.
> > 
> > --
> > Gleb.

--
Gleb.


Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node

2007-07-18 Thread Ralph H Castain
It works for me in both cases, provided I give the fully qualified host name
for your first example. In other words, these work:

pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host localhost printenv | grep LD
[pn1180961.lanl.gov:22021] [0.0] test of print_name
OLDPWD=/Users/rhc/openmpi
LD_LIBRARY_PATH=/Users/rhc/openmpi/lib:/Users/rhc/lib:/opt/local/lib:/usr/local/lib:

pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961.lanl.gov printenv | grep LD
[pn1180961.lanl.gov:22012] [0.0] test of print_name
OLDPWD=/Users/rhc/openmpi
LD_LIBRARY_PATH=/Users/rhc/openmpi/lib:/Users/rhc/lib:/opt/local/lib:/usr/local/lib:


But this will lockup:

pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep LD

The reason is that the hostname in this last command doesn't match the
hostname I get when I query my interfaces, so mpirun thinks it must be a
remote host - and so we get stuck in ssh until that times out. That could be
quick on your machine, but it takes a while for me.
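
The check is conceptually something like the following (a rough,
self-contained illustration only -- not mpirun's actual logic; looks_local is
a made-up helper):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Decide whether a -host argument names this node.  An exact string match
 * against the local hostname is deliberately naive: "pn1180961" does not
 * match "pn1180961.lanl.gov", so it would be treated as remote (ssh/rsh). */
static int looks_local(const char *target)
{
    char localname[256];

    if (0 == strcmp(target, "localhost")) {
        return 1;
    }
    if (0 != gethostname(localname, sizeof(localname))) {
        return 0;
    }
    return 0 == strcmp(target, localname);
}

int main(int argc, char **argv)
{
    const char *target = (argc > 1) ? argv[1] : "localhost";
    printf("%s -> %s\n", target,
           looks_local(target) ? "local launch" : "assume remote, use ssh/rsh");
    return 0;
}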

Hope that helps
Ralph


On 7/18/07 7:45 AM, "Gleb Natapov"  wrote:

> On Wed, Jul 18, 2007 at 04:27:15PM +0300, Gleb Natapov wrote:
>> Hi,
>> 
>>   With current trunk LD_LIBRARY_PATH is not set for ranks that are
>> launched on the head node. This worked previously.
>> 
> Some more info: I use the rsh pls.
> elfit1# /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1 env | grep
> LD_LIBRARY_PATH
> gives nothing. 
> 
> The strange thing that I just found is that this one works
> elfit1# /home/glebn/openmpi/bin/mpirun -np 1 -H localhost env | grep
> LD_LIBRARY_PATH
> LD_LIBRARY_PATH=/home/glebn/openmpi/lib
> 
> --
> Gleb.




[OMPI devel] optional fortran datatype fixes: 1.2.4?

2007-07-18 Thread Jeff Squyres

Rainer --

Did you want to get r14818 and r15137 into 1.2.4?  There's no CMR for  
them.  Here's your commit messages:


r14818:
 - The optional Fortran datatypes may not be available
   Do not initialize them, if not.
   If initializing them, check for the correct C-equivalent type
   to copy from...
   Issue a warning, when a type (e.g. REAL*16) is not available to
   build the type (here COMPLEX*32).
   This fixes issues with ompi and pacx.

   Works with intel-compiler and FCFLAGS="-i8 -r8" on ia32.

r15137:
- Add the missing parts: add MPI_REAL2 to the end of the list
   of Fortran datatypes (mpif-common.h) and the list of registered
   datatypes: MOOG(REAL2).
   Configure and Compilation with ia32/gcc just finished, naturally
   without real2.
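
For anyone skimming, the core of the r14818 idea is a size check along these
lines (a self-contained illustration only; OMPI's actual macro-guarded code
looks different):

#include <stdio.h>

/* An optional Fortran type such as REAL*16 (and thus COMPLEX*32) can only be
 * initialized if the C side has a floating type of matching size to copy
 * from; otherwise it should be skipped with a warning rather than set up
 * from the wrong C equivalent. */
int main(void)
{
    if (16 == sizeof(long double)) {
        printf("REAL*16 / COMPLEX*32 can be backed by long double\n");
    } else {
        printf("no 16-byte C floating type: skip REAL*16 and warn\n");
    }
    return 0;
}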

--
Jeff Squyres
Cisco Systems



[OMPI devel] MPI_BOTTOM fixes: 1.2.4?

2007-07-18 Thread Jeff Squyres

Rainer / George --

You guys made some fixes for MPI_BOTTOM et al. recently; did you want  
them in v1.2.4?  There's no CMR.  I *think* the changes span the  
following commits:


https://svn.open-mpi.org/trac/ompi/changeset/15129
https://svn.open-mpi.org/trac/ompi/changeset/15030

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] optional fortran datatype fixes: 1.2.4?

2007-07-18 Thread Jeff Squyres

Sorry, I should have included links to the commits in question:

https://svn.open-mpi.org/trac/ompi/changeset/14818
https://svn.open-mpi.org/trac/ompi/changeset/15137


On Jul 18, 2007, at 11:46 AM, Jeff Squyres wrote:


Rainer --

Did you want to get r14818 and r15137 into 1.2.4?  There's no CMR for
them.  Here's your commit messages:

r14818:
  - The optional Fortran datatypes may not be available
Do not initialize them, if not.
If initializing them, check for the correct C-equivalent type
to copy from...
Issue a warning, when a type (e.g. REAL*16) is not available to
build the type (here COMPLEX*32).
This fixes issues with ompi and pacx.

Works with intel-compiler and FCFLAGS="-i8 -r8" on ia32.

r15137:
- Add the missing parts: add MPI_REAL2 to the end of the list
of Fortran datatypes (mpif-common.h) and the list of registered
datatypes: MOOG(REAL2).
Configure and Compilation with ia32/gcc just finished, naturally
without real2.

--
Jeff Squyres
Cisco Systems




--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] optional fortran datatype fixes: 1.2.4?

2007-07-18 Thread Rainer Keller
Hi Jeff,

r14818 yes --- but there have otherwise not been any requests for this patch...
r15137 no; we agreed to put it into 1.3

Nevertheless, I posted a CMR for r14818; it applies cleanly to 1.2-svn.

Thanks,
Rainer




On Wednesday 18 July 2007 17:46, Jeff Squyres wrote:
> Rainer --
>
> Did you want to get r14818 and r15137 into 1.2.4?  There's no CMR for
> them.  Here's your commit messages:
>
> r14818:
>   - The optional Fortran datatypes may not be available
> Do not initialize them, if not.
> If initializing them, check for the correct C-equivalent type
> to copy from...
> Issue a warning, when a type (e.g. REAL*16) is not available to
> build the type (here COMPLEX*32).
> This fixes issues with ompi and pacx.
>
> Works with intel-compiler and FCFLAGS="-i8 -r8" on ia32.
>
> r15137:
> - Add the missing parts: add MPI_REAL2 to the end of the list
> of Fortran datatypes (mpif-common.h) and the list of registered
> datatypes: MOOG(REAL2).
> Configure and Compilation with ia32/gcc just finished, naturally
> without real2.

-- 

Dipl.-Inf. Rainer Keller   http://www.hlrs.de/people/keller
 High Performance Computing   Tel: ++49 (0)711-685 6 5858
   Center Stuttgart (HLRS)   Fax: ++49 (0)711-685 6 5832
 POSTAL:Nobelstrasse 19 email: kel...@hlrs.de 
 ACTUAL:Allmandring 30, R.O.030   AIM: rusraink
 70550 Stuttgart


Re: [OMPI devel] MPI_BOTTOM fixes: 1.2.4?

2007-07-18 Thread Rainer Keller
Hi Jeff,
I just checked the mails with Daniel/George from back then.

Yes, both would be required, as stated in r15129;
they should apply cleanly (except for NEWS).

Thanks,
Rainer


On Wednesday 18 July 2007 17:48, Jeff Squyres wrote:
> Rainer / George --
>
> You guys made some fixes for MPI_BOTTOM et al. recently; did you want
> them in v1.2.4?  There's no CMR.  I *think* the changes span the
> following commits:
>
> https://svn.open-mpi.org/trac/ompi/changeset/15129
> https://svn.open-mpi.org/trac/ompi/changeset/15030

-- 

Dipl.-Inf. Rainer Keller   http://www.hlrs.de/people/keller
 High Performance Computing   Tel: ++49 (0)711-685 6 5858
   Center Stuttgart (HLRS)   Fax: ++49 (0)711-685 6 5832
 POSTAL:Nobelstrasse 19 email: kel...@hlrs.de 
 ACTUAL:Allmandring 30, R.O.030   AIM: rusraink
 70550 Stuttgart


Re: [OMPI devel] LD_LIBRARY_PATH and process launch on a head node

2007-07-18 Thread Gleb Natapov
On Wed, Jul 18, 2007 at 09:08:47AM -0600, Ralph H Castain wrote:
> But this will lockup:
> 
> pn1180961:~/openmpi/trunk rhc$ mpirun -n 1 -host pn1180961 printenv | grep
> LD
> 
> The reason is that the hostname in this last command doesn't match the
> hostname I get when I query my interfaces, so mpirun thinks it must be a
> remote host - and so we stick in ssh until that times out. Which could be
> quick on your machine, but takes awhile for me.
> 
This is not my case: mpirun resolves the hostname and runs env, but
LD_LIBRARY_PATH is not there. If I use the full name like this
# /home/glebn/openmpi/bin/mpirun -np 1 -H elfit1.voltaire.com env | grep LD_LIBRARY_PATH
LD_LIBRARY_PATH=/home/glebn/openmpi/lib

everything is OK.

--
Gleb.


Re: [OMPI devel] devel Digest, Vol 802, Issue 1

2007-07-18 Thread Neil Ludban
Good suggestion; increasing the timeout to somewhere around 12
allowed the job to finish.  Initial experimentation showed that
I could get a factor of 3-4x improvement in performance using
even larger timeouts, matching the times for 64 processes and
1/4 the data set.  The cluster is presently having scheduler
issues; I'll post again if I find anything else interesting.
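
For reference, the ack-timeout formula quoted at the bottom of this message
works out roughly as follows (a quick computed check, not measured data):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* actual timeout = 4.096 microseconds * 2^btl_mvapi_ib_timeout */
    for (int t = 10; t <= 16; t += 2) {
        double us = 4.096 * pow(2.0, (double) t);
        printf("btl_mvapi_ib_timeout=%d -> %.1f ms\n", t, us / 1000.0);
    }
    return 0;
}

/* 10 -> ~4.2 ms, 12 -> ~16.8 ms, 14 -> ~67.1 ms, 16 -> ~268.4 ms */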

Thanks-
-Neil

> Date: Tue, 17 Jul 2007 10:14:44 +0300
> From: "Pavel Shamis (Pasha)" 
> Subject: Re: [OMPI devel] InfiniBand timeout errors
> To: Open MPI Developers 
> Message-ID: <469c6c64.4040...@dev.mellanox.co.il>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> 
> Hi,
> Try to increase the IB time out parameter: --mca btl_mvapi_ib_timeout 14
> If the 14 will not work , try to increase little bit more (16)
> 
> Thanks,
> Pasha
> 
> Neil Ludban wrote:
> > Hi,
> >
> > I'm getting the errors below when calling MPI_Alltoallv() as part of
> > a matrix transpose operation.  It's 100% repeatable when testing with
> > 16M matrix elements divided between 64 processes on 32 dual core nodes.
> > There are never any errors with fewer processes or elements, including
> > the same 32 nodes with only one process per node.  If anyone wants
> > any additional information or has suggestions to try, please let me
> > know.  Otherwise, I'll have the system rebooted and hope the problem
> > goes away.
> >
> > -Neil
> >
> >
> >
> > [0,1,7][btl_mvapi_component.c:854:mca_btl_mvapi_component_progress]
> > from c065 to: c077 [0,1,3][btl_mvapi_component.c:854:
> > mca_btl_mvapi_component_progress] from c069 error polling HP
> > CQ with status VAPI_RETRY_EXC_ERR status number 12 for Frag :
> > 0x2ab6590200 to: c078 error polling HP CQ with status
> > VAPI_RETRY_EXC_ERR status number 12 for Frag : 0x2ab61f6380
> > --
> > The retry count is a down counter initialized on creation of the QP. Retry
> > count is defined in the InfiniBand Spec 1.2 (12.7.38): 
> > The total number of times that the sender wishes the receiver to retry
> > timeout, packet sequence, etc. errors before posting a completion error.
> >
> > Note that two mca parameters are involved here: 
> > btl_mvapi_ib_retry_count - The number of times the sender will attempt to
> > retry  (defaulted to 7, the maximum value). 
> >
> > btl_mvapi_ib_timeout - The local ack timeout parameter (defaulted to 10). 
> > The actual timeout value used is calculated as:
> > (4.096 micro-seconds * 2^btl_mvapi_ib_timeout). 
> > See InfiniBand Spec 1.2 (12.7.34) for more details.


Re: [OMPI devel] Fwd: lsf support / farm use models

2007-07-18 Thread Matthew Moskewicz

hi,

first of all, thanks for the info bill! i think i'm really starting to
piece things together now. you are right in that i'm working with a
6.x (6.2 with 6.1 devel libs ;) install here at cadence, without the
HPC extensions AFAIK. also, i think that our customers are mostly in
the same position -- i assume that the HPC extensions cost extra? or
perhaps admins just don't bother to install them.

so, there are at least three cases to consider:
LSF 7.0 or greater
LSF 6.x /w HPC
LSF 6.x 'base'

i'll try to gather more data, but my feeling it that the market
penetration of both HPC and LSF 7.0 is low in our market (EDA vendors
and customers). i'd love to just stall until 7.0 is widely available,
but perhaps in the meantime it would be nice to have some backward
support for LSF 6.0 'base'. it seems like supporting LSF 6.x /w HPC
might not be too useful, since:
a) it's not clear that the 'built in' "bsub -n N -a openmpi foo"
support will work with an MPI-2 dynamic-spawning application like mine
(or does it?),
b) i've heard that manually interfacing with the  parallel application
manager directly is tricky?
c) most importantly, it's not clear that any of our customers have the
HPC support, and certainly not all of them, so i need to support LSF
6.0 'base' anyway -- it only needs to work until 7.0 is widely
available (< 1 year? i really have no idea ... will Platform end
support for 6.x at some particular time? or otherwise push customers
to upgrade? perhaps cadence can help there too ...) .

under LSF 7.0 it looks like things are okay and that open-mpi will
support it in a released version 'soon' (< 6 months? ). sooner than
our customers will have LSF 7.0 anyway, so that's fine.

as for LSF 6.0 'base', there are two workarounds that i see, and a
couple key questions that remain:

1) use bsub -n N, followed by N-1 ls_rtaske() calls (or similar).
while ls_rtaske() may not 'force' me to follow the queuing rules, if i
only launch on the proper machines, i should be okay, right? i don't
think IO and process marshaling (i'm not sure exactly what you mean by
that) are a problem since openmpi/orted handles those issues, i think?

2) use only bsub's of single processes, using some initial wrapper
script that bsub's all the jobs (master + N-1 slaves) needed to reach
the desired static allocation for openmpi. this seems to be what my
internal guy is suggesting is 'required'. integration with openmpi
might not be too hard, using suitable trickery. for example, the
wrapper script launches some wrapper processes that are basically
rexec daemons. the master waits for them to come up in the ras/lsf
component (tcp notify, perhaps via the launcher machine to avoid
needing to know the master hostname a priori), and then the pls/lsf
component uses the thin rexec daemons to launch orted. seems like a
bit of a silly workaround, but it does seem to both keep the queuing
system happy as well as not need ls_rtaske() or similar.

[ Note: (1) will fail if admins disable the ls_rexec() type of
functionality, but on a LSF 6.0 'base' system, this would seem to
disable all parallel (||) job launching -- i.e. the shipped mpijob/pvmjob all use
lsgrun and such, so they would be disabled -- is there any other way i
could start the sub-processes within my allocation in that case? can i
just have bsub start N copies of something (maybe orted?)? that seems
like it might be hard to integrate with openmpi, though -- in that
case, i'd probably just implement option (2)]

Matt.

On 7/17/07, Bill McMillan  wrote:





> there appear to be some overlaps between the ls_* and lsb_* functions,
> but they seem basically compatible as far as i can tell. almost all
> the functions have a command line version as well, for example:
> lsb_submit()/bsub

  Like openmpi and orte, there are two layers in LSF.  The ls_* API's
  talk to what is/was historically called "LSF Base" and the lsb_* API's
  talk to what is/was historically called "LSF Batch".

[SNIP]

  Regards,
  Bill


-
Bill McMillan
Principal Technical Product Manager
Platform Computing



[OMPI devel] pathscale compilers and TLS

2007-07-18 Thread Jeff Squyres

Crud.

The Pathscale 3.0 compilers do not support thread-local data.  This  
is what we've been fighting with in https://svn.open-mpi.org/trac/ompi/ticket/1025;
QLogic just told us last week that their compiler does
not support TLS (even though OMPI was not currently using it, glibc  
does, and something calls abort() deep within pthread_exit(NULL)).   
If you don't use the TLS glibc, everything works fine, but the TLS  
glibc is the default on many Linux systems.


QLogic is looking into the problem and said they will get back to us
(I'm unwilling to do horrid LD_PRELOAD tricks to get the non-TLS  
glibc, etc.).


I'm guessing that this change will guarantee that the Pathscale 3.0
compilers won't work at all.


Is this change just to fix a memory leak?  If so, could we add a  
configure test to see if the compiler is broken w.r.t. TLS?  (I know,  
I know... :-( )
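
Something like the following would do as a compile-and-run probe (a minimal
sketch of such a configure test, not what OMPI's configure actually does):

#include <pthread.h>
#include <stdio.h>

/* If the toolchain's TLS support is broken, this should fail to link or
 * misbehave at run time; a working build prints two independent values. */
static __thread int counter = 0;

static void *worker(void *arg)
{
    (void) arg;
    counter = 1;                       /* touches only this thread's copy */
    printf("worker counter = %d\n", counter);
    return NULL;
}

int main(void)
{
    pthread_t t;

    counter = 42;                      /* main thread's copy */
    if (0 != pthread_create(&t, NULL, worker, NULL)) {
        return 1;
    }
    pthread_join(t, NULL);
    printf("main counter = %d\n", counter);   /* still 42 */
    return 0;
}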




On Jul 18, 2007, at 4:25 PM, brbar...@osl.iu.edu wrote:


Author: brbarret
Date: 2007-07-18 16:25:01 EDT (Wed, 18 Jul 2007)
New Revision: 15494
URL: https://svn.open-mpi.org/trac/ompi/changeset/15494

Log:
Use thread specific data and static buffers for the return type of
opal_net_get_hostname() rather than malloc, because no one was freeing
the buffer and the common use case was for printfs, where calling
free is a pain.

Text files modified:
   trunk/opal/runtime/opal_finalize.c |  3 +
   trunk/opal/runtime/opal_init.c     |  6 +++
   trunk/opal/util/net.c              | 68 +++
   trunk/opal/util/net.h              | 28 +++
   4 files changed, 103 insertions(+), 2 deletions(-)

Modified: trunk/opal/runtime/opal_finalize.c
==============================================================================
--- trunk/opal/runtime/opal_finalize.c  (original)
+++ trunk/opal/runtime/opal_finalize.c  2007-07-18 16:25:01 EDT (Wed, 18 Jul 2007)

@@ -25,6 +25,7 @@
 #include "opal/util/output.h"
 #include "opal/util/malloc.h"
 #include "opal/util/if.h"
+#include "opal/util/net.h"
 #include "opal/util/keyval_parse.h"
 #include "opal/memoryhooks/memory.h"
 #include "opal/mca/base/base.h"
@@ -53,6 +54,8 @@
close when not opened internally */
 opal_iffinalize();

+opal_net_finalize();
+
 /* keyval lex-based parser */
 opal_util_keyval_parse_finalize();


Modified: trunk/opal/runtime/opal_init.c
==============================================================================
--- trunk/opal/runtime/opal_init.c  (original)
+++ trunk/opal/runtime/opal_init.c  2007-07-18 16:25:01 EDT (Wed, 18 Jul 2007)

@@ -28,6 +28,7 @@
 #include "opal/memoryhooks/memory.h"
 #include "opal/mca/base/base.h"
 #include "opal/runtime/opal.h"
+#include "opal/util/net.h"
 #include "opal/mca/installdirs/base/base.h"
 #include "opal/mca/memory/base/base.h"
 #include "opal/mca/memcpy/base/base.h"
@@ -165,6 +166,11 @@
 goto return_error;
 }

+if (OPAL_SUCCESS != (ret = opal_net_init())) {
+error = "opal_net_init";
+goto return_error;
+}
+
 /* Setup the parameter system */
 if (OPAL_SUCCESS != (ret = mca_base_param_init())) {
 error = "mca_base_param_init";

Modified: trunk/opal/util/net.c
==============================================================================
--- trunk/opal/util/net.c   (original)
+++ trunk/opal/util/net.c   2007-07-18 16:25:01 EDT (Wed, 18 Jul 2007)
@@ -74,9 +74,62 @@
 #include "opal/util/output.h"
 #include "opal/util/strncpy.h"
 #include "opal/constants.h"
+#include "opal/threads/tsd.h"

 #ifdef HAVE_STRUCT_SOCKADDR_IN

+#if OPAL_WANT_IPV6
+static opal_tsd_key_t hostname_tsd_key;
+
+
+static void
+hostname_cleanup(void *value)
+{
+opal_output(0, "cleaning up buffer: 0x%lx", value);
+if (NULL != value) free(value);
+}
+
+
+static char*
+get_hostname_buffer(void)
+{
+void *buffer;
+int ret;
+
+ret = opal_tsd_getspecific(hostname_tsd_key, &buffer);
+if (OPAL_SUCCESS != ret) return NULL;
+
+if (NULL == buffer) {
+opal_output(0, "getting a buffer");
+buffer = (void*) malloc((NI_MAXHOST + 1) * sizeof(char));
+ret = opal_tsd_setspecific(hostname_tsd_key, buffer);
+}
+
+opal_output(0, "returning buffer: 0x%lx", buffer);
+
+return (char*) buffer;
+}
+#endif
+
+
+int
+opal_net_init()
+{
+#if OPAL_WANT_IPV6
+return opal_tsd_key_create(&hostname_tsd_key, hostname_cleanup);
+#else
+return OPAL_SUCCESS;
+#endif
+}
+
+
+int
+opal_net_finalize()
+{
+return OPAL_SUCCESS;
+}
+
+
 /* convert a CIDR prefixlen to netmask (in network byte order) */
 uint32_t
 opal_net_prefix2netmask(uint32_t prefixlen)
@@ -225,7 +278,7 @@
 opal_net_get_hostname(struct sockaddr *addr)
 {
 #if OPAL_WANT_IPV6
-char *name = (char *)malloc((NI_MAXHOST + 1) * sizeof(char));
+char *name = get_hostname_buffer();
 int error;
 socklen_t addrlen;

@@ -297,6 +350,19 @@

 #else /* HAVE_STRUCT_SOCKADDR_IN */

+int
+opal_net_init()
+{
+return OPAL_SUCCESS;
+

[OMPI devel] problems with openib finalize

2007-07-18 Thread Jeff Squyres
Background: Pasha added a call in the openib BTL finalize function  
that will only succeed if all registered memory has been released  
(ibv_dealloc_pd()).  Since the test app didn't call MPI_FREE_MEM,  
there was some memory that was still registered, and therefore the  
call in finalize failed.  We treated this as a fatal error.  Last  
night's MTT runs turned up several apps that exhibited this fatal error.


While we're examining this problem, Pasha has removed the call to  
ibv_dealloc_pd() in the trunk openib BTL finalize.


I examined one of the tests that was failing last night in MTT:
onesided/t.f90.  This test has an MPI_ALLOC_MEM with no corresponding  
MPI_FREE_MEM.  To investigate this problem, I restored the call to  
ibv_dealloc_pd() and re-ran the t.f90 test -- the problem still  
occurs.  Good.


However, once I got the right MPI_FREE_MEM call in t.f90, the test  
started passing.  I.e., ibv_dealloc_pd(hca->ib_pd) succeeds because  
all registered memory has been released.  Hence, the test itself was  
faulty.
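
In C, the pairing the test was missing looks like this (a trivial
illustration; the real test is the Fortran onesided/t.f90):

#include <mpi.h>

int main(int argc, char **argv)
{
    void *buf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Alloc_mem(1024 * 1024, MPI_INFO_NULL, &buf);
    /* ... use buf, e.g. as a one-sided window ... */
    MPI_Free_mem(buf);   /* without this, the memory stays registered and
                            ibv_dealloc_pd() at finalize can fail */
    MPI_Finalize();
    return 0;
}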


However, I don't think we should *error* if we fail to
ibv_dealloc_pd(hca->ib_pd); it's a user error, but it's not catastrophic unless
we're trying to do an HCA restart scenario.  Specifically: during a  
normal MPI_FINALIZE, who cares?


I think we should do the following:

1. If we're not doing an HCA restart/checkpoint and we fail to  
ibv_dealloc_pd(), just move on (i.e., it's not a warning/error unless  
we *want* a warning, such as if an MCA parameter  
btl_openib_warn_if_finalize_fail is enabled, or somesuch).


2. If we *are* doing an HCA restart/checkpoint and ibv_dealloc_pd()  
fails, then we have to fail gracefully and notify the upper layers that
Bad Things happened (I suspect that we need mpool finalize  
implemented to properly support checkpointing for RDMA networks).


3. Add a new MCA parameter named mpi_show_mpi_alloc_mem_leaks that,  
when enabled, shows a warning in ompi_mpi_finalize() if there is  
still memory allocated by MPI_ALLOC_MEM that was not freed by  
MPI_FREE_MEM (this MCA parameter will parallel the already-existing  
mpi_show_handle_leaks MCA param which displays warnings if the app  
creates MPI objects but does not free them).
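
A sketch of what #1 could look like (illustrative only, not the actual
openib BTL code; the parameter name is just the one floated above and the
helper is hypothetical):

#include <stdio.h>
#include <stdbool.h>
#include <infiniband/verbs.h>

/* Treat a failed ibv_dealloc_pd() at finalize as non-fatal; only complain
 * when the warn flag (e.g. btl_openib_warn_if_finalize_fail) is set. */
static void finalize_dealloc_pd(struct ibv_pd *pd, bool warn_if_finalize_fail)
{
    if (NULL == pd) {
        return;
    }
    if (0 != ibv_dealloc_pd(pd)) {
        /* Memory is still registered (e.g. a missing MPI_FREE_MEM); during a
         * normal MPI_FINALIZE this is harmless. */
        if (warn_if_finalize_fail) {
            fprintf(stderr, "openib BTL: ibv_dealloc_pd() failed at finalize; "
                    "some registered memory was never released\n");
        }
    }
}

int main(void)
{
    finalize_dealloc_pd(NULL, true);   /* trivial exercise; real use would
                                          pass hca->ib_pd */
    return 0;
}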


My points:
- leaked MPI_ALLOC_MEM memory should be reported by the MPI layer,  
not a BTL or mpool
- failing to ibv_dealloc_pd() during MPI_FINALIZE should only trigger  
a warning if the user wants to see it
- failing to ibv_dealloc_pd() during an HCA restart or checkpoint  
should gracefully fail upwards


Comments?

--
Jeff Squyres
Cisco Systems