[OMPI devel] Milestone for OMPI

2006-05-22 Thread Jeff Squyres (jsquyres)
In case you weren't paying attention, Open MPI passed r10,000 this
weekend.

First commit: jsquyres, 11/22/2003
10,000th commit: bosilca, 5/21/2006

For comparison:

LAM first commit: trillium, 2/20/1990
LAM 10,000th commit: brbarret, 1/9/2005
LAM most recent commit (r10,332): brbarret, 5/18/2006

:-)

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



[OMPI devel] memory_malloc_hooks.c and dlclose()

2006-05-22 Thread Neil Ludban
Hello,

I'm getting a core dump when using openmpi-1.0.2 with the MPI extensions
we're developing for the MATLAB interpreter.  This same build of Open MPI
works great with C programs and with our extensions for GNU Octave.  The
machine is AMD64 running Linux:

Linux kodos 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:29:47 EST 2005 x86_64 x86_64 x86_64 GNU/Linux

I believe there's a bug: opal_memory_malloc_hooks_init() links itself
into the __free_hook chain during initialization, but never unlinks
itself at shutdown.  In the interpreter environment, libopal.so is
dlclose()d and unmapped from memory long before the interpreter is done
with dynamic memory.  A quick check of the nightly trunk snapshot
reveals some function name changes, but no new shutdown code.
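
For reference, the glibc hook-chaining pattern at issue looks roughly
like this (a minimal sketch, not the actual Open MPI source;
hooks_init, hooks_finalize, and my_free_hook are hypothetical names):

    #include <malloc.h>   /* glibc's __free_hook is declared here */
    #include <stdlib.h>

    static void (*old_free_hook)(void *, const void *);

    static void my_free_hook(void *ptr, const void *caller)
    {
        (void)caller;  /* unused in this sketch */

        /* temporarily restore the saved hook, call the real free(),
           then re-install ourselves -- the same dance visible in the
           gdb source listing below */
        __free_hook = old_free_hook;
        free(ptr);
        old_free_hook = __free_hook;
        __free_hook = my_free_hook;
    }

    /* what opal_memory_malloc_hooks_init() effectively does:
       save the current hook and link into the chain */
    static void hooks_init(void)
    {
        old_free_hook = __free_hook;
        __free_hook = my_free_hook;
    }

    /* the missing shutdown step: unlink before the library goes away */
    static void hooks_finalize(void)
    {
        __free_hook = old_free_hook;
    }

As long as nothing equivalent to hooks_finalize() ever runs, __free_hook
keeps pointing at code inside libopal.so even after the library is
unmapped.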

After running this trivial MPI program on a single processor:

    MPI_Init();
    MPI_Finalize();

I'm back at the MATLAB prompt and break into the debugger:

>>> ^C
(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
...
0x002aa0b50740  0x002aa0b50a28  Yes .../mexMPI_Init.mexa64
0x002aa0c52a50  0x002aa0c54318  Yes .../lib/libbcmpi.so.0
0x002aa0dcef90  0x002aa0e37398  Yes /usr/lib64/libstdc++.so.6
0x002aa0fa9ec0  0x002aa102e118  Yes .../lib/libmpi.so.0
0x002aa1178560  0x002aa11af708  Yes .../lib/liborte.so.0
0x002aa12cffb0  0x002aa12f2988  Yes .../lib/libopal.so.0
0x002aa1424180  0x002aa14249d8  Yes /lib64/libutil.so.1
0x002aa152a760  0x002aa1536368  Yes /lib64/libnsl.so.1
0x002aa3540b80  0x002aa3551077  Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/libvapi.so
0x002aa365e0a0  0x002aa3664a86  Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/libmosal.so
0x002aa470db50  0x002aa4719438  Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/librhhul.so
0x002ac4e508c0  0x002ac4e50ed8  Yes .../mexMPI_Constants.mexa64
0x002ac4f52740  0x002ac4f52a28  Yes .../mexMPI_Finalize.mexa64

(gdb) c
>> exit

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 182992729024 (LWP 21848)]
opal_mem_free_free_hook (ptr=0x7fbfff96d0, caller=0xa8d4f8) at memory_malloc_hooks.c:65

(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
...
0x002aa1424180  0x002aa14249d8  Yes /lib64/libutil.so.1
0x002aa152a760  0x002aa1536368  Yes /lib64/libnsl.so.1
0x002aa3540b80  0x002aa3551077  Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/libvapi.so
0x002aa365e0a0  0x002aa3664a86  Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/libmosal.so
0x002aa470db50  0x002aa4719438  Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/librhhul.so

(gdb) list
63  static void
64  opal_mem_free_free_hook (void *ptr, const void *caller)
65  {
66  /* dispatch about the pending free */
67  opal_mem_free_release_hook(ptr, malloc_usable_size(ptr));
68
69  __free_hook = old_free_hook;
70
71  /* call the next chain down */
72  free(ptr);
73
74  /* save the hooks again and restore our hook again */

(gdb) print ptr
$2 = (void *) 0x7fbfff96d0
(gdb) print caller
$3 = (const void *) 0xa8d4f8
(gdb) print __free_hook
$4 = (void (*)(void *, const void *)) 0x2aa12f1d79 
(gdb) print old_free_hook
Cannot access memory at address 0x2aa1422800
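
Note that 0x2aa12f1d79 falls inside the range libopal.so occupied before
it was unmapped (0x002aa12cffb0 - 0x002aa12f2988 in the first
sharedlibrary listing), so __free_hook is still pointing into the
unloaded library.  The same failure mode can presumably be reproduced
without MATLAB by any host program that dlclose()s a library that leaves
a hook installed; a sketch under that assumption (hooklib.so is a
hypothetical library whose constructor installs __free_hook as above):

    #include <dlfcn.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* hooklib.so installs __free_hook in its constructor and,
           like libopal.so here, never removes it */
        void *h = dlopen("./hooklib.so", RTLD_NOW);
        if (h == NULL) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return 1;
        }
        dlclose(h);        /* library text is unmapped... */

        free(malloc(16));  /* ...but __free_hook still points into it:
                              SIGSEGV, just like the backtrace above */
        return 0;
    }

(Link with -ldl.)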


Before I start blindly hacking a workaround, can somebody who's familiar
with the Open MPI internals verify that this is a real bug, suggest a
correct fix, and/or comment on other potential problems with running in
an interpreter?

Thanks-

-Neil