Hello,

I'm getting a core dump when using openmpi-1.0.2 with the MPI extensions
we're developing for the MATLAB interpreter.  This same build of openmpi
works fine with C programs and with our extensions for GNU Octave.  The
machine is an AMD64 running Linux:

Linux kodos 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:29:47 EST 2005 x86_64 x86_64 
x86_64 GNU/Linux

I believe there's a bug in that opal_memory_malloc_hooks_init() links
itself into the __free_hook chain during initialization, but then it
never unlinks itself at shutdown.  In the interpreter environment,
libopal.so is dlclose()d and unmapped from memory long before the
interpreter is done with dynamic memory.  A quick check of the nightly
trunk snapshot reveals some function name changes, but no new shutdown
code.

After running this trivial MPI program on a single processor:
        MPI_Init();
        MPI_Finalize();
I'm back at the MATLAB prompt, and break into the debugger:

>>> ^C
(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
...
0x0000002aa0b50740  0x0000002aa0b50a28  Yes         .../mexMPI_Init.mexa64
0x0000002aa0c52a50  0x0000002aa0c54318  Yes         .../lib/libbcmpi.so.0
0x0000002aa0dcef90  0x0000002aa0e37398  Yes         /usr/lib64/libstdc++.so.6
0x0000002aa0fa9ec0  0x0000002aa102e118  Yes         .../lib/libmpi.so.0
0x0000002aa1178560  0x0000002aa11af708  Yes         .../lib/liborte.so.0
0x0000002aa12cffb0  0x0000002aa12f2988  Yes         .../lib/libopal.so.0
0x0000002aa1424180  0x0000002aa14249d8  Yes         /lib64/libutil.so.1
0x0000002aa152a760  0x0000002aa1536368  Yes         /lib64/libnsl.so.1
0x0000002aa3540b80  0x0000002aa3551077  Yes         /usr/local/ibgd-1.8.0/driver/infinihost/lib64/libvapi.so
0x0000002aa365e0a0  0x0000002aa3664a86  Yes         /usr/local/ibgd-1.8.0/driver/infinihost/lib64/libmosal.so
0x0000002aa470db50  0x0000002aa4719438  Yes         /usr/local/ibgd-1.8.0/driver/infinihost/lib64/librhhul.so
0x0000002ac4e508c0  0x0000002ac4e50ed8  Yes         .../mexMPI_Constants.mexa64
0x0000002ac4f52740  0x0000002ac4f52a28  Yes         .../mexMPI_Finalize.mexa64

(gdb) c
>> exit

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 182992729024 (LWP 21848)]
opal_mem_free_free_hook (ptr=0x7fbfff96d0, caller=0xa8d4f8) at memory_malloc_hooks.c:65

(gdb) info sharedlibrary
From                To                  Syms Read   Shared Object Library
...
0x0000002aa1424180  0x0000002aa14249d8  Yes         /lib64/libutil.so.1
0x0000002aa152a760  0x0000002aa1536368  Yes         /lib64/libnsl.so.1
0x0000002aa3540b80  0x0000002aa3551077  Yes         /usr/local/ibgd-1.8.0/driver/infinihost/lib64/libvapi.so
0x0000002aa365e0a0  0x0000002aa3664a86  Yes         /usr/local/ibgd-1.8.0/driver/infinihost/lib64/libmosal.so
0x0000002aa470db50  0x0000002aa4719438  Yes         /usr/local/ibgd-1.8.0/driver/infinihost/lib64/librhhul.so

(gdb) list
63      static void
64      opal_mem_free_free_hook (void *ptr, const void *caller)
65      {
66          /* dispatch about the pending free */
67          opal_mem_free_release_hook(ptr, malloc_usable_size(ptr));
68
69          __free_hook = old_free_hook;
70
71          /* call the next chain down */
72          free(ptr);
73
74          /* save the hooks again and restore our hook again */

(gdb) print ptr
$2 = (void *) 0x7fbfff96d0
(gdb) print caller
$3 = (const void *) 0xa8d4f8
(gdb) print __free_hook
$4 = (void (*)(void *, const void *)) 0x2aa12f1d79 <opal_mem_free_free_hook>
(gdb) print old_free_hook
Cannot access memory at address 0x2aa1422800


Before I start blindly hacking a workaround, can somebody who's familiar
with the openmpi internals verify that this is a real bug, suggest a
correct fix, and/or comment on other potential problems with running in
an interpreter?

Thanks-

-Neil
