Hello,
I'm getting a core dump when using openmpi-1.0.2 with the MPI extensions
we're developing for the MATLAB interpreter. This same build of openmpi
is working great with C programs and our extensions for GNU Octave. The
machine is an AMD64 box running Linux:
Linux kodos 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:29:47 EST 2005 x86_64 x86_64 x86_64 GNU/Linux
I believe the bug is that opal_memory_malloc_hooks_init() links its
handler into the __free_hook chain during initialization, but never
unlinks it at shutdown. In the interpreter environment, libopal.so is
dlclose()d and unmapped from memory long before the interpreter is done
with dynamic memory, so a later free() is still dispatched through a hook
that points into the unmapped library. A quick check of the nightly trunk
snapshot reveals some function name changes, but no new shutdown code.
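To illustrate the missing step, here is a minimal sketch of a free hook that links itself in at init and unlinks itself again at shutdown. It is not Open MPI's code: sim_free_hook and sim_free() are hypothetical stand-ins for glibc's __free_hook and free(), so the sketch compiles without the glibc hook variables, and every sketch_* name is invented for illustration. It also assumes the hook is still at the head of the chain when shutdown runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for glibc's __free_hook (an assumption made so
 * the sketch does not depend on the real glibc hook variables). */
typedef void (*free_hook_fn)(void *ptr);
static free_hook_fn sim_free_hook = NULL;

/* Stand-in for free(): dispatch through the hook if one is installed. */
void sim_free(void *ptr)
{
    if (sim_free_hook != NULL) {
        sim_free_hook(ptr);
    } else {
        free(ptr);
    }
}

static free_hook_fn old_free_hook = NULL; /* next hook down the chain */
static int hooks_installed = 0;
static int release_calls = 0;             /* frees seen by our hook */

/* Same save/restore pattern as the hook in the listing further down. */
static void sketch_free_hook(void *ptr)
{
    release_calls++;                  /* "dispatch about the pending free" */
    sim_free_hook = old_free_hook;    /* step out of the chain */
    sim_free(ptr);                    /* call the next chain down */
    old_free_hook = sim_free_hook;    /* save the hooks again */
    sim_free_hook = sketch_free_hook; /* and restore our hook */
}

/* Link into the chain (what the init function already does). */
void sketch_hooks_init(void)
{
    if (hooks_installed) {
        return;
    }
    old_free_hook = sim_free_hook;
    sim_free_hook = sketch_free_hook;
    hooks_installed = 1;
}

/* Unlink from the chain: the step that appears to be missing. It must
 * run before the library holding sketch_free_hook is unmapped. */
void sketch_hooks_fini(void)
{
    if (!hooks_installed) {
        return;
    }
    /* Only safe to unlink here if we are still the head of the chain. */
    if (sim_free_hook == sketch_free_hook) {
        sim_free_hook = old_free_hook;
    }
    hooks_installed = 0;
    assert(sim_free_hook != sketch_free_hook);
}

/* Small accessors so a caller can observe the state. */
int sketch_hook_is_linked(void) { return sim_free_hook == sketch_free_hook; }
int sketch_release_calls(void)  { return release_calls; }
```

In this sketch, calling sketch_hooks_fini() from the finalize path leaves the hook pointer at whatever was installed before init, so a later sim_free() no longer jumps into code that may have been unmapped. If another library has hooked in on top in the meantime, simple unlinking like this is not enough, which is one reason the real fix needs someone who knows the internals.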
After running this trivial MPI program on a single processor:
MPI_Init();
MPI_Finalize();
I'm back to the MATLAB prompt, and break into the debugger:
>>> ^C
(gdb) info sharedlibrary
From           To             Syms Read Shared Object Library
...
0x002aa0b50740 0x002aa0b50a28 Yes .../mexMPI_Init.mexa64
0x002aa0c52a50 0x002aa0c54318 Yes .../lib/libbcmpi.so.0
0x002aa0dcef90 0x002aa0e37398 Yes /usr/lib64/libstdc++.so.6
0x002aa0fa9ec0 0x002aa102e118 Yes .../lib/libmpi.so.0
0x002aa1178560 0x002aa11af708 Yes .../lib/liborte.so.0
0x002aa12cffb0 0x002aa12f2988 Yes .../lib/libopal.so.0
0x002aa1424180 0x002aa14249d8 Yes /lib64/libutil.so.1
0x002aa152a760 0x002aa1536368 Yes /lib64/libnsl.so.1
0x002aa3540b80 0x002aa3551077 Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/libvapi.so
0x002aa365e0a0 0x002aa3664a86 Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/libmosal.so
0x002aa470db50 0x002aa4719438 Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/librhhul.so
0x002ac4e508c0 0x002ac4e50ed8 Yes .../mexMPI_Constants.mexa64
0x002ac4f52740 0x002ac4f52a28 Yes .../mexMPI_Finalize.mexa64
(gdb) c
>> exit
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 182992729024 (LWP 21848)]
opal_mem_free_free_hook (ptr=0x7fbfff96d0, caller=0xa8d4f8) at memory_malloc_hooks.c:65
(gdb) info sharedlibrary
From           To             Syms Read Shared Object Library
...
0x002aa1424180 0x002aa14249d8 Yes /lib64/libutil.so.1
0x002aa152a760 0x002aa1536368 Yes /lib64/libnsl.so.1
0x002aa3540b80 0x002aa3551077 Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/libvapi.so
0x002aa365e0a0 0x002aa3664a86 Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/libmosal.so
0x002aa470db50 0x002aa4719438 Yes /usr/local/ibgd-1.8.0/driver/infinihost/lib64/librhhul.so
(gdb) list
63 static void
64 opal_mem_free_free_hook (void *ptr, const void *caller)
65 {
66 /* dispatch about the pending free */
67 opal_mem_free_release_hook(ptr, malloc_usable_size(ptr));
68
69 __free_hook = old_free_hook;
70
71 /* call the next chain down */
72 free(ptr);
73
74 /* save the hooks again and restore our hook again */
(gdb) print ptr
$2 = (void *) 0x7fbfff96d0
(gdb) print caller
$3 = (const void *) 0xa8d4f8
(gdb) print __free_hook
$4 = (void (*)(void *, const void *)) 0x2aa12f1d79
(gdb) print old_free_hook
Cannot access memory at address 0x2aa1422800
Before I start blindly hacking a workaround, can somebody who's familiar
with the Open MPI internals verify that this is a real bug, suggest a
correct fix, and/or comment on other potential problems with running in
an interpreter?
Thanks-
-Neil