I noticed MTT failures from last night and reproduced them this morning on the
1.8 branch.  It looks like it may be a double free.  I assume it is related to
the fixes for aborting programs, possibly
https://svn.open-mpi.org/trac/ompi/changeset/32508, but I am not sure.

[rvandevaart@drossetti-ivy0 environment]$ pwd
/ivylogin/home/rvandevaart/tests/ompi-tests/trunk/ibm/environment
[rvandevaart@drossetti-ivy0 environment]$ mpirun --mca odls_base_verbose 20 -np 2 abort
[...stuff deleted...]
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to tag 30 on child [[58714,1],0]
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to tag 30 on child [[58714,1],1]
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to tag 30 on child [[58714,1],0]
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to tag 30 on child [[58714,1],1]
**************************************************************************
This program tests MPI_ABORT and generates error messages
ERRORS ARE EXPECTED AND NORMAL IN THIS PROGRAM!!
**************************************************************************
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:wait_local_proc child process [[58714,1],0] pid 14955 terminated
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:waitpid_fired child [[58714,1],0] exit code 3
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:waitpid_fired checking abort file /tmp/openmpi-sessions-rvandevaart@drossetti-ivy0_0/58714/1/0/aborted for child [[58714,1],0]
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:waitpid_fired child [[58714,1],0] died by call to abort
*** glibc detected *** mpirun: double free or corruption (fasttop): 0x000000000130e210 ***

From gdb:
(gdb) where
#0  0x00007f75ede138e5 in raise () from /lib64/libc.so.6
#1  0x00007f75ede1504d in abort () from /lib64/libc.so.6
#2  0x00007f75ede517f7 in __libc_message () from /lib64/libc.so.6
#3  0x00007f75ede57126 in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f75eef9eac4 in odls_base_default_wait_local_proc (pid=14955, status=768, cbdata=0x0) at ../../../../orte/mca/odls/base/odls_base_default_fns.c:2007
#5  0x00007f75eef60a78 in do_waitall (options=0) at ../../orte/runtime/orte_wait.c:554
#6  0x00007f75eef60712 in orte_wait_signal_callback (fd=17, event=8, arg=0x7f75ef201400) at ../../orte/runtime/orte_wait.c:421
#7  0x00007f75eecaecbe in event_signal_closure (base=0x1278370, ev=0x7f75ef201400) at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1081
#8  0x00007f75eecaf7e0 in event_process_active_single_queue (base=0x1278370, activeq=0x12788f0) at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1359
#9  0x00007f75eecafaca in event_process_active (base=0x1278370) at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
#10 0x00007f75eecb0148 in opal_libevent2021_event_base_loop (base=0x1278370, flags=1) at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
#11 0x0000000000405572 in orterun (argc=7, argv=0x7fffbdf1dd08) at ../../../../orte/tools/orterun/orterun.c:1078
#12 0x0000000000403904 in main (argc=7, argv=0x7fffbdf1dd08) at ../../../../orte/tools/orterun/main.c:13
(gdb) up
#1  0x00007f75ede1504d in abort () from /lib64/libc.so.6
(gdb) up
#2  0x00007f75ede517f7 in __libc_message () from /lib64/libc.so.6
(gdb) up
#3  0x00007f75ede57126 in malloc_printerr () from /lib64/libc.so.6
(gdb) up
#4  0x00007f75eef9eac4 in odls_base_default_wait_local_proc (pid=14955, status=768, cbdata=0x0) at ../../../../orte/mca/odls/base/odls_base_default_fns.c:2007
2007                free(abortfile);
(gdb) print abortfile
$1 = 0x130e210 ""
(gdb) 
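
For reference, glibc's "fasttop" variant of this message is reported when the
chunk passed to free() is already at the head of its fastbin, i.e. the same
pointer was freed twice in a row.  Below is a minimal sketch of the usual
defensive pattern for a heap-allocated path such as abortfile.  This is not the
actual code in odls_base_default_fns.c; make_abort_path and release_path are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not an ORTE function): build "<dir>/aborted"
 * in a freshly allocated buffer.  Returns NULL if malloc fails. */
static char *make_abort_path(const char *session_dir)
{
    /* sizeof("/aborted") includes the terminating NUL byte. */
    size_t len = strlen(session_dir) + sizeof("/aborted");
    char *path = malloc(len);
    if (path != NULL) {
        snprintf(path, len, "%s/aborted", session_dir);
    }
    return path;
}

/* Hypothetical helper: free a heap string and clear the caller's
 * pointer, so that a second release on a later code path becomes a
 * harmless free(NULL) instead of a double free. */
static void release_path(char **path)
{
    assert(path != NULL);
    free(*path);
    *path = NULL;
}
```

With this pattern, calling release_path(&abortfile) on more than one exit path
is safe, because every call after the first sees a NULL pointer.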
