Re: [OMPI devel] Errors on aborting programs on 1.8 r32515

2014-08-13 Thread Ralph Castain
Fixed. It was just a lingering free() that should have been removed.
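
For readers following the thread, here is a minimal sketch of how a leftover free() produces the "double free or corruption" abort reported below, and one defensive way to make a second release harmless. This is not the Open MPI source and not the actual change that was committed; release_buffer() and check_marker_file() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Open MPI): free *p and clear it,
 * so that releasing the same variable again is a harmless no-op. */
static void release_buffer(char **p)
{
    if (p != NULL && *p != NULL) {
        free(*p);
        *p = NULL;
    }
}

/* Hypothetical stand-in for a routine that builds a path, uses it,
 * and releases it. Returns 0 on success, -1 if allocation fails. */
static int check_marker_file(const char *dir)
{
    /* sizeof("/aborted") already counts the terminating NUL byte */
    size_t len = strlen(dir) + sizeof("/aborted");
    char *path = malloc(len);
    if (path == NULL) {
        return -1;
    }
    strcpy(path, dir);
    strcat(path, "/aborted");

    /* ... the path would be used here ... */

    release_buffer(&path);      /* intended release */
    assert(path == NULL);       /* the pointer no longer dangles */

    /* A release left behind from older code. Written as a plain
     * free(path) on an uncleared pointer, this second call is what
     * glibc reports as a double free. */
    release_buffer(&path);
    return 0;
}
```

Clearing the pointer only protects against a second release through the same variable. The simpler remedy, and the one described above, is to delete the extra free() altogether.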



On Wed, Aug 13, 2014 at 8:21 AM, Rolf vandeVaart wrote:

> I noticed MTT failures from last night and then reproduced this morning on
> 1.8 branch.  Looks like maybe a double free.  I assume it is related to
> fixes for aborting programs. Maybe related to
> https://svn.open-mpi.org/trac/ompi/changeset/32508 but not sure.
>
> [rvandevaart@drossetti-ivy0 environment]$ pwd
> /ivylogin/home/rvandevaart/tests/ompi-tests/trunk/ibm/environment
> [rvandevaart@drossetti-ivy0 environment]$ mpirun --mca odls_base_verbose
> 20 -np 2 abort
> [...stuff deleted...]
> [drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to
> tag 30 on child [[58714,1],0]
> [drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to
> tag 30 on child [[58714,1],1]
> [drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to
> tag 30 on child [[58714,1],0]
> [drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to
> tag 30 on child [[58714,1],1]
> **
> This program tests MPI_ABORT and generates error messages
> ERRORS ARE EXPECTED AND NORMAL IN THIS PROGRAM!!
> **
> --
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 3.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --
> [drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:wait_local_proc
> child process [[58714,1],0] pid 14955 terminated
> [drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:waitpid_fired child
> [[58714,1],0] exit code 3
> [drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:waitpid_fired
> checking abort file 
> /tmp/openmpi-sessions-rvandevaart@drossetti-ivy0_0/58714/1/0/aborted
> for child [[58714,1],0]
> [drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:waitpid_fired child
> [[58714,1],0] died by call to abort
> *** glibc detected *** mpirun: double free or corruption (fasttop):
> 0x0130e210 ***
>
> From gdb:
> (gdb) where
> #0  0x7f75ede138e5 in raise () from /lib64/libc.so.6
> #1  0x7f75ede1504d in abort () from /lib64/libc.so.6
> #2  0x7f75ede517f7 in __libc_message () from /lib64/libc.so.6
> #3  0x7f75ede57126 in malloc_printerr () from /lib64/libc.so.6
> #4  0x7f75eef9eac4 in odls_base_default_wait_local_proc (pid=14955,
> status=768, cbdata=0x0)
> at ../../../../orte/mca/odls/base/odls_base_default_fns.c:2007
> #5  0x7f75eef60a78 in do_waitall (options=0) at
> ../../orte/runtime/orte_wait.c:554
> #6  0x7f75eef60712 in orte_wait_signal_callback (fd=17, event=8,
> arg=0x7f75ef201400) at ../../orte/runtime/orte_wait.c:421
> #7  0x7f75eecaecbe in event_signal_closure (base=0x1278370,
> ev=0x7f75ef201400)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1081
> #8  0x7f75eecaf7e0 in event_process_active_single_queue
> (base=0x1278370, activeq=0x12788f0)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1359
> #9  0x7f75eecafaca in event_process_active (base=0x1278370)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
> #10 0x7f75eecb0148 in opal_libevent2021_event_base_loop
> (base=0x1278370, flags=1)
> at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
> #11 0x00405572 in orterun (argc=7, argv=0x7fffbdf1dd08) at
> ../../../../orte/tools/orterun/orterun.c:1078
> #12 0x00403904 in main (argc=7, argv=0x7fffbdf1dd08) at
> ../../../../orte/tools/orterun/main.c:13
> (gdb) up
> #1  0x7f75ede1504d in abort () from /lib64/libc.so.6
> (gdb) up
> #2  0x7f75ede517f7 in __libc_message () from /lib64/libc.so.6
> (gdb) up
> #3  0x7f75ede57126 in malloc_printerr () from /lib64/libc.so.6
> (gdb) up
> #4  0x7f75eef9eac4 in odls_base_default_wait_local_proc (pid=14955,
> status=768, cbdata=0x0)
> at ../../../../orte/mca/odls/base/odls_base_default_fns.c:2007
> 2007        free(abortfile);
> (gdb) print abortfile
> $1 = 0x130e210 ""
> (gdb)
>
> ---
> This email message is for the sole use of the intended recipient(s) and
> may contain
> confidential information.  Any unauthorized review, use, disclosure or
> distribution
> is prohibited.  If you are not the intended recipient, please contact the
> sender by
> reply email and destroy all copies of the original message.
>
> ---

[OMPI devel] Errors on aborting programs on 1.8 r32515

2014-08-13 Thread Rolf vandeVaart
I noticed MTT failures from last night and then reproduced this morning on 1.8 
branch.  Looks like maybe a double free.  I assume it is related to fixes for 
aborting programs. Maybe related to 
https://svn.open-mpi.org/trac/ompi/changeset/32508 but not sure.

[rvandevaart@drossetti-ivy0 environment]$ pwd
/ivylogin/home/rvandevaart/tests/ompi-tests/trunk/ibm/environment
[rvandevaart@drossetti-ivy0 environment]$ mpirun --mca odls_base_verbose 20 -np 
2 abort
[...stuff deleted...]
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to tag 30 
on child [[58714,1],0]
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to tag 30 
on child [[58714,1],1]
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to tag 30 
on child [[58714,1],0]
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls: sending message to tag 30 
on child [[58714,1],1]
**
This program tests MPI_ABORT and generates error messages
ERRORS ARE EXPECTED AND NORMAL IN THIS PROGRAM!!
**
--
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 3.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:wait_local_proc child 
process [[58714,1],0] pid 14955 terminated
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:waitpid_fired child 
[[58714,1],0] exit code 3
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:waitpid_fired checking 
abort file /tmp/openmpi-sessions-rvandevaart@drossetti-ivy0_0/58714/1/0/aborted 
for child [[58714,1],0]
[drossetti-ivy0.nvidia.com:14953] [[58714,0],0] odls:waitpid_fired child 
[[58714,1],0] died by call to abort
*** glibc detected *** mpirun: double free or corruption (fasttop): 
0x0130e210 ***

From gdb:
(gdb) where
#0  0x7f75ede138e5 in raise () from /lib64/libc.so.6
#1  0x7f75ede1504d in abort () from /lib64/libc.so.6
#2  0x7f75ede517f7 in __libc_message () from /lib64/libc.so.6
#3  0x7f75ede57126 in malloc_printerr () from /lib64/libc.so.6
#4  0x7f75eef9eac4 in odls_base_default_wait_local_proc (pid=14955, 
status=768, cbdata=0x0)
at ../../../../orte/mca/odls/base/odls_base_default_fns.c:2007
#5  0x7f75eef60a78 in do_waitall (options=0) at 
../../orte/runtime/orte_wait.c:554
#6  0x7f75eef60712 in orte_wait_signal_callback (fd=17, event=8, 
arg=0x7f75ef201400) at ../../orte/runtime/orte_wait.c:421
#7  0x7f75eecaecbe in event_signal_closure (base=0x1278370, 
ev=0x7f75ef201400)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1081
#8  0x7f75eecaf7e0 in event_process_active_single_queue (base=0x1278370, 
activeq=0x12788f0)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1359
#9  0x7f75eecafaca in event_process_active (base=0x1278370)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
#10 0x7f75eecb0148 in opal_libevent2021_event_base_loop (base=0x1278370, 
flags=1)
at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
#11 0x00405572 in orterun (argc=7, argv=0x7fffbdf1dd08) at 
../../../../orte/tools/orterun/orterun.c:1078
#12 0x00403904 in main (argc=7, argv=0x7fffbdf1dd08) at 
../../../../orte/tools/orterun/main.c:13
(gdb) up
#1  0x7f75ede1504d in abort () from /lib64/libc.so.6
(gdb) up
#2  0x7f75ede517f7 in __libc_message () from /lib64/libc.so.6
(gdb) up
#3  0x7f75ede57126 in malloc_printerr () from /lib64/libc.so.6
(gdb) up
#4  0x7f75eef9eac4 in odls_base_default_wait_local_proc (pid=14955, 
status=768, cbdata=0x0)
at ../../../../orte/mca/odls/base/odls_base_default_fns.c:2007
2007        free(abortfile);
(gdb) print abortfile
$1 = 0x130e210 ""
(gdb) 
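
A side note on reading the backtrace: the status=768 argument in frame #4 agrees with the "exit code 3" line in the verbose output, because on Linux the exit code sits in the second byte of the wait status (768 == 3 << 8). The sketch below decodes it with the standard <sys/wait.h> macros. exit_code_from_status() is a hypothetical helper written for this note, not an Open MPI function, and POSIX does not fix the numeric layout of the status word, so the literal 768 is only meaningful on platforms that encode it this way.

```c
#include <sys/wait.h>

/* Hypothetical helper: return the child's exit code if it exited
 * normally, or -1 if it ended some other way (e.g. killed by a signal). */
static int exit_code_from_status(int status)
{
    if (WIFEXITED(status)) {
        return WEXITSTATUS(status);
    }
    return -1;
}
```

Calling exit_code_from_status(768) on Linux yields 3, the errorcode passed to MPI_ABORT in this test.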