Hi Jeff,

I've reproduced your test here, with the same results.  Moreover, if I
put the ranks > 0 into a blocking MPI call (MPI_Bcast or MPI_Barrier),
I still get the same behavior; namely, rank 0's call to abort()
generates a core file and terminates the job, which is the behavior I
want.  I'll look at my code a bit more, but the only difference I see
now is that in my code a floating-point exception triggers a signal
handler that calls abort().  I don't see why that should behave
differently from your test.
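
In case it helps to see the shape of it, here's a stripped-down sketch
(not my actual code -- the handler and variable names are made up, and
the MPI calls are omitted -- but feenableexcept() is the glibc call I
use to unmask the FP traps):

#define _GNU_SOURCE          /* for feenableexcept() */
#include <fenv.h>
#include <signal.h>
#include <stdlib.h>

static void fpe_handler(int sig)
{
    (void) sig;
    abort();                 /* should drop core, same as in your test */
}

int main(void)
{
    feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);
    signal(SIGFPE, fpe_handler);

    volatile double zero = 0.0;
    return (int) (1.0 / zero);   /* division traps to fpe_handler */
}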

Thanks for your help.

David

On Mon, 2010-08-16 at 09:54 -0700, Jeff Squyres wrote:
> FWIW, I'm unable to replicate your behavior.  This is with Open MPI 1.4.2 on 
> RHEL5:
> 
> ----
> [9:52] svbu-mpi:~/mpi % cat abort.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>   /* for sleep() */
> #include <mpi.h>
> 
> int main(int argc, char **argv)
> {
>     int rank;
> 
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     if (0 == rank) {
>         abort();
>     }
>     printf("Rank %d sleeping...\n", rank);
>     sleep(600);
>     printf("Rank %d finalizing...\n", rank);
>     MPI_Finalize();
>     return 0;
> }
> [9:52] svbu-mpi:~/mpi % mpicc abort.c -o abort
> [9:52] svbu-mpi:~/mpi % ls -l core*
> ls: No match.
> [9:52] svbu-mpi:~/mpi % mpirun -np 4 --bynode --host svbu-mpi055,svbu-mpi056 
> ./abort
> Rank 1 sleeping...
> [svbu-mpi055:03991] *** Process received signal ***
> [svbu-mpi055:03991] Signal: Aborted (6)
> [svbu-mpi055:03991] Signal code:  (-6)
> [svbu-mpi055:03991] [ 0] /lib64/libpthread.so.0 [0x2b45caac87c0]
> [svbu-mpi055:03991] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x2b45cad05265]
> [svbu-mpi055:03991] [ 2] /lib64/libc.so.6(abort+0x110) [0x2b45cad06d10]
> [svbu-mpi055:03991] [ 3] ./abort(main+0x36) [0x4008ee]
> [svbu-mpi055:03991] [ 4] /lib64/libc.so.6(__libc_start_main+0xf4) 
> [0x2b45cacf2994]
> [svbu-mpi055:03991] [ 5] ./abort [0x400809]
> [svbu-mpi055:03991] *** End of error message ***
> Rank 3 sleeping...
> Rank 2 sleeping...
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 3991 on node svbu-mpi055 exited 
> on signal 6 (Aborted).
> --------------------------------------------------------------------------
> [9:52] svbu-mpi:~/mpi % ls -l core*
> -rw------- 1 jsquyres eng5 26009600 Aug 16 09:52 core.abort-1281977540-3991
> [9:52] svbu-mpi:~/mpi % file core.abort-1281977540-3991 
> core.abort-1281977540-3991: ELF 64-bit LSB core file AMD x86-64, version 1 
> (SYSV), SVR4-style, from 'abort'
> [9:52] svbu-mpi:~/mpi % 
> ----
> 
> You can see that all processes die immediately, and I get a corefile from the 
> process that called abort().
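> 
> (If you want to sanity-check a core like that, loading it with
> "gdb ./abort core.abort-1281977540-3991" and typing "bt" at the gdb
> prompt should show the abort() frame near the top of the stack.)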
> 
> 
> On Aug 16, 2010, at 9:25 AM, David Ronis wrote:
> 
> > I've tried both -- as you said, MPI_Abort() doesn't drop a core
> > file, but it does kill off the entire MPI job.  abort() drops a core
> > file when I'm running on 1 processor, but not in a multiprocessor
> > run.  In addition, a node calling abort() doesn't lead to the entire
> > run being killed off.
> > 
> > David
> > On Mon, 2010-08-16 at 08:51 -0700, Jeff Squyres wrote:
> >> On Aug 13, 2010, at 12:53 PM, David Ronis wrote:
> >> 
> >>> I'm using mpirun and the nodes are all on the same machine (an
> >>> 8-CPU box with an Intel i7).  Core file size is unlimited:
> >>> 
> >>> ulimit -a
> >>> core file size          (blocks, -c) unlimited
> >> 
> >> That looks good.
> >> 
> >> In reviewing the email thread, it's not entirely clear: are you calling 
> >> abort() or MPI_Abort()?  MPI_Abort() won't drop a core file.  abort() 
> >> should.
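> >> 
> >> To make the distinction concrete, a quick sketch (the condition is
> >> hypothetical, but MPI_Abort(comm, errorcode) is the real signature):
> >> 
> >>     if (detected_error) {               /* hypothetical condition */
> >>         MPI_Abort(MPI_COMM_WORLD, 1);   /* kills the whole job; no core */
> >>     }
> >>     /* ...versus... */
> >>     if (detected_error) {
> >>         abort();                        /* SIGABRT; drops core if ulimits allow */
> >>     }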
> >> 
> > 
