Hi Jeff, I've reproduced your test here, with the same results. Moreover, if I put the ranks > 0 into a blocking MPI call (MPI_Bcast or MPI_Barrier), I still get the same behavior; namely, rank 0 calling abort() generates a core file and leads to termination of the whole job, which is the behavior I want. I'll look at my code a bit more, but the only difference I see now is that in my code a floating-point exception triggers a signal handler that calls abort(). I don't see why that should behave differently from your test.
Thanks for your help.

David

On Mon, 2010-08-16 at 09:54 -0700, Jeff Squyres wrote:
> FWIW, I'm unable to replicate your behavior. This is with Open MPI 1.4.2 on
> RHEL5:
>
> ----
> [9:52] svbu-mpi:~/mpi % cat abort.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     if (0 == rank) {
>         abort();
>     }
>     printf("Rank %d sleeping...\n", rank);
>     sleep(600);
>     printf("Rank %d finalizing...\n", rank);
>     MPI_Finalize();
>     return 0;
> }
> [9:52] svbu-mpi:~/mpi % mpicc abort.c -o abort
> [9:52] svbu-mpi:~/mpi % ls -l core*
> ls: No match.
> [9:52] svbu-mpi:~/mpi % mpirun -np 4 --bynode --host svbu-mpi055,svbu-mpi056 ./abort
> Rank 1 sleeping...
> [svbu-mpi055:03991] *** Process received signal ***
> [svbu-mpi055:03991] Signal: Aborted (6)
> [svbu-mpi055:03991] Signal code: (-6)
> [svbu-mpi055:03991] [ 0] /lib64/libpthread.so.0 [0x2b45caac87c0]
> [svbu-mpi055:03991] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x2b45cad05265]
> [svbu-mpi055:03991] [ 2] /lib64/libc.so.6(abort+0x110) [0x2b45cad06d10]
> [svbu-mpi055:03991] [ 3] ./abort(main+0x36) [0x4008ee]
> [svbu-mpi055:03991] [ 4] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b45cacf2994]
> [svbu-mpi055:03991] [ 5] ./abort [0x400809]
> [svbu-mpi055:03991] *** End of error message ***
> Rank 3 sleeping...
> Rank 2 sleeping...
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 3991 on node svbu-mpi055 exited
> on signal 6 (Aborted).
> --------------------------------------------------------------------------
> [9:52] svbu-mpi:~/mpi % ls -l core*
> -rw------- 1 jsquyres eng5 26009600 Aug 16 09:52 core.abort-1281977540-3991
> [9:52] svbu-mpi:~/mpi % file core.abort-1281977540-3991
> core.abort-1281977540-3991: ELF 64-bit LSB core file AMD x86-64, version 1
> (SYSV), SVR4-style, from 'abort'
> [9:52] svbu-mpi:~/mpi %
> -----
>
> You can see that all processes die immediately, and I get a corefile from the
> process that called abort().
>
>
> On Aug 16, 2010, at 9:25 AM, David Ronis wrote:
>
> > I've tried both--as you said, MPI_Abort doesn't drop a core file, but
> > does kill off the entire MPI job. abort() drops core when I'm running
> > on 1 processor, but not in a multiprocessor run. In addition, a node
> > calling abort() doesn't lead to the entire run being killed off.
> >
> > David
> >
> > On Mon, 2010-08-16 at 08:51 -0700, Jeff Squyres wrote:
> >> On Aug 13, 2010, at 12:53 PM, David Ronis wrote:
> >>
> >>> I'm using mpirun and the nodes are all on the same machine (an 8-cpu box
> >>> with an Intel i7). The core file size is unlimited:
> >>>
> >>> ulimit -a
> >>> core file size (blocks, -c) unlimited
> >>
> >> That looks good.
> >>
> >> In reviewing the email thread, it's not entirely clear: are you calling
> >> abort() or MPI_Abort()? MPI_Abort() won't drop a core file. abort()
> >> should.
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users