On Mon, Oct 24, 2005 at 06:03:02PM -0500, Troy Benjegerdes wrote:
> troy@opteron1:/usr/src/netpipe3-dev$ mpirun -np 2 -mca btl_base_exclude
> openib NPmpi
> 1: opteron1
> 0: opteron1
> mpirun noticed that job rank 1 with PID 352 on node "localhost" exited
> on signal 11.
> 1 process killed (possibly by Open MPI)
>
> This is debian-amd64 (from
> deb http://mirror.espri.arizona.edu/debian-amd64/debian/ etch main )
>
> On Mon, Oct 24, 2005 at 10:36:29AM -0500, Brian Barrett wrote:
> > That's a really weird backtrace - it seems to indicate that the
> > datatype engine is improperly calling free(). Can you try running
> > without openib (add "-mca btl_base_exclude openib" to the mpirun
> > arguments) and see if the problem goes away? Also, what platform was
> > this on?

Okay.. here's another backtrace, this time with no openib.

0x00002aaaab6fb365 in malloc_usable_size () from /lib/libc.so.6
(gdb) bt
#0  0x00002aaaab6fb365 in malloc_usable_size () from /lib/libc.so.6
#1  0x00002aaaaaecb016 in opal_mem_free_free_hook () from /usr/local/lib/libopal.so.0
#2  0x00002aaaaac0c663 in ompi_convertor_cleanup () from /usr/local/lib/libmpi.so.0
#3  0x00002aaaaeb41dbe in mca_pml_ob1_match_completion_cache () from /usr/local/lib/openmpi/mca_pml_ob1.so
#4  0x00002aaaaf179c7b in mca_btl_sm_component_progress () from /usr/local/lib/openmpi/mca_btl_sm.so
#5  0x00002aaaaee5eefe in mca_bml_r2_progress () from /usr/local/lib/openmpi/mca_bml_r2.so
#6  0x00002aaaaeb3dd4e in mca_pml_ob1_progress () from /usr/local/lib/openmpi/mca_pml_ob1.so
#7  0x00002aaaaaeb5c4a in opal_progress () from /usr/local/lib/libopal.so.0
#8  0x00002aaaaeb3c265 in mca_pml_ob1_recv () from /usr/local/lib/openmpi/mca_pml_ob1.so
#9  0x00002aaaaf6a0936 in mca_coll_basic_barrier_intra_lin () from /usr/local/lib/openmpi/mca_coll_basic.so
#10 0x00002aaaaac1f3b8 in PMPI_Barrier () from /usr/local/lib/libmpi.so.0
#11 0x00000000004030a2 in Sync (p=0x10053d900) at src/mpi.c:89
#12 0x0000000000401f83 in main (argc=2, argv=0x7fffffe30ae8) at src/netpipe.c:463