I'm assuming that this is a production version of NetPipe (NP), right? (i.e., not a development version)

Can you run the MPI processes through valgrind to see where the error really occurs? The corefile only shows where the process finally crashed, not the actual cause.
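
As a rough sketch of what I mean: put valgrind between mpirun and the benchmark binary so that every rank runs under it. The line below reuses the mpirun arguments from the quoted run. The `--num-callers=20` option is only my assumption for getting a usefully deep stack trace, and valgrind is assumed to be installed on the node.

```shell
# Dry-run sketch: build the command line and print it for inspection.
np=2                         # same rank count as the failing run
app=NPmpi                    # NetPipe MPI binary from the quoted run
vg_opts="--num-callers=20"   # assumption: deeper stack traces than the default

cmd="mpirun -np $np -mca btl_base_exclude openib valgrind $vg_opts $app"
echo "$cmd"
# To actually launch it, run the printed line (or: eval "$cmd").
```

Valgrind reports invalid reads, invalid writes, and bad free() calls at the point where they happen, which should be earlier than the point captured in the corefile.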


Troy Benjegerdes wrote:
On Mon, Oct 24, 2005 at 06:03:02PM -0500, Troy Benjegerdes wrote:

troy@opteron1:/usr/src/netpipe3-dev$ mpirun -np 2 -mca btl_base_exclude openib NPmpi
1: opteron1
0: opteron1
mpirun noticed that job rank 1 with PID 352 on node "localhost" exited on signal 11.
1 process killed (possibly by Open MPI)

This is debian-amd64 (from deb http://mirror.espri.arizona.edu/debian-amd64/debian/ etch main )

On Mon, Oct 24, 2005 at 10:36:29AM -0500, Brian Barrett wrote:

That's a really weird backtrace - it seems to indicate that the datatype engine is improperly calling free(). Can you try running without openib (add "-mca btl_base_exclude openib" to the mpirun arguments) and see if the problem goes away? Also, what platform was this on?


Okay.. here's another backtrace, this time with no openib.

0x00002aaaab6fb365 in malloc_usable_size () from /lib/libc.so.6
(gdb) bt
#0  0x00002aaaab6fb365 in malloc_usable_size () from /lib/libc.so.6
#1  0x00002aaaaaecb016 in opal_mem_free_free_hook ()
   from /usr/local/lib/libopal.so.0
#2  0x00002aaaaac0c663 in ompi_convertor_cleanup ()
   from /usr/local/lib/libmpi.so.0
#3  0x00002aaaaeb41dbe in mca_pml_ob1_match_completion_cache ()
   from /usr/local/lib/openmpi/mca_pml_ob1.so
#4  0x00002aaaaf179c7b in mca_btl_sm_component_progress ()
   from /usr/local/lib/openmpi/mca_btl_sm.so
#5  0x00002aaaaee5eefe in mca_bml_r2_progress ()
   from /usr/local/lib/openmpi/mca_bml_r2.so
#6  0x00002aaaaeb3dd4e in mca_pml_ob1_progress ()
   from /usr/local/lib/openmpi/mca_pml_ob1.so
#7  0x00002aaaaaeb5c4a in opal_progress ()
   from /usr/local/lib/libopal.so.0
#8  0x00002aaaaeb3c265 in mca_pml_ob1_recv ()
   from /usr/local/lib/openmpi/mca_pml_ob1.so
#9  0x00002aaaaf6a0936 in mca_coll_basic_barrier_intra_lin ()
   from /usr/local/lib/openmpi/mca_coll_basic.so
#10 0x00002aaaaac1f3b8 in PMPI_Barrier ()
   from /usr/local/lib/libmpi.so.0
#11 0x00000000004030a2 in Sync (p=0x10053d900) at src/mpi.c:89
#12 0x0000000000401f83 in main (argc=2, argv=0x7fffffe30ae8)
    at src/netpipe.c:463



--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
