troy@opteron1:/usr/src/netpipe3-dev$ mpirun -np 2 -mca btl_base_exclude openib NPmpi
1: opteron1
0: opteron1
mpirun noticed that job rank 1 with PID 352 on node "localhost" exited on signal 11.
1 process killed (possibly by Open MPI)

This is debian-amd64 (from
deb http://mirror.espri.arizona.edu/debian-amd64/debian/ etch main)

On Mon, Oct 24, 2005 at 10:36:29AM -0500, Brian Barrett wrote:
> That's a really weird backtrace - it seems to indicate that the
> datatype engine is improperly calling free(). Can you try running
> without openib (add "-mca btl_base_exclude openib" to the mpirun
> arguments) and see if the problem goes away? Also, what platform was
> this on?
>
> Thanks,
>
> Brian
>
> On Oct 21, 2005, at 6:37 PM, Troy Benjegerdes wrote:
>
> > On Fri, Oct 21, 2005 at 06:26:32PM -0500, Troy Benjegerdes wrote:
> >
> >> On Fri, Oct 21, 2005 at 04:12:05PM -0500, Andrew Friedley wrote:
> >>
> >>> I just committed a fix to the trunk to fix your original segfault
> >>> down in opal_show_help() - this is the same problem Ken posted.
> >>> This fix should make it into the v1.0 branch eventually. Even so,
> >>> you are going to run into the real problem you were handling -
> >>> this fix is just for proper error handling/output.
> >>>
> >>> The error below looks like a word size mismatch - one thing is
> >>> compiled 64bit, the other is compiled 32bit. Make sure everything
> >>> is compiled either 32bit or 64bit.
> >>>
> >>
> >> Another note.. I think I may have had some problems because I
> >> built with 'make -j16'.. has anyone else tried parallel make builds?
> >>
> >> I have a working mpirun now.
> >>
> >> Now I'm back to having NetPIPE segfault when I run it..
> >>
> >
> > And here's a backtrace:
> >
> > 0x00002aaaab6fb365 in malloc_usable_size () from /lib/libc.so.6
> > #0  0x00002aaaab6fb365 in malloc_usable_size () from /lib/libc.so.6
> > #1  0x00002aaaaaecb016 in opal_mem_free_free_hook ()
> >     from /usr/local/lib/libopal.so.0
> > #2  0x00002aaaaac0c663 in ompi_convertor_cleanup ()
> >     from /usr/local/lib/libmpi.so.0
> > #3  0x00002aaaaeb41dbe in mca_pml_ob1_match_completion_cache ()
> >     from /usr/local/lib/openmpi/mca_pml_ob1.so
> > #4  0x00002aaaaf179c7b in mca_btl_sm_component_progress ()
> >     from /usr/local/lib/openmpi/mca_btl_sm.so
> > #5  0x00002aaaaee5eefe in mca_bml_r2_progress ()
> >     from /usr/local/lib/openmpi/mca_bml_r2.so
> > #6  0x00002aaaaeb3dd4e in mca_pml_ob1_progress ()
> >     from /usr/local/lib/openmpi/mca_pml_ob1.so
> > #7  0x00002aaaaaeb5c4a in opal_progress ()
> >     from /usr/local/lib/libopal.so.0
> > #8  0x00002aaaaeb3c265 in mca_pml_ob1_recv ()
> >     from /usr/local/lib/openmpi/mca_pml_ob1.so
> > #9  0x00002aaaaf6a0936 in mca_coll_basic_barrier_intra_lin ()
> >     from /usr/local/lib/openmpi/mca_coll_basic.so
> > #10 0x00002aaaaac1f3b8 in PMPI_Barrier ()
> >     from /usr/local/lib/libmpi.so.0
> > ---Type <return> to continue, or q <return> to quit---
> > #11 0x0000000000403016 in Sync ()
> > #12 0x0000000000401ef8 in main ()
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
--------------------------------------------------------------------------
Troy Benjegerdes                'da hozer'                ho...@hozed.org

Someone asked me why I work on this free (http://www.fsf.org/philosophy/)
software stuff and not get a real job. Charles Schulz had the best answer:

"Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's why
I draw cartoons. It's my life." -- Charles Schulz