Re: [O-MPI devel] ppc64 linux bustage?
Hi Troy,

* Troy Benjegerdes wrote on Tue, Oct 25, 2005 at 03:30:46AM CEST:
> On Mon, Oct 24, 2005 at 07:30:54PM -0500, Brian Barrett wrote:
> > On Oct 24, 2005, at 6:56 PM, Troy Benjegerdes wrote:
> > >
> > >> | configure: /bin/sh './configure' *failed* for opal/libltdl
> > >>
> > >> Troy, could you show opal/libltdl/config.log, or, if that does not
> > >> reveal anything suspicious, the corresponding part of the toplevel
> > >> config.log (the above message should be recorded there)? Thanks.
> > >>
> > >
> > > AR.. libltdl3-dev seemed to not be installed.
> > >
> > > Any way to make a check for this more explicit in autogen.sh?
> >
> > We don't use the system-installed libltdl, and always build our own.
> > It looks like you should only need the libtool package, which we
> > should check for already in autogen.sh. Was there any useful error
> > message along the way?
>
> I recall some error about libtool/libltdl, but it looked like it
> succeeded.

If you don't have the exact output any more, could you please rerun autogen.sh and post the result?

> Do you have a debian system you can remove the libltdl3 and libltdl3-dev
> packages from?

Yes, I could try that tonight (my timezone), but..

> It seems like there's some strange dependency on this.

I don't think this is the cause of the error either. Which libtoolize version does autogen.sh pick up?

Cheers,
Ralf
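For reference, something along these lines should show which libtoolize and libtool autogen.sh would pick up from the PATH, and capture a fresh autogen.sh run (the log file name is just a suggestion):

  $ which libtoolize && libtoolize --version | head -1
  $ which libtool && libtool --version | head -1
  $ ./autogen.sh 2>&1 | tee autogen.log

If the libtool package really were missing, the first two commands should make that obvious, and the tee'd log would show exactly where autogen.sh complains.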
Re: [O-MPI devel] Segfaults on startup? (ORTE_ERROR_LOG)
Hi Jeff, Troy,

* Jeff Squyres wrote on Mon, Oct 24, 2005 at 06:43:38PM CEST:
> Troy Benjegerdes wrote:
> >
> > Another note.. I think I may have had some problems because I built with
> > 'make -j16'.. has anyone else tried parallel make builds?
>
> I am jumping into this thread late -- but FWIW:

Me too. :)

> 1. Yes, we build VPATH (with both relative and absolute flavors) every
> night. So the build works fine. If gdb can't find stuff, that's a
> different issue -- I don't know if the linker stores VPATH information
> properly for debuggers to find files properly or not (this is part of
> the Automake magic that we rely on).

Finding sources is usually not an autotools or make issue; as far as I know, it depends on how sophisticated the compiler is at creating debugging information (and on whether the binary format allows storing it). When in doubt, gdb's `directory' command can be used to specify additional directories in which to search for source files.

> 2. Yes, we do parallel builds all the time (and every night).

Now that Troy mentions it: I vaguely remember bug reports about parallel builds failing with some versions of GNU make and the autotools, but I have never been able to reproduce any of them (except for some that turned out to have a different cause). If this error persists and you are willing to help, I would like to know: the exact versions of make and all autotools, the exact source tree (svn revision of Open MPI), build system details (OS, kernel version, number of processors), and a complete log of the configure and make output. Thanks!

Cheers,
Ralf
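To illustrate the `directory' command: inside gdb, something like the following adds an extra source search path and shows the resulting list (the path is only a placeholder for wherever your Open MPI tree actually lives):

  (gdb) directory /path/to/openmpi/source
  (gdb) show directories

And for the version information asked for above, output along these lines would already be a good start (the svn command assumes it is run inside the Open MPI checkout):

  $ make --version | head -1
  $ autoconf --version | head -1
  $ automake --version | head -1
  $ libtool --version | head -1
  $ svn info | grep Revision
  $ uname -a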
Re: [O-MPI devel] MPI_Barrier in Netpipe causes segfault
I'm assuming that this is a production version of NP, right? (i.e., not a development version)

Can you run the MPI processes through valgrind to see where the error really occurs? This corefile only shows the final results, not the actual cause.

Troy Benjegerdes wrote:
> On Mon, Oct 24, 2005 at 06:03:02PM -0500, Troy Benjegerdes wrote:
> > troy@opteron1:/usr/src/netpipe3-dev$ mpirun -np 2 -mca btl_base_exclude openib NPmpi
> > 1: opteron1
> > 0: opteron1
> > mpirun noticed that job rank 1 with PID 352 on node "localhost" exited on signal 11.
> > 1 process killed (possibly by Open MPI)
> >
> > This is debian-amd64 (from deb http://mirror.espri.arizona.edu/debian-amd64/debian/ etch main )
> >
> > On Mon, Oct 24, 2005 at 10:36:29AM -0500, Brian Barrett wrote:
> > > That's a really weird backtrace - it seems to indicate that the
> > > datatype engine is improperly calling free(). Can you try running
> > > without openib (add "-mca btl_base_exclude openib" to the mpirun
> > > arguments) and see if the problem goes away? Also, what platform
> > > was this on?
>
> Okay.. here's another backtrace, this time with no openib.
>
> 0x2b6fb365 in malloc_usable_size () from /lib/libc.so.6
> (gdb) bt
> #0  0x2b6fb365 in malloc_usable_size () from /lib/libc.so.6
> #1  0x2aecb016 in opal_mem_free_free_hook () from /usr/local/lib/libopal.so.0
> #2  0x2ac0c663 in ompi_convertor_cleanup () from /usr/local/lib/libmpi.so.0
> #3  0x2eb41dbe in mca_pml_ob1_match_completion_cache () from /usr/local/lib/openmpi/mca_pml_ob1.so
> #4  0x2f179c7b in mca_btl_sm_component_progress () from /usr/local/lib/openmpi/mca_btl_sm.so
> #5  0x2ee5eefe in mca_bml_r2_progress () from /usr/local/lib/openmpi/mca_bml_r2.so
> #6  0x2eb3dd4e in mca_pml_ob1_progress () from /usr/local/lib/openmpi/mca_pml_ob1.so
> #7  0x2aeb5c4a in opal_progress () from /usr/local/lib/libopal.so.0
> #8  0x2eb3c265 in mca_pml_ob1_recv () from /usr/local/lib/openmpi/mca_pml_ob1.so
> #9  0x2f6a0936 in mca_coll_basic_barrier_intra_lin () from /usr/local/lib/openmpi/mca_coll_basic.so
> #10 0x2ac1f3b8 in PMPI_Barrier () from /usr/local/lib/libmpi.so.0
> #11 0x004030a2 in Sync (p=0x10053d900) at src/mpi.c:89
> #12 0x00401f83 in main (argc=2, argv=0x7fe30ae8) at src/netpipe.c:463

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
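For what it's worth, a run along these lines is usually enough to catch the first invalid access or bogus free() under valgrind (the NPmpi command line is copied from Troy's mail above; exact option behaviour varies a bit between valgrind versions):

  $ mpirun -np 2 -mca btl_base_exclude openib \
      valgrind --tool=memcheck --leak-check=full \
      --log-file=npmpi-vg NPmpi

Depending on the valgrind version, --log-file either appends the PID automatically or needs a %p placeholder so that each rank gets its own log; either way, the first "Invalid read/write" or "Invalid free()" reported there should point much closer to the real cause than the backtrace of the final crash.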
Re: [O-MPI devel] MPI_Barrier in Netpipe causes segfault
Troy --

We've managed to replicate this problem and are looking into it. Thanks for reporting it!

Troy Benjegerdes wrote:
> On Mon, Oct 24, 2005 at 06:03:02PM -0500, Troy Benjegerdes wrote:
> > troy@opteron1:/usr/src/netpipe3-dev$ mpirun -np 2 -mca btl_base_exclude openib NPmpi
> > 1: opteron1
> > 0: opteron1
> > mpirun noticed that job rank 1 with PID 352 on node "localhost" exited on signal 11.
> > 1 process killed (possibly by Open MPI)
> >
> > This is debian-amd64 (from deb http://mirror.espri.arizona.edu/debian-amd64/debian/ etch main )
> >
> > On Mon, Oct 24, 2005 at 10:36:29AM -0500, Brian Barrett wrote:
> > > That's a really weird backtrace - it seems to indicate that the
> > > datatype engine is improperly calling free(). Can you try running
> > > without openib (add "-mca btl_base_exclude openib" to the mpirun
> > > arguments) and see if the problem goes away? Also, what platform
> > > was this on?
>
> Okay.. here's another backtrace, this time with no openib.
>
> 0x2b6fb365 in malloc_usable_size () from /lib/libc.so.6
> (gdb) bt
> #0  0x2b6fb365 in malloc_usable_size () from /lib/libc.so.6
> #1  0x2aecb016 in opal_mem_free_free_hook () from /usr/local/lib/libopal.so.0
> #2  0x2ac0c663 in ompi_convertor_cleanup () from /usr/local/lib/libmpi.so.0
> #3  0x2eb41dbe in mca_pml_ob1_match_completion_cache () from /usr/local/lib/openmpi/mca_pml_ob1.so
> #4  0x2f179c7b in mca_btl_sm_component_progress () from /usr/local/lib/openmpi/mca_btl_sm.so
> #5  0x2ee5eefe in mca_bml_r2_progress () from /usr/local/lib/openmpi/mca_bml_r2.so
> #6  0x2eb3dd4e in mca_pml_ob1_progress () from /usr/local/lib/openmpi/mca_pml_ob1.so
> #7  0x2aeb5c4a in opal_progress () from /usr/local/lib/libopal.so.0
> #8  0x2eb3c265 in mca_pml_ob1_recv () from /usr/local/lib/openmpi/mca_pml_ob1.so
> #9  0x2f6a0936 in mca_coll_basic_barrier_intra_lin () from /usr/local/lib/openmpi/mca_coll_basic.so
> #10 0x2ac1f3b8 in PMPI_Barrier () from /usr/local/lib/libmpi.so.0
> #11 0x004030a2 in Sync (p=0x10053d900) at src/mpi.c:89
> #12 0x00401f83 in main (argc=2, argv=0x7fe30ae8) at src/netpipe.c:463

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/