Hi Jeff,

Thanks, the option --mca btl ^openib works fine!
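To avoid typing the option on every run, I also set it permanently; as far as I understand the Open MPI docs, either of these should be equivalent (please correct me if not):

   # set it in the environment before calling mpirun
   export OMPI_MCA_btl=^openib

   # or add this line to $HOME/.openmpi/mca-params.conf
   btl = ^openib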
Half of the cluster has InfiniBand/OpenFabrics (node49 to node96) and the other half (node01 to node48) doesn't. I just wanted to make Open MPI run over ethernet/TCP first. I will try to make it run over OpenFabrics next; do I need to recompile the package for that? If I mix nodes that have OpenFabrics with nodes that don't, I should use the option "--mca btl ^openib", right? And if I run on homogeneous nodes only (either all without OpenFabrics or all with it), I don't need the option anymore? Finally, on the OpenFabrics nodes, will Open MPI use the InfiniBand hardware automatically?
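In the meantime, here is what I plan to try in order to see which transports my build knows about and to force a specific one; this is only my reading of the FAQ, so please tell me if the syntax is off:

   # list the BTL components compiled into this build
   ompi_info | grep btl

   # ethernet-only nodes: TCP plus shared memory and self
   mpirun --mca btl tcp,sm,self -np 6 ~/hello.x

   # OpenFabrics-only nodes: InfiniBand plus shared memory and self
   mpirun --mca btl openib,sm,self -np 6 ~/hello.x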
Thanks a lot.
SB

users-requ...@open-mpi.org wrote:
> Date: Thu, 5 Mar 2009 17:25:34 -0500
> From: Jeff Squyres <jsquy...@cisco.com>
> Subject: Re: [OMPI users] Runtime error only on one node.
> To: "Open MPI Users" <us...@open-mpi.org>
>
> Whoops; we shouldn't be seg faulting. :-\
>
> The warning is exactly what it implies -- it found the OpenFabrics
> network stack but no functioning OpenFabrics-capable hardware. You can
> disable it (and the segv) by disabling the openib BTL from running:
>
> mpirun --mca btl ^openib
>
> But what I don't see is why we're segv'ing when calling
> ibv_destroy_srq(). This is a function in the shutdown sequence of the
> openib BTL, but it shouldn't be getting called given the error
> message that you're seeing. Are you getting corefiles, perchance?
> Could you get a stack trace with the file and line numbers in OMPI
> where this is happening?
>
> Do you have OpenFabrics hardware on your cluster? According to your
> error message, node18 is the one that doesn't find any OF-capable
> hardware, but node66 is the one that segv's, which is darn weird...
>
> On Mar 5, 2009, at 12:13 AM, Shinta Bonnefoy wrote:
>
>> Hi,
>>
>> I am the admin of a small cluster (server running under SLES 10.1 and
>> nodes on OSS 10.3), and I have just installed Open MPI 1.3 on it.
>>
>> I'm trying to get a simple program (like hello world) running, but it
>> fails every time on one of the nodes and never on the others.
>>
>> I don't think it's related to the program, since it's the simplest one
>> you can write.
>>
>> All the nodes share the Open MPI install directory through NFS and
>> all have the same profile.
>>
>> Here is the runtime error I get:
>> mpirun -machinefile no -np 6 ~/hello.x
>> --------------------------------------------------------------------------
>> [[6735,1],0]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>>
>> Module: OpenFabrics (openib)
>> Host: node18
>>
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> Hello world from process 3 of 6
>> Hello world from process 1 of 6
>> Hello world from process 4 of 6
>> Hello world from process 2 of 6
>> Hello world from process 5 of 6
>> Hello world from process 0 of 6
>> [node66:03997] *** Process received signal ***
>> [node66:03997] Signal: Segmentation fault (11)
>> [node66:03997] Signal code: Address not mapped (1)
>> [node66:03997] Failing at address: (nil)
>> [node66:03997] [ 0] /lib64/libpthread.so.0 [0x2b5e227a4fb0]
>> [node66:03997] [ 1] /usr/lib64/libibverbs.so.1(ibv_destroy_srq+0) [0x2b5e24ee0fa0]
>> [node66:03997] [ 2] /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_btl_openib.so [0x2b5e250eb2dd]
>> [node66:03997] [ 3] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(mca_btl_base_close+0x87) [0x2b5e21aa2a67]
>> [node66:03997] [ 4] /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_bml_r2.so [0x2b5e24cc39d2]
>> [node66:03997] [ 5] /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_pml_ob1.so [0x2b5e24aa2d0e]
>> [node66:03997] [ 6] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(mca_pml_base_finalize+0x1b) [0x2b5e21aacd2f]
>> [node66:03997] [ 7] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0 [0x2b5e21a66a7b]
>> [node66:03997] [ 8] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(MPI_Finalize+0x17) [0x2b5e21a84207]
>> [node66:03997] [ 9] /home/donald/hello.x(main+0x6d) [0x401bd5]
>> [node66:03997] [10] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b5e229cfb54]
>> [node66:03997] [11] /home/donald/hello.x [0x401ad9]
>> [node66:03997] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 5 with PID 3997 on node node66 exited
>> on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> [node72:07895] 4 more processes have sent help message
>> help-mpi-btl-base.txt / btl:no-nics
>> [node72:07895] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>> all help / error messages
>>
>> Please advise,
>> Thanks and regards,
>> SB
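P.S. In case it helps with the debugging: hello.x is essentially the textbook MPI hello world, roughly the following (retyped from memory, not a copy-paste of the exact file):

   #include <stdio.h>
   #include <mpi.h>

   int main(int argc, char **argv)
   {
       int rank, size;
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &size);
       printf("Hello world from process %d of %d\n", rank, size);
       MPI_Finalize();
       return 0;
   }

I will also try to enable core files (ulimit -c unlimited) and get a backtrace with gdb on node66, as you asked.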