Hi, I am the admin of a small cluster (the server runs SLES 10.1 and the nodes run OSS 10.3), and I have just installed Open MPI 1.3 on it.
I'm trying to get a simple program (like hello world) running, but it fails every time on one of the nodes and never on the others. I don't think it's related to the program, since it's the simplest one you can write. All the nodes share the Open MPI install directory (through NFS) and all have the same profile.

Here is the runtime error I get:

mpirun -machinefile no -np 6 ~/hello.x
--------------------------------------------------------------------------
[[6735,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: node18

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
Hello world from process 3 of 6
Hello world from process 1 of 6
Hello world from process 4 of 6
Hello world from process 2 of 6
Hello world from process 5 of 6
Hello world from process 0 of 6
[node66:03997] *** Process received signal ***
[node66:03997] Signal: Segmentation fault (11)
[node66:03997] Signal code: Address not mapped (1)
[node66:03997] Failing at address: (nil)
[node66:03997] [ 0] /lib64/libpthread.so.0 [0x2b5e227a4fb0]
[node66:03997] [ 1] /usr/lib64/libibverbs.so.1(ibv_destroy_srq+0) [0x2b5e24ee0fa0]
[node66:03997] [ 2] /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_btl_openib.so [0x2b5e250eb2dd]
[node66:03997] [ 3] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(mca_btl_base_close+0x87) [0x2b5e21aa2a67]
[node66:03997] [ 4] /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_bml_r2.so [0x2b5e24cc39d2]
[node66:03997] [ 5] /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_pml_ob1.so [0x2b5e24aa2d0e]
[node66:03997] [ 6] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(mca_pml_base_finalize+0x1b) [0x2b5e21aacd2f]
[node66:03997] [ 7] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0 [0x2b5e21a66a7b]
[node66:03997] [ 8] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(MPI_Finalize+0x17) [0x2b5e21a84207]
[node66:03997] [ 9] /home/donald/hello.x(main+0x6d) [0x401bd5]
[node66:03997] [10] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b5e229cfb54]
[node66:03997] [11] /home/donald/hello.x [0x401ad9]
[node66:03997] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 3997 on node node66 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[node72:07895] 4 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[node72:07895] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Please advise.

Thanks and regards,
SB