Hi,

I am the admin of a small cluster (server running SLES 10.1, nodes
on OSS 10.3), and I have just installed Open MPI 1.3 on it.

I'm trying to get a simple program (like hello world) running, but it
always fails on one of the nodes and never on the others.

I don't think it's related to the program, since it's the simplest one
you can write.

All the nodes share the Open MPI install directory (through NFS) and
all have the same profile.

Here is the runtime error I get:
mpirun -machinefile no  -np 6 ~/hello.x
--------------------------------------------------------------------------
[[6735,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: node18

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
Hello world from process 3 of 6
Hello world from process 1 of 6
Hello world from process 4 of 6
Hello world from process 2 of 6
Hello world from process 5 of 6
Hello world from process 0 of 6
[node66:03997] *** Process received signal ***
[node66:03997] Signal: Segmentation fault (11)
[node66:03997] Signal code: Address not mapped (1)
[node66:03997] Failing at address: (nil)
[node66:03997] [ 0] /lib64/libpthread.so.0 [0x2b5e227a4fb0]
[node66:03997] [ 1] /usr/lib64/libibverbs.so.1(ibv_destroy_srq+0) [0x2b5e24ee0fa0]
[node66:03997] [ 2] /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_btl_openib.so [0x2b5e250eb2dd]
[node66:03997] [ 3] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(mca_btl_base_close+0x87) [0x2b5e21aa2a67]
[node66:03997] [ 4] /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_bml_r2.so [0x2b5e24cc39d2]
[node66:03997] [ 5] /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_pml_ob1.so [0x2b5e24aa2d0e]
[node66:03997] [ 6] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(mca_pml_base_finalize+0x1b) [0x2b5e21aacd2f]
[node66:03997] [ 7] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0 [0x2b5e21a66a7b]
[node66:03997] [ 8] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(MPI_Finalize+0x17) [0x2b5e21a84207]
[node66:03997] [ 9] /home/donald/hello.x(main+0x6d) [0x401bd5]
[node66:03997] [10] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b5e229cfb54]
[node66:03997] [11] /home/donald/hello.x [0x401ad9]
[node66:03997] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 3997 on node node66 exited
on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[node72:07895] 4 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[node72:07895] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages




Please advise,
Thanks and regards,
SB

