Tim Mattox wrote:
> For your runs with Open MPI over InfiniBand, try using openib,sm,self
> for the BTL setting, so that shared-memory communication is used
> within a node. It would give us another data point to help diagnose
> the problem. As for the other information we would need, please
> follow the advice in this FAQ entry and on the help page:
> http://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot
> http://www.open-mpi.org/community/help/
Dear Tim,

thank you for these pointers. Here is the requested information:
1) OFED: 1.2.5, from the OpenFabrics website
2) Linux version: Scientific Linux (a Red Hat Enterprise Linux rebuild) 4.2,
kernel 2.6.9-55.0.12.ELsmp
3) Subnet manager: OpenSM
4) ibv_devinfo:
hca_id: mthca0
        fw_ver:            1.0.800
        node_guid:         0002:c902:0022:b398
        sys_image_guid:    0002:c902:0022:b39b
        vendor_id:         0x02c9
        vendor_part_id:    25204
        hw_ver:            0xA0
        board_id:          MT_03B0120002
        phys_port_cnt:     1
        port:   1
                state:         PORT_ACTIVE (4)
                max_mtu:       2048 (4)
                active_mtu:    2048 (4)
                sm_lid:        9
                port_lid:      97
                port_lmc:      0x00
(no node differs from the others, as far as the problem is concerned)
5) ifconfig:
eth0      Link encap:Ethernet  HWaddr 00:17:31:E3:89:4A
          inet addr:10.0.0.12  Bcast:10.0.0.255  Mask:255.255.255.0
          inet6 addr: fe80::217:31ff:fee3:894a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:23348585 errors:0 dropped:0 overruns:0 frame:0
          TX packets:17247486 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:19410724189 (18.0 GiB)  TX bytes:14981325997 (13.9 GiB)
          Interrupt:209

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:5088 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5088 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2468843 (2.3 MiB)  TX bytes:2468843 (2.3 MiB)
6) ulimit -l
8388608
(this is more than the physical memory on the node)
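(If useful, I can also check that the same limit is seen by
non-interactive shells on the compute nodes, since, if I understand the
FAQ correctly, that is the limit the openib BTL actually runs under;
something along the lines of

mpirun -np 1 -host <some-compute-node> bash -c 'ulimit -l'

should show it, where <some-compute-node> is a placeholder for one of
the nodes.)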
7) output of ompi_info: appended at the end of this message (I have
also tried earlier releases)
8) description of the problem: a program seems to communicate correctly
over the TCP network, but not over the InfiniBand network. The program
is structured in such a way that if the communication does not happen,
a loop becomes infinite; so there is no error message, just a program
entering an infinite loop.
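To illustrate the pattern (this is not the actual code, just a minimal
sketch in C of the communication structure involved): one rank spins on
a test for a message from another rank, so an undelivered message shows
up as an endless loop rather than as an error:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, flag = 0, token = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Request req;
            MPI_Irecv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
            /* Spin until the message arrives: if the transport never
               delivers it, this loop never exits and no error is
               reported. */
            while (!flag)
                MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }

A sketch like this would be run with at least two ranks, e.g.
mpirun -np 2 -mca btl openib,sm,self ./a.out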
The command line I use is

mpirun -mca btl openib,sm,self <executable>

(with openib replaced by tcp in the case of communication over Ethernet).
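That is, the Ethernet runs are launched as

mpirun -mca btl tcp,sm,self <executable>

and those complete correctly, while the openib runs hang as described
above.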
I could include PATH and the value of LD_LIBRARY_PATH, but they would
not say much, since the installation directory is non-standard
(/opt/ompi128-intel/bin for the binaries and /opt/ompi128-intel/lib for
the libraries).
I hope I have provided all the required information; if you need more,
or some of it in greater detail, please let me know.
Many thanks,
Biagio Lucini
Output of ompi_info:

Open MPI: 1.2.8
Open MPI SVN revision: r19718
Open RTE: 1.2.8
Open RTE SVN revision: r19718
OPAL: 1.2.8
OPAL SVN revision: r19718
Prefix: /opt/ompi128-intel
Configured architecture: x86_64-unknown-linux-gnu
Configured by: root
Configured on: Tue Dec 23 12:33:51 GMT 2008
Configure host: master.cluster
Built by: root
Built on: Tue Dec 23 12:38:34 GMT 2008
Built host: master.cluster
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (all)
Fortran90 bindings: yes
Fortran90 bindings size: small
C compiler: icc
C compiler absolute: /opt/intel/cce/9.1.045/bin/icc
C++ compiler: icpc
C++ compiler absolute: /opt/intel/cce/9.1.045/bin/icpc
Fortran77 compiler: ifort
Fortran77 compiler abs: /opt/intel/fce/9.1.040/bin/ifort
Fortran90 compiler: ifort
Fortran90 compiler abs: /opt/intel/fce/9.1.040/bin/ifort
C profiling: yes
C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: yes
C++ exceptions: no
Thread support: posix (mpi: no, progress: no)
Internal debug support: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: yes
Heterogeneous support: yes
mpirun default --prefix: no
MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.8)
MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.8)
MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.8)
MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.8)
MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.2.8)
MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.8)
MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.8)
MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.8)
MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.8)
MCA coll: self (MCA v1.0, API v1.0, Component v1.2.8)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.8)
MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.8)
MCA io: romio (MCA v1.0, API v1.0, Component v1.2.8)
MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.8)
MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.8)
MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.8)
MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.8)
MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.8)
MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.8)
MCA btl: openib (MCA v1.0, API v1.0.1, Component v1.2.8)
MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.8)
MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.8)
MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.8)
MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.8)
MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.8)
MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.8)
MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.8)
MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.8)
MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.8)
MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.8)
MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.8)
MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.8)
MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.8)
MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.8)
MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.8)
MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.8)
MCA ras: slurm (MCA v1.0, API v1.3, Component v1.2.8)
MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.8)
MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.8)
MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2.8)
MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2.8)
MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.2.8)
MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2.8)
MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.8)
MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2.8)
MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.8)
MCA pls: slurm (MCA v1.0, API v1.3, Component v1.2.8)
MCA sds: env (MCA v1.0, API v1.0, Component v1.2.8)
MCA sds: pipe (MCA v1.0, API v1.0, Component v1.2.8)
MCA sds: seed (MCA v1.0, API v1.0, Component v1.2.8)
MCA sds: singleton (MCA v1.0, API v1.0, Component v1.2.8)
MCA sds: slurm (MCA v1.0, API v1.0, Component v1.2.8)