Tim Mattox wrote:
For your runs with Open MPI over InfiniBand, try using openib,sm,self
for the BTL setting, so that shared memory communications are used
within a node.  It would give us another datapoint to help diagnose
the problem.  As for other things we would need to help diagnose the
problem, please follow the advice on this FAQ entry, and the help page:
http://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot
http://www.open-mpi.org/community/help/
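
For example, the BTL list can be given on the mpirun command line or through the
corresponding MCA environment variable (the executable name below is only a
placeholder):

   mpirun -np 4 -mca btl openib,sm,self ./your_mpi_program
   # or, equivalently, via the environment:
   export OMPI_MCA_btl=openib,sm,self
   mpirun -np 4 ./your_mpi_program
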
Dear Tim,

thank you for this pointer.

1) OFED: it's 1.2.5, from the OpenFabrics website
2) Linux version: Scientific Linux (a RH Enterprise remaster), v. 4.2, kernel 2.6.9-55.0.12.ELsmp
3) Subnet manager: OpenSM
4) ibv_devinfo:
hca_id:    mthca0
   fw_ver:                1.0.800
   node_guid:            0002:c902:0022:b398
   sys_image_guid:            0002:c902:0022:b39b
   vendor_id:            0x02c9
   vendor_part_id:            25204
   hw_ver:                0xA0
   board_id:            MT_03B0120002
   phys_port_cnt:            1
       port:    1
           state:            PORT_ACTIVE (4)
           max_mtu:        2048 (4)
           active_mtu:        2048 (4)
           sm_lid:            9
           port_lid:        97
           port_lmc:        0x00

(no node differs from the others as far as the problem is concerned)

5) ifconfig:
eth0      Link encap:Ethernet  HWaddr 00:17:31:E3:89:4A
          inet addr:10.0.0.12  Bcast:10.0.0.255  Mask:255.255.255.0
         inet6 addr: fe80::217:31ff:fee3:894a/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
         RX packets:23348585 errors:0 dropped:0 overruns:0 frame:0
         TX packets:17247486 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:19410724189 (18.0 GiB)  TX bytes:14981325997 (13.9 GiB)
         Interrupt:209

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
         inet6 addr: ::1/128 Scope:Host
         UP LOOPBACK RUNNING  MTU:16436  Metric:1
         RX packets:5088 errors:0 dropped:0 overruns:0 frame:0
         TX packets:5088 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:0
         RX bytes:2468843 (2.3 MiB)  TX bytes:2468843 (2.3 MiB)

6) ulimit -l
8388608
(this is more than the physical memory on the node)
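
For reference, the usual way this limit is raised on RHEL-type systems is through
/etc/security/limits.conf; an illustrative example of such entries (not necessarily
our exact configuration) is:

   # /etc/security/limits.conf -- allow unlimited locked (registered) memory
   * soft memlock unlimited
   * hard memlock unlimited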

7) output of ompi_info attached (I have also tried earlier releases)

8) Description of the problem: the program seems to communicate correctly over the TCP network, but not over the InfiniBand network. The program is structured in such a way that if the communication does not happen, a loop becomes infinite. So there is no error message, just a program entering an infinite loop.
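
Schematically, the pattern that hangs can be reduced to a minimal point-to-point
test of the following shape (a sketch only, not the actual code; run on two ranks):

   /* Minimal ping-pong sketch: rank 0 sends an integer to rank 1 and waits
    * for a reply.  If the interconnect never delivers the message, both
    * ranks block forever in MPI_Recv -- the hang described above. */
   #include <mpi.h>
   #include <stdio.h>

   int main(int argc, char **argv)
   {
       int rank, value = 42;
       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       if (rank == 0) {
           MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
           MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
           printf("round trip completed\n");
       } else if (rank == 1) {
           MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
           MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
       }

       MPI_Finalize();
       return 0;
   }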

The command line I use is:

mpirun -mca btl  openib,sm,self  <executable>

(with openib replaced by tcp for communication over Ethernet).
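
To confirm which BTLs are actually selected at run time, I can also rerun with the
BTL verbosity raised, along the lines of (the verbosity level is arbitrary):

   mpirun -mca btl openib,sm,self -mca btl_base_verbose 30 <executable>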

I could include PATH and the value of LD_LIBRARY_PATH, but they would not tell you much, since the installation directory is non-standard (/opt/ompi128-intel/bin for the binaries and /opt/ompi128-intel/lib for the libraries).

I hope I have provided all the required information; if you need more, or any of it in greater detail, please let me know.

Many thanks,
Biagio Lucini
                Open MPI: 1.2.8
   Open MPI SVN revision: r19718
                Open RTE: 1.2.8
   Open RTE SVN revision: r19718
                    OPAL: 1.2.8
       OPAL SVN revision: r19718
                  Prefix: /opt/ompi128-intel
 Configured architecture: x86_64-unknown-linux-gnu
           Configured by: root
           Configured on: Tue Dec 23 12:33:51 GMT 2008
          Configure host: master.cluster
                Built by: root
                Built on: Tue Dec 23 12:38:34 GMT 2008
              Built host: master.cluster
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: icc
     C compiler absolute: /opt/intel/cce/9.1.045/bin/icc
            C++ compiler: icpc
   C++ compiler absolute: /opt/intel/cce/9.1.045/bin/icpc
      Fortran77 compiler: ifort
  Fortran77 compiler abs: /opt/intel/fce/9.1.040/bin/ifort
      Fortran90 compiler: ifort
  Fortran90 compiler abs: /opt/intel/fce/9.1.040/bin/ifort
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
  Internal debug support: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
   Heterogeneous support: yes
 mpirun default --prefix: no
           MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.8)
              MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.2.8)
           MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.8)
           MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.8)
           MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.2.8)
               MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.8)
         MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.8)
         MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.8)
           MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
           MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
                MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.8)
                MCA coll: self (MCA v1.0, API v1.0, Component v1.2.8)
                MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.8)
                MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.8)
                  MCA io: romio (MCA v1.0, API v1.0, Component v1.2.8)
               MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.8)
               MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.8)
                 MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.8)
                 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.8)
                 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.8)
              MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.8)
                 MCA btl: openib (MCA v1.0, API v1.0.1, Component v1.2.8)
                 MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.8)
                 MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.8)
                 MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
                MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.8)
                 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.8)
              MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.8)
              MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.8)
              MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.8)
                 MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.8)
                 MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.8)
                 MCA gpr: replica (MCA v1.0, API v1.0, Component v1.2.8)
                 MCA iof: proxy (MCA v1.0, API v1.0, Component v1.2.8)
                 MCA iof: svc (MCA v1.0, API v1.0, Component v1.2.8)
                  MCA ns: proxy (MCA v1.0, API v2.0, Component v1.2.8)
                  MCA ns: replica (MCA v1.0, API v2.0, Component v1.2.8)
                 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
                 MCA ras: dash_host (MCA v1.0, API v1.3, Component v1.2.8)
                 MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
                 MCA ras: localhost (MCA v1.0, API v1.3, Component v1.2.8)
                 MCA ras: slurm (MCA v1.0, API v1.3, Component v1.2.8)
                 MCA rds: hostfile (MCA v1.0, API v1.3, Component v1.2.8)
                 MCA rds: proxy (MCA v1.0, API v1.3, Component v1.2.8)
                 MCA rds: resfile (MCA v1.0, API v1.3, Component v1.2.8)
               MCA rmaps: round_robin (MCA v1.0, API v1.3, Component v1.2.8)
                MCA rmgr: proxy (MCA v1.0, API v2.0, Component v1.2.8)
                MCA rmgr: urm (MCA v1.0, API v2.0, Component v1.2.8)
                 MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.8)
                 MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.8)
                 MCA pls: proxy (MCA v1.0, API v1.3, Component v1.2.8)
                 MCA pls: rsh (MCA v1.0, API v1.3, Component v1.2.8)
                 MCA pls: slurm (MCA v1.0, API v1.3, Component v1.2.8)
                 MCA sds: env (MCA v1.0, API v1.0, Component v1.2.8)
                 MCA sds: pipe (MCA v1.0, API v1.0, Component v1.2.8)
                 MCA sds: seed (MCA v1.0, API v1.0, Component v1.2.8)
                 MCA sds: singleton (MCA v1.0, API v1.0, Component v1.2.8)
                 MCA sds: slurm (MCA v1.0, API v1.0, Component v1.2.8)
