Hi, I hope this is the right forum for my questions.  I am running into
a problem when scaling beyond 512 cores on an InfiniBand cluster which has
14,336 cores. I am new to Open MPI and am trying to figure out the right
-mca options to pass to avoid the "mca_oob_tcp_peer_complete_connect:
connection failed:" error on a cluster which has InfiniBand HCAs and the
OFED v1.3GA release.  Other MPI implementations such as Intel MPI and
MVAPICH work fine using the uDAPL or verbs IB layers for MPI communications.

I find it difficult to understand which network interface or IB layer
is being used. When I explicitly tell Open MPI not to use the eth0, lo,
ib1, or ib1:0 interfaces with the command-line option "-mca
oob_tcp_exclude", it continues to probe these interfaces.  For all MPI
traffic Open MPI should use ib0, which is the 10.148 network. But with
debugging enabled I see references to the 10.149 network, which is ib1.
Below is the ifconfig network device output for a compute node.


1. Is there a way to determine which network device is being used and to
prevent Open MPI from falling back to another device? With Intel MPI or
HP MPI you can state that no fallback device should be used.  I thought
"-mca oob_tcp_exclude" would be the correct option to pass, but I may be
wrong.

2. How can I determine whether the InfiniBand openib device is actually
being used? When running an MPI app I continue to see the in/out packet
counters at the TCP level increasing, when it should be using the IB RDMA
device for all MPI comms over the ib0 or mthca0 device. Open MPI was
bundled with OFED v1.3, so I am assuming the openib interface should
work.  Running ompi_info shows btl_open_* references.

/usr/mpi/openmpi-1.2-2/intel/bin/mpiexec \
    -mca btl_openib_warn_default_gid_prefix 0 \
    -mca oob_tcp_exclude eth0,lo,ib1,ib1:0 \
    -mca btl openib,sm,self \
    -machinefile mpd.hosts.$$ -np 1024 ~/bin/test_ompi < input1

3. When trying to avoid the "mca_oob_tcp_peer_complete_connect:
connection failed:" message I tried using "-mca btl openib,sm,self" and
"-mca btl ^tcp", but I still get these error messages.  Even with
"-mca btl openib,sm,self", Open MPI will retry over the ib1
(10.149 net) fabric to establish a connection with a node.  What are my
options to avoid these connection failed messages?  I suspect Open MPI is
overflowing the TCP buffers on the clients because of the large core
count of this job, since I see lots of TCP buffer errors in the
netstat -s output. I reviewed all of the online FAQs and I am not sure
what options to pass to get around this issue.
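
For question 2, one rough way to see which IP interface is carrying
traffic during a run is to sample the per-interface packet counters
before and after the job. This is only a sketch under assumptions:
iface_packets and packet_delta are hypothetical helper names, and the
parsing assumes the usual Linux /proc/net/dev layout (RX packets is the
2nd value and TX packets the 10th value after the interface name).
Native verbs/RDMA traffic bypasses the IP stack and does not show up in
these counters, so growth on ib0 here would point to IPoIB (TCP) traffic
rather than openib BTL traffic.

```shell
#!/bin/sh
# Print "RX_packets TX_packets" for one interface.
# $1 = interface name (e.g. ib0), $2 = counters file (default: /proc/net/dev).
iface_packets() {
    awk -v ifc="$1" '
        { sub(/:/, " ") }            # some kernels print "eth0:123" with no space
        $1 == ifc { print $3, $11 }  # $3 = RX packets, $11 = TX packets
    ' "${2:-/proc/net/dev}"
}

# Print the RX and TX packet deltas between two "RX TX" samples.
# $1 $2 = RX/TX before, $3 $4 = RX/TX after.
packet_delta() {
    echo "$(( $3 - $1 )) $(( $4 - $2 ))"
}

# Example: sample ib0, run the job, sample again.
# before=$(iface_packets ib0)
# ... run mpiexec here ...
# after=$(iface_packets ib0)
# packet_delta $before $after
```

Running the same comparison for eth0 and ib1 on a compute node would show
whether any of the excluded interfaces still see traffic during the job.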

OBTW, I did check the
/usr/mpi/openmpi-1.2-2/intel/etc/openmpi-mca-params.conf file and no
defaults are being specified.
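
The check of the parameter file can be scripted. As a sketch only:
active_mca_params is a hypothetical helper that prints the non-comment,
non-blank lines of an MCA parameter file, so an empty result means no
defaults are set there. Open MPI also reads a per-user file,
$HOME/.openmpi/mca-params.conf, and OMPI_MCA_* environment variables, so
the commented examples check those too.

```shell
#!/bin/sh
# Print active (non-comment, non-blank) settings from an MCA parameter file.
# $1 = path to the file; prints nothing if the file is unreadable or has
# no active settings.
active_mca_params() {
    [ -r "$1" ] || return 0
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1" || true
}

# Example: the system-wide file from this install, the per-user file,
# and any MCA parameters set through the environment.
# active_mca_params /usr/mpi/openmpi-1.2-2/intel/etc/openmpi-mca-params.conf
# active_mca_params "$HOME/.openmpi/mca-params.conf"
# env | grep '^OMPI_MCA_'
```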


                Open MPI: 1.2.2
   Open MPI SVN revision: r14613
                Open RTE: 1.2.2
   Open RTE SVN revision: r14613
                    OPAL: 1.2.2
       OPAL SVN revision: r14613
                  Prefix: /usr/mpi/openmpi-1.2-2/intel
 Configured architecture: x86_64-suse-linux-gnu


Following is the cluster configuration:
1792 nodes with 8 cores per node = 14336 cores
Ofed Rel: OFED-1.3-rc1
IB Device(s): mthca0 FW=1.2.0 Rate=20 Gb/sec (4X DDR)
              mthca1 FW=1.2.0 Rate=20 Gb/sec (4X DDR)
Processors: 2 x 4 Cores Intel(R) Xeon(R) CPU X5365 @ 3.00GHz 8192KB Cache FSB:1333MHz
Total Mem: 16342776 KB    
OS Release: SUSE Linux Enterprise Server 10 (x86_64) VERSION = 10 SP1 
Kernel Ver:


Ifconfig output:
eth0      Link encap:Ethernet  HWaddr 00:30:48:7B:A7:AC  
          inet addr:  Bcast:
          inet6 addr: fe80::230:48ff:fe7b:a7ac/64 Scope:Link
          RX packets:1215826 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1342035 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:787514337 (751.0 Mb)  TX bytes:170968505 (163.0 Mb)
          Base address:0x2000 Memory:dfa00000-dfa20000 

ib0       Link encap:UNSPEC  HWaddr
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::230:487b:a7ac:1/64 Scope:Link
          RX packets:20823896 errors:0 dropped:0 overruns:0 frame:0
          TX packets:19276836 errors:0 dropped:42 overruns:0 carrier:0
          collisions:0 txqueuelen:256 
          RX bytes:176581223103 (168400.9 Mb)  TX bytes:182691213682 (174227.9 Mb)

ib1       Link encap:UNSPEC  HWaddr
          inet addr:  Bcast:
          inet6 addr: fe80::230:487b:a7ad:1/64 Scope:Link
          RX packets:175609 errors:0 dropped:0 overruns:0 frame:0
          TX packets:31175 errors:0 dropped:6 overruns:0 carrier:0
          collisions:0 txqueuelen:256 
          RX bytes:139196236 (132.7 Mb)  TX bytes:4515680 (4.3 Mb)

ib1:0     Link encap:UNSPEC  HWaddr
          inet addr:  Bcast:  Mask:

lo        Link encap:Local Loopback  
          inet addr:  Mask:
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:30554 errors:0 dropped:0 overruns:0 frame:0
          TX packets:30554 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:54170543 (51.6 Mb)  TX bytes:54170543 (51.6 Mb)


Ibstatus output:
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0030:487c:04b4:0001
        base lid:        0x4fb
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)

Infiniband device 'mthca1' port 1 status:
        default gid:     fe80:0000:0000:0000:0030:487c:04b5:0001
        base lid:        0x50c
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)


Thanks in advance,

