Hi there,

I am running jobs on clusters with an InfiniBand interconnect. OpenMPI v1.5.4 is
installed from the Red Hat 6 yum package. My problem is that although my jobs
get queued and started by PBS Pro quickly, most of the time they do not actually
run (occasionally they do) and report errors like this, even though plenty of
CPU/IB resources are available:

[r2i6n7][[25564,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
 connect() to 192.168.159.156 failed: Connection refused (111)
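
The refused address is on the eth0 subnet (192.168.159.x), and the message comes
from btl_tcp_endpoint.c, so the failing connection is going over the TCP BTL. A
minimal reachability check I can run between two nodes (the address is just the
one from the error above; the port is an arbitrary unprivileged one, and it
assumes nc is installed on the nodes):

-bash-4.1$ ping -c 3 192.168.159.156
-bash-4.1$ nc -zv 192.168.159.156 1024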

And even when a job does start and run well, it prints this warning:
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
  Local host:   r1i2n6
  Local device: mlx4_0
--------------------------------------------------------------------------
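
For what it's worth, here is a quick way to confirm on a node that the mlx4_0
HCA is visible and its port is active (standard OFED tools; I have not pasted
their output here):

-bash-4.1$ ibstat mlx4_0
-bash-4.1$ ibv_devinfo -d mlx4_0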

1. Here is the info from one of the compute nodes:
-bash-4.1$ /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 8C:89:A5:E3:D2:96
          inet addr:192.168.159.205  Bcast:192.168.159.255  Mask:255.255.255.0
          inet6 addr: fe80::8e89:a5ff:fee3:d296/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:48879864 errors:0 dropped:0 overruns:17 frame:0
          TX packets:39286060 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:54771093645 (51.0 GiB)  TX bytes:37512462596 (34.9 GiB)
          Memory:dfc00000-dfc20000

Ifconfig uses the ioctl access method to get the full address information, 
which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed 
correctly.
Ifconfig is obsolete! For replacement check ip.
ib0       Link encap:InfiniBand  HWaddr 
80:00:00:48:FE:C0:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:10.148.0.114  Bcast:10.148.255.255  Mask:255.255.0.0
          inet6 addr: fe80::202:c903:fb:3489/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:43807414 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10534050 errors:0 dropped:24 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:47824448125 (44.5 GiB)  TX bytes:44764010514 (41.6 GiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:17292 errors:0 dropped:0 overruns:0 frame:0
          TX packets:17292 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1492453 (1.4 MiB)  TX bytes:1492453 (1.4 MiB)

-bash-4.1$ chkconfig --list iptables
iptables        0:off   1:off   2:on    3:on    4:on    5:on    6:off
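
Since iptables is enabled at runlevels 3-5, the refused TCP connections could in
principle be the firewall blocking the ports OpenMPI picks; two purely
diagnostic commands for checking this (the first needs root):

-bash-4.1$ /sbin/iptables -L -n          # show the active filter rules
-bash-4.1$ ompi_info --param btl tcp     # list the TCP BTL parameters of this build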

2. I tried the various parameter settings below, but none of them reliably gets
my jobs initialized and running (one more combination I am considering is
sketched right after the mpirun line):
#TCP="--mca btl ^tcp"
#TCP="--mca btl self,openib"
#TCP="--mca btl_tcp_if_exclude lo"
#TCP="--mca btl_tcp_if_include eth0"
#TCP="--mca btl_tcp_if_include eth0, ib0"
#TCP="--mca btl_tcp_if_exclude 192.168.0.0/24,127.0.0.1/8 --mca 
oob_tcp_if_exclude 192.168.0.0/24,127.0.0.1/8"
#TCP="--mca btl_tcp_if_include 10.148.0.0/16"
mpirun $TCP -hostfile $PBS_NODEFILE -np 8 ./paraEllip3d input.txt
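
The untested combination mentioned above: drop TCP as a BTL entirely, keep only
the shared-memory and InfiniBand BTLs, and restrict the out-of-band channel to
the IPoIB interface (a sketch only, not yet tried; oob_tcp_if_include is assumed
to be the counterpart of the oob_tcp_if_exclude parameter I used above):

TCP="--mca btl self,sm,openib --mca oob_tcp_if_include ib0"
mpirun $TCP -hostfile $PBS_NODEFILE -np 8 ./paraEllip3d input.txt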

3. I then switched to Intel MPI, which, surprisingly, starts and runs my job
correctly every time (it is a bit slower than OpenMPI, maybe 15% slower, but it
works every time).

Can you please advise? Many thanks.

Sincerely,
Beichuan Yan

