Are you able to upgrade to Open MPI v1.2.7?

There were *many* bug fixes and changes in the 1.2 series compared to the 1.1 series; in particular, some of them dealt with TCP socket timeouts, which matter when running large numbers of MPI processes.
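(As a sanity check, one way to confirm which Open MPI version an application was actually compiled against is to print the version macros from Open MPI's mpi.h -- this assumes the OMPI_*_VERSION macros are present in your installation's mpi.h:)

    #include <mpi.h>
    #include <stdio.h>

    int main(void)
    {
        /* OMPI_*_VERSION are defined in Open MPI's mpi.h */
        printf("Compiled against Open MPI %d.%d.%d\n",
               OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION);
        return 0;
    }

ompi_info also reports the version of the runtime you are launching with.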



On Sep 8, 2008, at 4:36 PM, Prasanna Ranganathan wrote:

Hi,

I am trying to run a test mpiHelloWorld program that simply initializes the MPI environment, prints the hostname and rank of each process in the MPI job, and exits.
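(For reference, a minimal sketch of such a program; this is illustrative only, not the actual mpiHelloWorld source:)

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        char hostname[MPI_MAX_PROCESSOR_NAME];
        int rank, size, len;

        MPI_Init(&argc, &argv);                 /* initialize the MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
        MPI_Get_processor_name(hostname, &len); /* hostname of this node */

        printf("Hello from %s, rank %d of %d\n", hostname, rank, size);

        MPI_Finalize();
        return 0;
    }

It is compiled with mpicc and launched with mpirun as shown below.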

I am using Open MPI 1.1.2 and am running 997 processes on 499 nodes (each node has 2 dual-core CPUs).

I get the following error messages when I run my program as follows: mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
.....
.....
.....
[0,1,380][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,142][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,140][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,390][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,138][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,384][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,144][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,388][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,386][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,139][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
.....
.....
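
(For reference: on Linux, errno 113 is EHOSTUNREACH, i.e. "No route to host". A quick check, assuming a Linux/glibc system:)

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* On Linux, EHOSTUNREACH has the value 113 */
        printf("errno %d: %s\n", EHOSTUNREACH, strerror(EHOSTUNREACH));
        return 0;
    }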

The main thing is that I get these error messages on roughly 3-4 out of 10 attempts, with the remaining runs completing successfully. I have gone through the FAQs in detail and checked the TCP BTL settings, but I am not able to figure it out.

All 499 nodes have only eth0 active, and I get the error even when I run the following: mpirun -np 997 -bynode -hostfile nodelist --mca btl_tcp_if_include eth0 /main/mpiHelloWorld

I have attached the output of ompi_info --all.

The following is the output of /sbin/ifconfig on the node from which I start the MPI job (it is one of the 499 nodes):

eth0      Link encap:Ethernet  HWaddr 00:03:25:44:8F:D6
          inet addr:10.12.1.11  Bcast:10.12.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:17
          TX packets:1767028063 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:580938897359 (554026.5 Mb)  TX bytes:689318600552 (657385.4 Mb)
          Interrupt:22 Base address:0xc000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:70560 errors:0 dropped:0 overruns:0 frame:0
          TX packets:70560 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:339687635 (323.9 Mb)  TX bytes:339687635 (323.9 Mb)


Kindly help.

Regards,

Prasanna.

<ompi_info.rtf>


--
Jeff Squyres
Cisco Systems

