Hi,

 I have upgraded to Open MPI 1.2.7 and am still seeing the same "No route to host" (connect() errno=113) failures described below.

 Kindly help.

> 
> Date: Mon, 8 Sep 2008 16:43:33 -0400
> From: Jeff Squyres <jsquy...@cisco.com>
> Subject: Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2
> To: Open MPI Users <us...@open-mpi.org>
> 
> Are you able to upgrade to Open MPI v1.2.7?
> 
> There were *many* bug fixes and changes in the 1.2 series compared to
> the 1.1 series; some of them, in particular, dealt with TCP socket
> timeouts (which are important when dealing with large numbers of MPI
> processes).
> 
> 
> 
> On Sep 8, 2008, at 4:36 PM, Prasanna Ranganathan wrote:
> 
>> Hi,
>> 
>> I am trying to run a test mpiHelloWorld program that simply
>> initializes the MPI environment on all the nodes, prints the
>> hostname and rank of each process, and exits.
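>> 
>> (For reference, the program is essentially a minimal MPI hello-world
>> along the following lines; this is just a sketch of that kind of
>> program, not the exact source being run:)
>> 
>>   #include <mpi.h>
>>   #include <stdio.h>
>> 
>>   int main(int argc, char **argv)
>>   {
>>       int rank, size, len;
>>       char host[MPI_MAX_PROCESSOR_NAME];
>> 
>>       MPI_Init(&argc, &argv);                /* initialize the MPI environment    */
>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank of this process              */
>>       MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes         */
>>       MPI_Get_processor_name(host, &len);    /* hostname this process runs on     */
>> 
>>       printf("Hello from rank %d of %d on %s\n", rank, size, host);
>> 
>>       MPI_Finalize();
>>       return 0;
>>   }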
>> 
>> I am using Open MPI 1.1.2 and am running 997 processes on 499 nodes
>> (each node has two dual-core CPUs).
>> 
>> I get the following error messages when I run my program as follows:
>> mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
>> .....
>> .....
>> .....
>> [0,1,380][btl_tcp_endpoint.c:
>> 572:mca_btl_tcp_endpoint_complete_connect] [0,1,142]
>> [btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>> [0,1,140][btl_tcp_endpoint.c:
>> 572:mca_btl_tcp_endpoint_complete_connect] [0,1,390]
>> [btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with errno=113
>> connect() failed with errno=113connect() failed with
>> errno=113connect() failed with errno=113[0,1,138][btl_tcp_endpoint.c:
>> 572:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with errno=113[0,1,384][btl_tcp_endpoint.c:
>> 572:mca_btl_tcp_endpoint_complete_connect] [0,1,144]
>> [btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with errno=113
>> [0,1,388][btl_tcp_endpoint.c:
>> 572:mca_btl_tcp_endpoint_complete_connect] connect() failed with
>> errno=113[0,1,386][btl_tcp_endpoint.c:
>> 572:mca_btl_tcp_endpoint_complete_connect] connect() failed with
>> errno=113
>> [0,1,139][btl_tcp_endpoint.c:
>> 572:mca_btl_tcp_endpoint_complete_connect] connect() failed with
>> errno=113
>> connect() failed with errno=113
>> .....
>> .....
>> 
>> The main thing is that I get these error messages on around 3-4 out
>> of 10 attempts, with the rest all completing successfully. I have
>> gone through the FAQ in detail and also checked the TCP BTL
>> settings, but am not able to figure it out.
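>> 
>> (For what it's worth, errno 113 on Linux appears to be EHOSTUNREACH,
>> i.e. "No route to host"; a trivial check along these lines confirms
>> the mapping:)
>> 
>>   #include <errno.h>
>>   #include <stdio.h>
>>   #include <string.h>
>> 
>>   int main(void)
>>   {
>>       /* On Linux, errno 113 is EHOSTUNREACH ("No route to host"). */
>>       printf("errno %d -> %s\n", EHOSTUNREACH, strerror(EHOSTUNREACH));
>>       return 0;
>>   }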
>> 
>> All the 499 nodes have only eth0 active and I get the error even
>> when I run the following: mpirun -np 997 -bynode --hostfile nodelist
>> --mca btl_tcp_if_include eth0 /main/mpiHelloWorld
>> 
>> I have attached the output of ompi_info --all.
>> 
>> The following is the output of /sbin/ifconfig on the node where I
>> start the mpi process (it is one of the 499 nodes)
>> 
>> eth0      Link encap:Ethernet  HWaddr 00:03:25:44:8F:D6
>>           inet addr:10.12.1.11  Bcast:10.12.255.255  Mask:255.255.0.0
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:17
>>           TX packets:1767028063 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:580938897359 (554026.5 Mb)  TX bytes:689318600552 (657385.4 Mb)
>>           Interrupt:22 Base address:0xc000
>> 
>> lo        Link encap:Local Loopback
>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>           RX packets:70560 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:70560 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:0
>>           RX bytes:339687635 (323.9 Mb)  TX bytes:339687635 (323.9 Mb)
>> 
>> 
>> Kindly help.
>> 
>> Regards,
>> 
>> Prasanna.
>> 
>> <ompi_info.rtf>
