
 I have upgraded to Open MPI 1.2.7 and am still seeing the same connect() failures.

 Kindly help.

> Are you able to upgrade to Open MPI v1.2.7?
> There were *many* bug fixes and changes in the 1.2 series compared to
> the 1.1 series; some, in particular, dealt with TCP socket
> timeouts (which matter when running large numbers of MPI
> processes).
> On Sep 8, 2008, at 4:36 PM, Prasanna Ranganathan wrote:
>> Hi,
>> I am trying to run a test mpiHelloWorld program that simply
>> initializes the MPI environment on all the nodes, prints the
>> hostname and rank of each node in the MPI process group and exits.
>> I am using Open MPI 1.1.2 and am running 997 processes on 499 nodes
>> (each node has two dual-core CPUs).
>> I get the following error messages when I run my program as follows:
>> mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
>> .....
>> .....
>> .....
>> [0,1,380][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] [0,1,142]
>> [btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>> [0,1,140][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] [0,1,390]
>> [btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with errno=113
>> connect() failed with errno=113connect() failed with errno=113connect() failed with errno=113[0,1,138][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with errno=113[0,1,384][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] [0,1,144]
>> [btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with errno=113
>> [0,1,388][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113[0,1,386][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> [0,1,139][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> connect() failed with errno=113
>> .....
>> .....
>> These errors appear in roughly 3 to 4 out of every 10 runs; the
>> remaining runs complete successfully. I have read the FAQ in detail
>> and checked the TCP BTL settings, but have not been able to find
>> the cause.
>> All the 499 nodes have only eth0 active and I get the error even
>> when I run the following: mpirun -np 997 -bynode -hostfile nodelist
>> --mca btl_tcp_if_include eth0 /main/mpiHelloWorld
>> I have attached the output of ompi_info --all.
>> The following is the output of /sbin/ifconfig on the node where I
>> start the mpi process (it is one of the 499 nodes)
>> eth0      Link encap:Ethernet  HWaddr 00:03:25:44:8F:D6
>>           inet addr:  Bcast:  Mask:
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:17
>>           TX packets:1767028063 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:580938897359 (554026.5 Mb)  TX bytes:689318600552 (657385.4 Mb)
>>           Interrupt:22 Base address:0xc000
>> lo        Link encap:Local Loopback
>>           inet addr:  Mask:
>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>           RX packets:70560 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:70560 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:0
>>           RX bytes:339687635 (323.9 Mb)  TX bytes:339687635 (323.9 Mb)
>> Kindly help.
>> Regards,
>> Prasanna.
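
A note on the error code in the log above: on Linux, errno 113 is normally EHOSTUNREACH ("No route to host"), which would mean the connecting process could not reach the peer's address at all, rather than being refused by it. The following is a minimal C sketch for checking how a given node's C library interprets that value. The helper names describe_errno and is_host_unreachable are hypothetical, chosen for illustration only; they are not part of Open MPI. The numeric value of EHOSTUNREACH is platform dependent, so the check compares against the symbolic constant.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Open MPI): returns 1 if the given
 * errno value is EHOSTUNREACH on this platform, 0 otherwise. */
int is_host_unreachable(int err)
{
    return err == EHOSTUNREACH;
}

/* Hypothetical helper: writes "errno=<n> (<message>)" into buf, using the
 * local C library's strerror() text. Returns the value from snprintf(). */
int describe_errno(int err, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "errno=%d (%s)", err, strerror(err));
}
```

For example, calling describe_errno(113, buf, sizeof buf) on one of the compute nodes and printing buf shows how that node names the error, and is_host_unreachable(113) tells whether it maps to EHOSTUNREACH there.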
