Hi,
I am trying to run a test mpiHelloWorld program that simply
initializes the MPI environment on every node, prints the
hostname and rank of each process in the MPI job, and exits.
I am using Open MPI 1.1.2 and am running 997 processes on 499 nodes
(each node has two dual-core CPUs).
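For reference, the test program is essentially the following minimal
sketch (my actual source may differ in small details, but this is all
it does):

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        char hostname[256];
        int rank, size;

        MPI_Init(&argc, &argv);                  /* initialize the MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* rank of this process           */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes      */
        gethostname(hostname, sizeof(hostname)); /* hostname of the node           */

        printf("Hello from rank %d of %d on %s\n", rank, size, hostname);

        MPI_Finalize();                          /* clean shutdown, then exit      */
        return 0;
    }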
I get the following error messages when I launch the program like
this:
mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
.....
.....
.....
[0,1,380][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,142][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,140][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,390][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,138][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,384][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,144][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,388][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,386][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[0,1,139][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
.....
.....
The main thing is that I get these error messages on roughly 3-4 out
of 10 attempts; the remaining runs all complete successfully. (errno
113 is EHOSTUNREACH, "No route to host".) I have gone through the
FAQs in detail and also checked the TCP BTL settings, but I have not
been able to figure it out.
All 499 nodes have only eth0 active, and I get the error even
when I run the following:
mpirun -np 997 -bynode --hostfile nodelist --mca btl_tcp_if_include eth0 /main/mpiHelloWorld
I have attached the output of ompi_info --all.
The following is the output of /sbin/ifconfig on the node from which
I launch mpirun (it is one of the 499 nodes):
eth0      Link encap:Ethernet  HWaddr 00:03:25:44:8F:D6
          inet addr:10.12.1.11  Bcast:10.12.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:17
          TX packets:1767028063 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:580938897359 (554026.5 Mb)  TX bytes:689318600552 (657385.4 Mb)
          Interrupt:22 Base address:0xc000
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:70560 errors:0 dropped:0 overruns:0 frame:0
          TX packets:70560 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:339687635 (323.9 Mb)  TX bytes:339687635 (323.9 Mb)
Kindly help.
Regards,
Prasanna.
<ompi_info.rtf>