Hi,
I am having a problem running mpirun across multiple nodes.
To run a job across two 8-core nodes, I created a hostfile as follows:
yethiraj30 slots=8 max_slots=8
yethiraj31 slots=8 max_slots=8

The two machines are interconnected, and I have installed Open MPI 1.3.3.
I then try to run a replica exchange simulation with the following
command:
mpirun -np 16 --hostfile hostfile mdrun_4mpi -s topol_.tpr -multi 16 -replex 100 >& log_replica_test

However, I get the following error and the job does not proceed at all:
btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect] connect() to
192.168.0.31 failed: No route to host (113)

Here is the full output:

NNODES=16, MYRANK=0, HOSTNAME=yethiraj30
NNODES=16, MYRANK=1, HOSTNAME=yethiraj30
NNODES=16, MYRANK=4, HOSTNAME=yethiraj30
NNODES=16, MYRANK=2, HOSTNAME=yethiraj30
NNODES=16, MYRANK=6, HOSTNAME=yethiraj30
NNODES=16, MYRANK=3, HOSTNAME=yethiraj30
NNODES=16, MYRANK=5, HOSTNAME=yethiraj30
NNODES=16, MYRANK=7, HOSTNAME=yethiraj30
[yethiraj30][[22604,1],0][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.0.31 failed: No route to host (113)
[yethiraj30][[22604,1],4][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.0.31 failed: No route to host (113)
[yethiraj30][[22604,1],6][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.0.31 failed: No route to host (113)
[yethiraj30][[22604,1],1][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.0.31 failed: No route to host (113)
[yethiraj30][[22604,1],3][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.0.31 failed: No route to host (113)
[yethiraj30][[22604,1],2][btl_tcp_endpoint.c:636:mca_btl_tcp_endpoint_complete_connect]
connect() to 192.168.0.31 failed: No route to host (113)
NNODES=16, MYRANK=10, HOSTNAME=yethiraj31
NNODES=16, MYRANK=12, HOSTNAME=yethiraj31

I am not sure how to resolve this issue. I can ssh from one machine to the
other without any problem, but when I try to run Open MPI across both
machines, I get this error. Any help would be appreciated.
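
For reference, ssh only exercises TCP port 22, whereas the TCP BTL named in the error opens connections on other ports, so a working ssh login does not by itself show that those connections are permitted. Below is a minimal Python sketch of how a plain TCP connection between the two nodes could be tested independently of mpirun. The function names try_connect and explain, and whatever port number is passed in, are hypothetical choices for illustration and are not part of Open MPI or of my setup.

```python
import errno
import os
import socket


def try_connect(host, port, timeout=3.0):
    """Attempt one TCP connection; return (ok, errno_or_None, message)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None, "connected"
    except socket.timeout:
        # No reply at all within the timeout (packets may be silently dropped).
        return False, None, "timed out"
    except OSError as exc:
        # Includes EHOSTUNREACH ("No route to host") and ECONNREFUSED.
        return False, exc.errno, exc.strerror or str(exc)


def explain(err):
    """Give a rough, non-authoritative reading of a connect() errno."""
    if err is None:
        return "no errno reported (success or timeout)"
    if err == errno.EHOSTUNREACH:
        return "no route to host: routing problem or a firewall rejecting the packet"
    if err == errno.ECONNREFUSED:
        return "connection refused: host reachable, nothing listening on that port"
    return os.strerror(err)
```

Run on yethiraj30, a call such as try_connect("192.168.0.31", 5000) (port 5000 is an arbitrary example) followed by explain on the returned errno would distinguish the two cases: "connection refused" would suggest the address is reachable on that port, while "no route to host" would reproduce the failure from the log outside of Open MPI.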

Jagannath
