On Mar 23, 2007, at 1:58 PM, Walker, David T. wrote:
I am presently trying to get OpenMPI up and running on a small cluster
of MacPros (dual dual-core Xeons) using TCP. Open MPI was compiled using
the Intel Fortran Compiler (9.1) and gcc. When I try to launch a job on
a remote node, orted starts on the remote node but then times out. I am
guessing that the problem is SSH related. Any thoughts?
When I hear scenarios like this, the first thought that comes into my
head is: firewall issues. Open MPI requires the ability to open
random TCP ports from all hosts used in the MPI job. Have you
disabled the firewalls between all machines, or specifically allowed
those machines to open random TCP sockets between each other?
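
One rough way to check this independently of Open MPI is to open a listener on an OS-chosen port on one node and try to connect to it from another. The sketch below is only an illustration: the function names are made up for this example and nothing in it is part of Open MPI.

```python
import socket

def open_listener(bind_addr="0.0.0.0"):
    """Listen on an OS-chosen (ephemeral) TCP port and return the socket."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((bind_addr, 0))  # port 0 asks the OS for any free port
    srv.listen(1)
    return srv

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused, unreachable, and timed-out connections alike.
        return False
```

For example, call open_listener() on node01, read the port with srv.getsockname()[1], and then call can_connect("node01", port) from node03. If the connection fails in one direction while the listener is still open, a firewall or port filter between the two machines is a likely suspect.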
node01 1246% mpirun --debug-daemons -hostfile machinefile -np 5 Hello_World_Fortran
Calling MPI_INIT
Calling MPI_INIT
Calling MPI_INIT
Calling MPI_INIT
[node03:02422] [0,0,1]-[0,0,0] mca_oob_tcp_peer_send_blocking: send() failed with errno=57
This error message supports the above theory (that there is a
firewall / port blocking software package in the way) -- the remote
daemon tried to open a socket back to mpirun and failed.
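
For reference, errno numbers are platform-specific; on Mac OS X (Darwin), 57 is ENOTCONN, "Socket is not connected". The lookup below is a small sketch with a hand-copied subset of Darwin values, hard-coded so it gives the same answer on any machine. The table and function name are illustrative only and are not part of Open MPI.

```python
# Subset of Darwin (Mac OS X) errno values from <sys/errno.h>, hard-coded
# because errno numbering differs between operating systems.
DARWIN_ERRNO = {
    54: ("ECONNRESET", "Connection reset by peer"),
    57: ("ENOTCONN", "Socket is not connected"),
    60: ("ETIMEDOUT", "Operation timed out"),
    61: ("ECONNREFUSED", "Connection refused"),
    65: ("EHOSTUNREACH", "No route to host"),
}

def describe_darwin_errno(code):
    """Return a readable description of a Darwin errno number."""
    name, text = DARWIN_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return f"errno={code}: {name} ({text})"

print(describe_darwin_errno(57))  # errno=57: ENOTCONN (Socket is not connected)
```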
[node01.local:21427] ERROR: A daemon on node node03 failed to start as expected.
This is [effectively] mpirun reporting that it timed out while
waiting for the remote orted to call back and say "I'm here!".
Check your firewall / port blocking settings and see if disabling /
selectively allowing trust between your machines solves the issue.
--
Jeff Squyres
Cisco Systems