On Mar 23, 2007, at 1:58 PM, Walker, David T. wrote:

I am presently trying to get Open MPI up and running on a small cluster
of MacPros (dual dual-core Xeons) using TCP. Open MPI was compiled using the Intel Fortran Compiler (9.1) and gcc. When I try to launch a job on a remote node, orted starts on the remote node but then times out. I am
guessing that the problem is SSH-related. Any thoughts?

When I hear scenarios like this, the first thought that comes into my head is: firewall issues. Open MPI requires the ability to open random TCP ports from all hosts used in the MPI job. Have you disabled the firewalls between all machines, or specifically allowed those machines to open random TCP sockets between each other?
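As an illustration of what "random TCP ports" means here: Open MPI's daemons listen on ephemeral ports chosen by the kernel and connect back to mpirun on similar ports. The following is a generic connectivity sketch (not an Open MPI tool; the function names are made up for this example) that you can use to test whether such connections get through between two nodes.

```python
# Hypothetical firewall check: mimics the ephemeral-port TCP connections
# that Open MPI's daemons make. Run the listener half on one node, then
# call try_connect() against it from another node.
import socket

def ephemeral_listener(host="0.0.0.0"):
    """Listen on a kernel-assigned (random) TCP port, as orted does."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))      # port 0 = let the OS pick an ephemeral port
    srv.listen(1)
    return srv, srv.getsockname()[1]

def try_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    srv, port = ephemeral_listener()
    print("listening on port", port)
    # From another node, run: try_connect("node01", port)
    print("loopback reachable:", try_connect("127.0.0.1", port))
    srv.close()
```

If the loopback test succeeds but the cross-node test does not, something between the hosts (a firewall or port filter) is blocking the connection.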

node01 1246% mpirun --debug-daemons -hostfile machinefile -np 5
Hello_World_Fortran
 Calling MPI_INIT
 Calling MPI_INIT
 Calling MPI_INIT
 Calling MPI_INIT
[node03:02422] [0,0,1]-[0,0,0] mca_oob_tcp_peer_send_blocking: send()
failed with errno=57

This error message supports the above theory (that there is a firewall / port blocking software package in the way) -- the remote daemon tried to open a socket back to mpirun and failed.
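For reference, errno values are platform-specific; on Mac OS X (BSD-derived), errno 57 is ENOTCONN ("Socket is not connected"), which is consistent with a send() on a socket whose connection was never established. A quick, generic way to decode an errno on the machine that produced it (a Python sketch, not part of Open MPI):

```python
# Decode a numeric errno into its symbolic name and message on this
# platform. Note: the number-to-name mapping differs across operating
# systems, so decode the errno on the same OS that reported it.
import errno
import os

def describe_errno(num):
    """Return (symbolic name, human-readable message) for an errno."""
    name = errno.errorcode.get(num, "UNKNOWN")
    return name, os.strerror(num)

if __name__ == "__main__":
    # On Mac OS X, describe_errno(57) reports ENOTCONN; the same number
    # may map to a different error on other platforms.
    print(describe_errno(57))
```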

[node01.local:21427] ERROR: A daemon on node node03 failed to start as
expected.

This is [effectively] mpirun reporting that it timed out while waiting for the remote orted to call back and say "I'm here!".

Check your firewall / port blocking settings and see if disabling / selectively allowing trust between your machines solves the issue.

--
Jeff Squyres
Cisco Systems
