Ethan:

Can you run just "hostname" successfully, i.e., a non-MPI program? If that does not work, then we know the problem is in the runtime. If it does work, then there is something wrong with the way the MPI library is setting up its connections.
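
A minimal sketch of that check, assuming the same hosts as in the failing runs quoted below. The helper name runtime_check_cmd is made up for illustration; it only prints the command so it can be reviewed before it is run.

```shell
# Hypothetical helper (not part of Open MPI): print, but do not run,
# an mpirun command that launches the non-MPI program "hostname".
# If the printed command hangs when executed, the problem is in the
# runtime/launch layer rather than in MPI communication.
runtime_check_cmd() {
    hosts=$1   # comma-separated host list, e.g. merope,electra,atlas
    np=$2      # number of processes to launch
    printf 'mpirun -host %s -np %s hostname\n' "$hosts" "$np"
}

# Host names are taken from this thread; substitute your own.
runtime_check_cmd merope,electra,atlas 3
```

To execute it instead of printing it, something like eval "$(runtime_check_cmd merope,electra,atlas 3)" should do.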

Is there more than one interface on the nodes?
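
If there is more than one, one thing to try is pinning the TCP transport to a single interface with the btl_tcp_if_include MCA parameter. The sketch below only prints such a command. The interface name eth0 and the helper name tcp_if_cmd are assumptions for illustration, not something taken from your setup.

```shell
# Hypothetical helper: print an mpirun command that restricts Open MPI's
# TCP BTL to one network interface. List the interfaces on each node
# first (for example with "ip addr" or "ifconfig") and pick one that
# every node shares.
tcp_if_cmd() {
    iface=$1   # interface name; eth0 below is only an assumed example
    hosts=$2   # comma-separated host list
    np=$3      # number of processes
    prog=$4    # program to launch
    printf 'mpirun --mca btl_tcp_if_include %s -host %s -np %s %s\n' \
        "$iface" "$hosts" "$np" "$prog"
}

tcp_if_cmd eth0 merope,electra,atlas 4 ./test.out
```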

Rolf

On 09/21/10 14:41, Ethan Deneault wrote:
Prentice Bisbal wrote:


I'm assuming you already tested ssh connectivity and verified everything
is working as it should. (You did test all that, right?)

Yes. I am able to log in remotely to all nodes from the master, and to each node from each node without a password. Each node mounts the same /home directory from the master, so they have the same copy of all the ssh and rsh keys.
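
For reference, the node-to-node check can be scripted roughly as follows. This is a sketch, not the exact commands I ran: the helper name pairwise_ssh_cmds is made up, and it only prints one ssh command per ordered pair of hosts. BatchMode=yes makes ssh fail instead of prompting if a password would be needed.

```shell
# Hypothetical helper: for every ordered pair (src, dst) of distinct
# hosts, print a command that logs in to src and, from there, runs
# "hostname" on dst without allowing a password prompt.
pairwise_ssh_cmds() {
    for src in "$@"; do
        for dst in "$@"; do
            # Skip the pair where source and destination are the same host.
            [ "$src" = "$dst" ] && continue
            printf 'ssh -o BatchMode=yes %s ssh -o BatchMode=yes %s hostname\n' \
                "$src" "$dst"
        done
    done
}

# Example with three of the nodes: prints 6 commands.
pairwise_ssh_cmds merope electra atlas
```

Piping the output to sh runs the checks; any command that errors out or hangs points at the pair to look at.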

This sounds like a configuration problem on one of the nodes, or a problem
with ssh. I suspect it's not a problem with the number of processes, but
that whichever node is the 4th in your machinefile has a connectivity or
configuration issue:

I would try the following:

1. reorder the list of hosts in your machine file.
3. Change your machinefile to include 4 completely different hosts.

This does not seem to have any beneficial effect.

The test program run from the master (pleiades) with any combination of 3 other nodes hangs during communication. This includes not using --machinefile and using -host; i.e.

$ mpirun -host merope,electra,atlas -np 4 ./test.out (hangs)
$ mpirun -host merope,electra,atlas -np 3 ./test.out (hangs)
$ mpirun -host merope,electra -np 3 ./test.out
 node           1 : Hello world
 node           0 : Hello world
 node           2 : Hello world

2. Run the mpirun command from a different host. I'd try running it from
several different hosts.

The mpirun command does not seem to work when launched from one of the nodes. As an example:

Running on node asterope:

asterope$ mpirun -debug-daemons -host atlas,electra -np 4 ./test.out

Daemon was launched on atlas - beginning to initialize
Daemon was launched on electra - beginning to initialize
Daemon [[54956,0],1] checking in as pid 2716 on host atlas
Daemon [[54956,0],1] not using static ports
Daemon [[54956,0],2] checking in as pid 2741 on host electra
Daemon [[54956,0],2] not using static ports

(hangs)

I think someone else recommended that you should be specifying the
number of processes with -np. I second that.

If the above fails, you might want to post the machine file you're using.

The machine file is a simple list of hostnames, as an example:

m43
taygeta
asterope



Cheers,
Ethan

