Rolf vandeVaart wrote:
Ethan:

Can you run just "hostname" successfully? In other words, a non-MPI program. If that does not work, then we know the problem is in the runtime. If it does work, then the problem is in the way the MPI library is setting up its connections.

Interesting. I did not try this.

From the master:
$ mpirun -debug-daemons -host merope,asterope -np 2 hostname
asterope
merope

$ mpirun -host merope,asterope,electra -np 3 hostname
asterope
merope

(hangs)

$ mpirun -host electra,asterope,merope -np 3 hostname
asterope
electra

(hangs)

I cannot get 3 nodes to work together, although each node works as part of a pair. I can get three -processes- to work if I include the master:

$ mpirun -host pleiades,electra,asterope -np 3 hostname
pleiades
electra
asterope

But 4 processes do not:

$ mpirun -host pleiades,electra,asterope,merope -np 4 hostname
pleiades
electra
asterope

(hangs)
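
The manual trials above could be automated. The following is only a sketch, assuming GNU `timeout` is installed on the master; the `HOSTS`, `LIMIT`, and `DRY_RUN` variables and the function names are made up for illustration. It tries every 3-host combination and reports the ones that do not finish within the time limit.

```shell
#!/bin/sh
# Sketch: run "hostname" on every 3-host combination and flag the hangs.
# HOSTS, LIMIT and DRY_RUN are illustrative assumptions; adjust as needed.
HOSTS="${HOSTS:-merope asterope electra atlas}"
LIMIT="${LIMIT:-20}"      # seconds before a launch is treated as hung
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

# Print every unordered 3-host combination as "a,b,c".
triples() {
    i=0
    for a in $1; do
        i=$((i + 1)); j=0
        for b in $1; do
            j=$((j + 1)); k=0
            for c in $1; do
                k=$((k + 1))
                if [ "$i" -lt "$j" ] && [ "$j" -lt "$k" ]; then
                    echo "$a,$b,$c"
                fi
            done
        done
    done
}

# Launch (or, in dry-run mode, print) one mpirun per combination.
sweep() {
    triples "$1" | while read -r combo; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "timeout $LIMIT mpirun -host $combo -np 3 hostname"
            continue
        fi
        # stdin comes from /dev/null so mpirun cannot eat the loop's input
        timeout "$LIMIT" mpirun -host "$combo" -np 3 hostname \
            </dev/null >/dev/null 2>&1
        status=$?
        if [ "$status" -eq 124 ]; then   # 124 = killed by timeout
            echo "HANG $combo"
        else
            echo "exit=$status $combo"
        fi
    done
}

sweep "$HOSTS"
```

With the default `DRY_RUN=1` the script only prints the commands it would run; setting `DRY_RUN=0` performs the real launches. A host that appears in every `HANG` line would be the first one to look at.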

Is there more than one interface on the nodes?

Each node only has eth0, and a static DHCP address.
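
One way to confirm the interface layout on every node at once is sketched below. It assumes the `ip` tool (iproute2) is installed on the nodes and that passwordless ssh works from the master; `HOSTS`, `DRY_RUN`, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: list the non-loopback IPv4 interfaces of each node.
# HOSTS and DRY_RUN are illustrative assumptions.
HOSTS="${HOSTS:-merope asterope electra atlas}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them

# Filter "ip -o -4 addr show" output: print "name address/prefix"
# for every interface except the loopback.
non_loopback() {
    awk '$2 != "lo" { print $2, $4 }'
}

list_ifaces() {
    for h in $1; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "ssh -n -o BatchMode=yes $h ip -o -4 addr show"
        else
            echo "== $h"
            ssh -n -o BatchMode=yes "$h" ip -o -4 addr show | non_loopback
        fi
    done
}

list_ifaces "$HOSTS"
```

If any node prints more than one line under its name, it has more than one configured IPv4 interface.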

Could the problem be in the way I have the nodes set up? They boot via PXE from an image on the master, so they should all have the same basic filesystem.

Cheers,
Ethan


Rolf

On 09/21/10 14:41, Ethan Deneault wrote:
Prentice Bisbal wrote:


I'm assuming you already tested ssh connectivity and verified everything
is working as it should. (You did test all that, right?)

Yes. I am able to log in remotely to all nodes from the master, and from every node to every other node, without a password. Each node mounts the same /home directory from the master, so they have the same copy of all the ssh and rsh keys.
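
The node-to-node logins can be re-checked non-interactively with something like the sketch below. `HOSTS`, `DRY_RUN`, `SSH_OPTS`, and the function names are illustrative. `BatchMode=yes` makes ssh fail rather than prompt for a password, so a pair that silently depends on a prompt shows up as `FAIL`. The check only exercises ssh, so it says nothing about any other TCP ports the runtime may use.

```shell
#!/bin/sh
# Sketch: verify passwordless ssh for every ordered pair of hosts.
# HOSTS, DRY_RUN and SSH_OPTS are illustrative assumptions.
HOSTS="${HOSTS:-pleiades merope asterope electra}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them
SSH_OPTS="-o BatchMode=yes -o ConnectTimeout=5"

# Print every ordered pair "src dst" with src different from dst.
pairs() {
    for src in $1; do
        for dst in $1; do
            [ "$src" = "$dst" ] || echo "$src $dst"
        done
    done
}

# Hop master -> src -> dst and run "hostname" on dst.
check_pairs() {
    pairs "$1" | while read -r src dst; do
        if [ "$DRY_RUN" = 1 ]; then
            echo "ssh $SSH_OPTS $src ssh $SSH_OPTS $dst hostname"
        elif ssh -n $SSH_OPTS "$src" ssh $SSH_OPTS "$dst" hostname \
                >/dev/null 2>&1; then
            echo "ok   $src -> $dst"
        else
            echo "FAIL $src -> $dst"
        fi
    done
}

check_pairs "$HOSTS"
```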

This sounds like a configuration problem on one of the nodes, or a problem
with ssh. I suspect it's not a problem with the number of processes, but
that whichever node is the 4th in your machinefile has a connectivity or
configuration issue.

I would try the following:

1. Reorder the list of hosts in your machine file.
3. Change your machinefile to include 4 completely different hosts.

This does not seem to have any beneficial effect.

The test program, run from the master (pleiades) with any combination of 3 other nodes, hangs during communication. This happens with -host as well as with --machinefile; for example:

$ mpirun -host merope,electra,atlas -np 4 ./test.out (hangs)
$ mpirun -host merope,electra,atlas -np 3 ./test.out (hangs)
$ mpirun -host merope,electra -np 3 ./test.out
 node           1 : Hello world
 node           0 : Hello world
 node           2 : Hello world

2. Run the mpirun command from a different host. I'd try running it from
several different hosts.

The mpirun command does not seem to work when launched from one of the nodes. As an example:

Running on node asterope:

asterope$ mpirun -debug-daemons -host atlas,electra -np 4 ./test.out

Daemon was launched on atlas - beginning to initialize
Daemon was launched on electra - beginning to initialize
Daemon [[54956,0],1] checking in as pid 2716 on host atlas
Daemon [[54956,0],1] not using static ports
Daemon [[54956,0],2] checking in as pid 2741 on host electra
Daemon [[54956,0],2] not using static ports

(hangs)

I think someone else recommended that you should be specifying the
number of processes with -np. I second that.

If the above fails, you might want to post the machine file you're using.

The machine file is a simple list of hostnames, as an example:

m43
taygeta
asterope
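
Reordering a machine file like this one can be scripted. The sketch below assumes a newline-terminated file with one hostname per line and no blank lines; the function names and output paths are hypothetical. It writes one rotated copy per host, so that every host appears in every position.

```shell
#!/bin/sh
# Sketch: generate rotated copies of a machine file.
# Assumes one hostname per line, newline-terminated, no blank lines.

# Print the file with its first line moved to the end.
rotate() {
    tail -n +2 "$1"
    head -n 1 "$1"
}

# Write the rotations of file $1 to "$2.0", "$2.1", ... ("$2.0" is a copy).
write_rotations() {
    src=$1
    prefix=$2
    n=$(wc -l < "$src" | tr -d ' ')
    cp "$src" "$prefix.0"
    i=1
    while [ "$i" -lt "$n" ]; do
        rotate "$prefix.$((i - 1))" > "$prefix.$i"
        i=$((i + 1))
    done
}

# Usage (not executed here; paths are placeholders):
#   write_rotations machinefile /tmp/mf
#   for f in /tmp/mf.*; do
#       timeout 20 mpirun --machinefile "$f" -np 3 ./test.out </dev/null
#   done
```

If the same host is implicated in every rotation that hangs, that points at the host rather than at its position in the list.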



Cheers,
Ethan


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Dr. Ethan Deneault
Assistant Professor of Physics
SC-234
University of Tampa
Tampa, FL 33615
Office: (813) 257-3555
