Prentice Bisbal wrote:
Ethan Deneault wrote:
All,

I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into the
/usr/lib/openmpi/1.4-gcc/ directory. I know this is typically
/opt/openmpi, but Red Hat does things differently. I have my PATH and
LD_LIBRARY_PATH set correctly, since the test program does compile and
run.

The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is
an AMD x86_64 machine which serves the diskless node images and /home as
an NFS mount. I compile all of my programs as 32-bit.

My code is a simple hello world:
$ more test.f
      program test

      include 'mpif.h'
      integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)

      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      print*, 'node', rank, ': Hello world'
      call MPI_FINALIZE(ierror)
      end

If I run this program with:

$ mpirun --machinefile testfile ./test.out
 node           0 : Hello world
 node           2 : Hello world
 node           1 : Hello world

This is the expected output. Here, testfile contains the master node,
'pleiades', and two slave nodes, 'taygeta' and 'm43'.

If I add another machine to testfile, say 'asterope', the program hangs
until I press Ctrl-C. I have tried every machine, and as long as I do not
include more than 3 hosts, the program will not hang.

I have also run it with the --debug-daemons flag, and I don't see
anything specific that is wrong.


I'm assuming you already tested ssh connectivity and verified everything
is working as it should. (You did test all that, right?)
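
One way to check this for every host at once is a small loop over the
machinefile. This is only a sketch: the function name check_hosts and
the SSH override variable are made up for illustration, and it assumes
the machinefile has one host per line with the host name as the first
field.

```shell
# check_hosts FILE: attempt a non-interactive ssh login to each host in FILE.
# SSH is a hypothetical override hook; it defaults to the real ssh client.
check_hosts() {
    machinefile=$1
    while read -r host _rest; do
        # Skip blank lines and comment lines in the machinefile.
        case $host in ''|'#'*) continue ;; esac
        # -n keeps ssh from swallowing the rest of the machinefile on stdin;
        # BatchMode makes ssh fail instead of prompting for a password.
        if ${SSH:-ssh} -n -o BatchMode=yes -o ConnectTimeout=5 \
               "$host" hostname >/dev/null 2>&1; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done < "$machinefile"
}
```

Running "check_hosts testfile" from the master, and then again from one
of the nodes, should show quickly whether any host refuses passwordless
logins.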

This sounds like a configuration problem on one of the nodes, or a
problem with ssh. I suspect it's not a problem with the number of
processes, but that whichever node is 4th in your machinefile has a
connectivity or configuration issue.

I would try the following:

1. Reorder the list of hosts in your machine file.

2. Run the mpirun command from a different host. I'd try running it from
several different hosts.

3. Change your machinefile to include 4 completely different hosts.

I think someone else recommended that you should be specifying the
number of processes with -np. I second that.
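
As a rough sketch of what that could look like, the helper below counts
the hosts in the machinefile and passes that count to mpirun with -np.
The function name run_with_np and the MPIRUN override variable are
hypothetical; it assumes one host per line and one process per host.

```shell
# run_with_np FILE PROGRAM [ARGS...]: launch one process per listed host.
# MPIRUN is a hypothetical override hook; it defaults to the real mpirun.
run_with_np() {
    machinefile=$1
    shift
    # Count the lines that are neither blank nor comments.
    np=$(grep -Ecv '^[[:space:]]*(#|$)' "$machinefile")
    ${MPIRUN:-mpirun} -np "$np" --machinefile "$machinefile" "$@"
}
```

For example, "run_with_np testfile ./test.out" runs the test program,
and "run_with_np testfile hostname" runs the ordinary hostname command
instead, which is a quick way to see which nodes actually answer.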

If the above fails, you might want to post the machine file you're using.


Hi Ethan

What your program prints is the process rank, not the host name.
To make sure all nodes are responding, you can try this:

http://www.open-mpi.org/faq/?category=running#mpirun-host

For the hostfile/machinefile structure,
including the number of slots/cores/processors, see "man mpiexec".
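
For illustration, a hostfile that states the slot count explicitly might
look like the one written below. The file name testfile.slots and the
slots=1 values are assumptions (one process per node), and asterope is
the fourth host from the original post.

```shell
# Write a hostfile with one slot (process) per node. Adjust slots= to the
# number of processes you actually want on each host.
cat > testfile.slots <<'EOF'
pleiades slots=1
taygeta  slots=1
m43      slots=1
asterope slots=1
EOF
```

It would then be used as
"mpirun -np 4 --hostfile testfile.slots ./test.out".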

The OpenMPI FAQ has answers for many of these initial setup questions.
It is worth taking a look.

I hope it helps,
Gus Correa
