Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
Prentice Bisbal wrote: Ashley Pittman wrote: This smacks of a firewall issue. I thought you'd said you weren't using one, but now that I read back your emails I can't see anywhere where you say that. Are you running a firewall or any iptables rules on any of the nodes? It looks to me like you may have some setup on the worker nodes. Ashley. I agree with Ashley. To make sure it's not an iptables or SELinux problem on one of the nodes, run these two commands on all the nodes and then try again: service iptables stop setenforce 0 This fix worked. Delving deeper, it turns out that there was a typo in the iptables file for the nodes: they were accepting all traffic on eth1 instead of eth0. Only the master has an eth1 port. When I checked the tables earlier, I didn't notice the discrepancy. Thank you all so much! Cheers, Ethan -- Dr. Ethan Deneault Assistant Professor of Physics SC-234 University of Tampa Tampa, FL 33615 Office: (813) 257-3555
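[Editor's note] For readers who hit the same symptom: the node-side mistake Ethan describes would look roughly like the fragment below in /etc/sysconfig/iptables in the node image. The surrounding rules are illustrative, not his actual file; only the eth1-vs-eth0 interface mismatch is from the thread.

```
# /etc/sysconfig/iptables (node image) -- illustrative sketch only
*filter
:INPUT ACCEPT [0:0]
# Broken: the diskless nodes have no eth1, so this rule never matched and
# cluster traffic arriving on eth0 fell through to the final REJECT
# -A INPUT -i eth1 -j ACCEPT
# Fixed: accept traffic on the interface the nodes actually use
-A INPUT -i eth0 -j ACCEPT
-A INPUT -j REJECT --reject-with icmp-host-prohibited
COMMIT
```

After editing the file, `service iptables restart` reloads the rules.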
Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
Ashley Pittman wrote: > This smacks of a firewall issue. I thought you'd said you weren't using one > but now that I read back your emails I can't see anywhere where you say that. Are > you running a firewall or any iptables rules on any of the nodes? It looks > to me like you may have some setup on the worker nodes. > > Ashley. > I agree with Ashley. To make sure it's not an iptables or SELinux problem on one of the nodes, run these two commands on all the nodes and then try again: service iptables stop setenforce 0 -- Prentice
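[Editor's note] To apply both commands across the whole cluster in one pass, a small ssh loop works. The hostnames below are the ones mentioned in this thread; the dry-run default is just a safety habit, not part of Prentice's suggestion.

```shell
# Sketch: disable iptables and SELinux enforcement on every worker node,
# per Prentice's suggestion. Pass "echo" to preview, "ssh" to execute.
disable_fw() {
    runner="$1"   # "ssh" for real runs, "echo" for a dry run
    for node in taygeta m43 asterope merope electra atlas; do
        "$runner" "$node" "service iptables stop; setenforce 0"
    done
}
disable_fw echo   # preview the commands before running them for real
```

Both changes are temporary (lost on reboot), which makes this a safe diagnostic: if the hang disappears afterwards, the firewall or SELinux is the culprit.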
Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
This smacks of a firewall issue. I thought you'd said you weren't using one, but now that I read back your emails I can't see anywhere where you say that. Are you running a firewall or any iptables rules on any of the nodes? It looks to me like you may have some setup on the worker nodes. Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
Rolf vandeVaart wrote: Ethan: Can you run just "hostname" successfully? In other words, a non-MPI program. If that does not work, then we know the problem is in the runtime. If it does work, then there is something wrong with the way the MPI library is setting up its connections. Interesting. I did not try this. From the master: $ mpirun -debug-daemons -host merope,asterope -np 2 hostname asterope merope $ mpirun -host merope,asterope,electra -np 3 hostname asterope merope (hangs) $ mpirun -host electra,asterope,merope -np 3 hostname asterope electra (hangs) I cannot get 3 nodes to work together. Each node does work as part of a pair. I can get three -processes- to work, if I include the master: $ mpirun -host pleiades,electra,asterope -np 3 hostname pleiades electra asterope But 4 processes do not: $ mpirun -host pleiades,electra,asterope,merope -np 4 hostname pleiades electra asterope (hangs) Is there more than one interface on the nodes? Each node only has eth0, and a static DHCP address. Is there something in the way that I have the nodes set up? They boot via PXE from an image on the master, so they should all have the same basic filesystem. Cheers, Ethan Rolf On 09/21/10 14:41, Ethan Deneault wrote: Prentice Bisbal wrote: I'm assuming you already tested ssh connectivity and verified everything is working as it should. (You did test all that, right?) Yes. I am able to log in remotely to all nodes from the master, and to each node from each node without a password. Each node mounts the same /home directory from the master, so they have the same copy of all the ssh and rsh keys. This sounds like a configuration problem on one of the nodes, or a problem with ssh. I suspect it's not a problem with the number of processes, but whichever node is the 4th in your machinefile has a connectivity or configuration issue. I would try the following: 1. Reorder the list of hosts in your machine file. 3. Change your machinefile to include 4 completely different hosts. 
This does not seem to have any beneficial effect. The test program run from the master (pleiades) with any combination of 3 other nodes hangs during communication. This includes not using --machinefile and using -host; i.e. $ mpirun -host merope,electra,atlas -np 4 ./test.out (hangs) $ mpirun -host merope,electra,atlas -np 3 ./test.out (hangs) $ mpirun -host merope,electra -np 3 ./test.out node 1 : Hello world node 0 : Hello world node 2 : Hello world 2. Run the mpirun command from a different host. I'd try running it from several different hosts. The mpirun command does not seem to work when launched from one of the nodes. As an example: Running on node asterope: asterope$ mpirun -debug-daemons -host atlas,electra -np 4 ./test.out Daemon was launched on atlas - beginning to initialize Daemon was launched on electra - beginning to initialize Daemon [[54956,0],1] checking in as pid 2716 on host atlas Daemon [[54956,0],1] not using static ports Daemon [[54956,0],2] checking in as pid 2741 on host electra Daemon [[54956,0],2] not using static ports (hangs) I think someone else recommended that you should be specifying the number of processes with -np. I second that. If the above fails, you might want to post the machine file you're using. The machine file is a simple list of hostnames, as an example: m43 taygeta asterope Cheers, Ethan ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Dr. Ethan Deneault Assistant Professor of Physics SC-234 University of Tampa Tampa, FL 33615 Office: (813) 257-3555
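[Editor's note] One plausible reading of the pattern above: with a single remote node, the daemon only needs to reach mpirun back on the master, but with more nodes Open MPI also opens TCP connections directly between worker nodes, which a node-side firewall can silently drop. This would explain why every pair works but every triple of workers hangs. A crude node-to-node reachability sketch, using bash's /dev/tcp and the thread's hostnames:

```shell
#!/bin/bash
# Sketch: check that each worker can open a TCP connection to the others,
# not just to the master. Port 22 (ssh) is a stand-in only -- Open MPI's
# daemons use ephemeral ports, so iptables can pass ssh yet block MPI.
reachable() {  # reachable <host> <port>: exit 0 if a TCP connect succeeds
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}
for node in merope electra atlas; do
    if reachable "$node" 22; then
        echo "$node: port 22 reachable"
    else
        echo "$node: UNREACHABLE -- check iptables on $node"
    fi
done
```

A reachable port 22 is necessary but not sufficient; the decisive test is whether the hang disappears with iptables stopped on the workers.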
Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
Ethan: Can you run just "hostname" successfully? In other words, a non-MPI program. If that does not work, then we know the problem is in the runtime. If it does work, then there is something wrong with the way the MPI library is setting up its connections. Is there more than one interface on the nodes? Rolf On 09/21/10 14:41, Ethan Deneault wrote: Prentice Bisbal wrote: I'm assuming you already tested ssh connectivity and verified everything is working as it should. (You did test all that, right?) Yes. I am able to log in remotely to all nodes from the master, and to each node from each node without a password. Each node mounts the same /home directory from the master, so they have the same copy of all the ssh and rsh keys. This sounds like a configuration problem on one of the nodes, or a problem with ssh. I suspect it's not a problem with the number of processes, but whichever node is the 4th in your machinefile has a connectivity or configuration issue. I would try the following: 1. Reorder the list of hosts in your machine file. 3. Change your machinefile to include 4 completely different hosts. This does not seem to have any beneficial effect. The test program run from the master (pleiades) with any combination of 3 other nodes hangs during communication. This includes not using --machinefile and using -host; i.e. $ mpirun -host merope,electra,atlas -np 4 ./test.out (hangs) $ mpirun -host merope,electra,atlas -np 3 ./test.out (hangs) $ mpirun -host merope,electra -np 3 ./test.out node 1 : Hello world node 0 : Hello world node 2 : Hello world 2. Run the mpirun command from a different host. I'd try running it from several different hosts. The mpirun command does not seem to work when launched from one of the nodes. 
As an example: Running on node asterope: asterope$ mpirun -debug-daemons -host atlas,electra -np 4 ./test.out Daemon was launched on atlas - beginning to initialize Daemon was launched on electra - beginning to initialize Daemon [[54956,0],1] checking in as pid 2716 on host atlas Daemon [[54956,0],1] not using static ports Daemon [[54956,0],2] checking in as pid 2741 on host electra Daemon [[54956,0],2] not using static ports (hangs) I think someone else recommended that you should be specifying the number of processes with -np. I second that. If the above fails, you might want to post the machine file you're using. The machine file is a simple list of hostnames, as an example: m43 taygeta asterope Cheers, Ethan
Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
Prentice Bisbal wrote: I'm assuming you already tested ssh connectivity and verified everything is working as it should. (You did test all that, right?) Yes. I am able to log in remotely to all nodes from the master, and to each node from each node without a password. Each node mounts the same /home directory from the master, so they have the same copy of all the ssh and rsh keys. This sounds like a configuration problem on one of the nodes, or a problem with ssh. I suspect it's not a problem with the number of processes, but whichever node is the 4th in your machinefile has a connectivity or configuration issue. I would try the following: 1. Reorder the list of hosts in your machine file. 3. Change your machinefile to include 4 completely different hosts. This does not seem to have any beneficial effect. The test program run from the master (pleiades) with any combination of 3 other nodes hangs during communication. This includes not using --machinefile and using -host; i.e. $ mpirun -host merope,electra,atlas -np 4 ./test.out (hangs) $ mpirun -host merope,electra,atlas -np 3 ./test.out (hangs) $ mpirun -host merope,electra -np 3 ./test.out node 1 : Hello world node 0 : Hello world node 2 : Hello world 2. Run the mpirun command from a different host. I'd try running it from several different hosts. The mpirun command does not seem to work when launched from one of the nodes. As an example: Running on node asterope: asterope$ mpirun -debug-daemons -host atlas,electra -np 4 ./test.out Daemon was launched on atlas - beginning to initialize Daemon was launched on electra - beginning to initialize Daemon [[54956,0],1] checking in as pid 2716 on host atlas Daemon [[54956,0],1] not using static ports Daemon [[54956,0],2] checking in as pid 2741 on host electra Daemon [[54956,0],2] not using static ports (hangs) I think someone else recommended that you should be specifying the number of processes with -np. I second that. 
If the above fails, you might want to post the machine file you're using. The machine file is a simple list of hostnames, as an example: m43 taygeta asterope Cheers, Ethan -- Dr. Ethan Deneault Assistant Professor of Physics SC-234 University of Tampa Tampa, FL 33615 Office: (813) 257-3555
Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
Prentice Bisbal wrote: Ethan Deneault wrote: All, I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into the /usr/lib/openmpi/1.4-gcc/ directory. I know this is typically /opt/openmpi, but Red Hat does things differently. I have my PATH and LD_LIBRARY_PATH set correctly, because the test program does compile and run. The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is an AMD x86_64 machine which serves the diskless node images and /home as an NFS mount. I compile all of my programs as 32-bit. My code is a simple hello world: $ more test.f program test include 'mpif.h' integer rank, size, ierror, tag, status(MPI_STATUS_SIZE) call MPI_INIT(ierror) call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror) call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror) print*, 'node', rank, ': Hello world' call MPI_FINALIZE(ierror) end If I run this program with: $ mpirun --machinefile testfile ./test.out node 0 : Hello world node 2 : Hello world node 1 : Hello world This is the expected output. Here, testfile contains the master node: 'pleiades', and two slave nodes: 'taygeta' and 'm43' If I add another machine to testfile, say 'asterope', it hangs until I ctrl-c it. I have tried every machine, and as long as I do not include more than 3 hosts, the program will not hang. I have run the debug-daemons flag with it as well, and I don't see what is wrong specifically. I'm assuming you already tested ssh connectivity and verified everything is working as it should. (You did test all that, right?) This sounds like a configuration problem on one of the nodes, or a problem with ssh. I suspect it's not a problem with the number of processes, but whichever node is the 4th in your machinefile has a connectivity or configuration issue. I would try the following: 1. Reorder the list of hosts in your machine file. 2. Run the mpirun command from a different host. I'd try running it from several different hosts. 3. 
Change your machinefile to include 4 completely different hosts. I think someone else recommended that you should be specifying the number of processes with -np. I second that. If the above fails, you might want to post the machine file you're using. Hi Ethan What your program prints is the process rank, not the host name. To make sure all nodes are responding, you can try this: http://www.open-mpi.org/faq/?category=running#mpirun-host For the hostfile/machinefile structure, including the number of slots/cores/processors, see "man mpiexec". The Open MPI FAQ has answers for many of these initial setup questions. Worth taking a look. I hope it helps, Gus Correa
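[Editor's note] Following Gus's pointer to "man mpiexec": an Open MPI machinefile can also carry a per-host slot count, which makes the mapping of -np processes to hosts explicit. A minimal example using this thread's hostnames; the slot counts are assumptions (one per CPU, two on the master), not from the thread:

```
# machinefile: one host per line; slots = processes to place on that host
pleiades slots=2
taygeta slots=1
m43 slots=1
asterope slots=1
```

By default Open MPI fills slots in file order, so `mpirun --machinefile testfile -np 4 ./test.out` would place two ranks on pleiades before using the workers.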
Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
Ethan Deneault wrote: > All, > > I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into the > /usr/lib/openmpi/1.4-gcc/ directory. I know this is typically > /opt/openmpi, but Red Hat does things differently. I have my PATH and > LD_LIBRARY_PATH set correctly, because the test program does compile and > run. > > The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is > an AMD x86_64 machine which serves the diskless node images and /home as > an NFS mount. I compile all of my programs as 32-bit. > > My code is a simple hello world: > $ more test.f > program test > > include 'mpif.h' > integer rank, size, ierror, tag, status(MPI_STATUS_SIZE) > > call MPI_INIT(ierror) > call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror) > call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror) > print*, 'node', rank, ': Hello world' > call MPI_FINALIZE(ierror) > end > > If I run this program with: > > $ mpirun --machinefile testfile ./test.out > node 0 : Hello world > node 2 : Hello world > node 1 : Hello world > > This is the expected output. Here, testfile contains the master node: > 'pleiades', and two slave nodes: 'taygeta' and 'm43' > > If I add another machine to testfile, say 'asterope', it hangs until I > ctrl-c it. I have tried every machine, and as long as I do not include > more than 3 hosts, the program will not hang. > > I have run the debug-daemons flag with it as well, and I don't see what > is wrong specifically. > I'm assuming you already tested ssh connectivity and verified everything is working as it should. (You did test all that, right?) This sounds like a configuration problem on one of the nodes, or a problem with ssh. I suspect it's not a problem with the number of processes, but whichever node is the 4th in your machinefile has a connectivity or configuration issue. I would try the following: 1. Reorder the list of hosts in your machine file. 2. Run the mpirun command from a different host. I'd try running it from several different hosts. 3. 
Change your machinefile to include 4 completely different hosts. I think someone else recommended that you should be specifying the number of processes with -np. I second that. If the above fails, you might want to post the machine file you're using. -- Prentice
Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
David, I did try that after I sent the original mail, but the -np 4 flag doesn't fix the problem; the program still hangs. I've also double-checked the iptables for the image and for the master node, and all ports are set to accept. Cheers, Ethan -- Dr. Ethan Deneault Assistant Professor of Physics SC 234 University of Tampa Tampa, FL 33606 -Original Message- From: users-boun...@open-mpi.org on behalf of David Zhang Sent: Mon 9/20/2010 9:58 PM To: Open MPI Users Subject: Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes. I don't know if this will help, but try mpirun --machinefile testfile -np 4 ./test.out for running 4 processes On Mon, Sep 20, 2010 at 3:00 PM, Ethan Deneault wrote: > All, > > I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into the > /usr/lib/openmpi/1.4-gcc/ directory. I know this is typically /opt/openmpi, > but Red Hat does things differently. I have my PATH and LD_LIBRARY_PATH set > correctly, because the test program does compile and run. > > The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is an > AMD x86_64 machine which serves the diskless node images and /home as an NFS > mount. I compile all of my programs as 32-bit. > > My code is a simple hello world: > $ more test.f > program test > > include 'mpif.h' > integer rank, size, ierror, tag, status(MPI_STATUS_SIZE) > > call MPI_INIT(ierror) > call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror) > call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror) > print*, 'node', rank, ': Hello world' > call MPI_FINALIZE(ierror) > end > > If I run this program with: > > $ mpirun --machinefile testfile ./test.out > node 0 : Hello world > node 2 : Hello world > node 1 : Hello world > > This is the expected output. Here, testfile contains the master node: > 'pleiades', and two slave nodes: 'taygeta' and 'm43' > > If I add another machine to testfile, say 'asterope', it hangs until I > ctrl-c it. 
I have tried every machine, and as long as I do not include more > than 3 hosts, the program will not hang. > > I have run the debug-daemons flag with it as well, and I don't see what is > wrong specifically. > > Working output: pleiades (master) and 2 nodes. > > $ mpirun --debug-daemons --machinefile testfile ./test.out > Daemon was launched on m43 - beginning to initialize > Daemon was launched on taygeta - beginning to initialize > Daemon [[46344,0],2] checking in as pid 2140 on host m43 > Daemon [[46344,0],2] not using static ports > [m43:02140] [[46344,0],2] orted: up and running - waiting for commands! > [pleiades:19178] [[46344,0],0] node[0].name pleiades daemon 0 arch ffca0200 > [pleiades:19178] [[46344,0],0] node[1].name taygeta daemon 1 arch ffca0200 > [pleiades:19178] [[46344,0],0] node[2].name m43 daemon 2 arch ffca0200 > [pleiades:19178] [[46344,0],0] orted_cmd: received add_local_procs > [m43:02140] [[46344,0],2] node[0].name pleiades daemon 0 arch ffca0200 > [m43:02140] [[46344,0],2] node[1].name taygeta daemon 1 arch ffca0200 > [m43:02140] [[46344,0],2] node[2].name m43 daemon 2 arch ffca0200 > [m43:02140] [[46344,0],2] orted_cmd: received add_local_procs > Daemon [[46344,0],1] checking in as pid 2317 on host taygeta > Daemon [[46344,0],1] not using static ports > [taygeta:02317] [[46344,0],1] orted: up and running - waiting for commands! 
> [taygeta:02317] [[46344,0],1] node[0].name pleiades daemon 0 arch ffca0200 > [taygeta:02317] [[46344,0],1] node[1].name taygeta daemon 1 arch ffca0200 > [taygeta:02317] [[46344,0],1] node[2].name m43 daemon 2 arch ffca0200 > [taygeta:02317] [[46344,0],1] orted_cmd: received add_local_procs > [pleiades:19178] [[46344,0],0] orted_recv: received sync+nidmap from local > proc [[46344,1],0] > [m43:02140] [[46344,0],2] orted_recv: received sync+nidmap from local proc > [[46344,1],2] > [taygeta:02317] [[46344,0],1] orted_recv: received sync+nidmap from local > proc [[46344,1],1] > [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd > [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd > [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd > [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd > [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd > [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs > [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs > [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs > [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd > [m4
Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
I don't know if this will help, but try mpirun --machinefile testfile -np 4 ./test.out for running 4 processes On Mon, Sep 20, 2010 at 3:00 PM, Ethan Deneault wrote: > All, > > I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into the > /usr/lib/openmpi/1.4-gcc/ directory. I know this is typically /opt/openmpi, > but Red Hat does things differently. I have my PATH and LD_LIBRARY_PATH set > correctly, because the test program does compile and run. > > The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is an > AMD x86_64 machine which serves the diskless node images and /home as an NFS > mount. I compile all of my programs as 32-bit. > > My code is a simple hello world: > $ more test.f > program test > > include 'mpif.h' > integer rank, size, ierror, tag, status(MPI_STATUS_SIZE) > > call MPI_INIT(ierror) > call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror) > call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror) > print*, 'node', rank, ': Hello world' > call MPI_FINALIZE(ierror) > end > > If I run this program with: > > $ mpirun --machinefile testfile ./test.out > node 0 : Hello world > node 2 : Hello world > node 1 : Hello world > > This is the expected output. Here, testfile contains the master node: > 'pleiades', and two slave nodes: 'taygeta' and 'm43' > > If I add another machine to testfile, say 'asterope', it hangs until I > ctrl-c it. I have tried every machine, and as long as I do not include more > than 3 hosts, the program will not hang. > > I have run the debug-daemons flag with it as well, and I don't see what is > wrong specifically. > > Working output: pleiades (master) and 2 nodes. 
> > $ mpirun --debug-daemons --machinefile testfile ./test.out > Daemon was launched on m43 - beginning to initialize > Daemon was launched on taygeta - beginning to initialize > Daemon [[46344,0],2] checking in as pid 2140 on host m43 > Daemon [[46344,0],2] not using static ports > [m43:02140] [[46344,0],2] orted: up and running - waiting for commands! > [pleiades:19178] [[46344,0],0] node[0].name pleiades daemon 0 arch ffca0200 > [pleiades:19178] [[46344,0],0] node[1].name taygeta daemon 1 arch ffca0200 > [pleiades:19178] [[46344,0],0] node[2].name m43 daemon 2 arch ffca0200 > [pleiades:19178] [[46344,0],0] orted_cmd: received add_local_procs > [m43:02140] [[46344,0],2] node[0].name pleiades daemon 0 arch ffca0200 > [m43:02140] [[46344,0],2] node[1].name taygeta daemon 1 arch ffca0200 > [m43:02140] [[46344,0],2] node[2].name m43 daemon 2 arch ffca0200 > [m43:02140] [[46344,0],2] orted_cmd: received add_local_procs > Daemon [[46344,0],1] checking in as pid 2317 on host taygeta > Daemon [[46344,0],1] not using static ports > [taygeta:02317] [[46344,0],1] orted: up and running - waiting for commands! 
> [taygeta:02317] [[46344,0],1] node[0].name pleiades daemon 0 arch ffca0200 > [taygeta:02317] [[46344,0],1] node[1].name taygeta daemon 1 arch ffca0200 > [taygeta:02317] [[46344,0],1] node[2].name m43 daemon 2 arch ffca0200 > [taygeta:02317] [[46344,0],1] orted_cmd: received add_local_procs > [pleiades:19178] [[46344,0],0] orted_recv: received sync+nidmap from local > proc [[46344,1],0] > [m43:02140] [[46344,0],2] orted_recv: received sync+nidmap from local proc > [[46344,1],2] > [taygeta:02317] [[46344,0],1] orted_recv: received sync+nidmap from local > proc [[46344,1],1] > [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd > [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd > [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd > [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd > [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd > [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs > [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs > [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs > [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd > [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd > [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd > [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd > [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs > [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd > [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs > [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs > node 0 : Hello world > [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd > node 2 : Hello world > node 1 : Hello world > [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd > [pleiades:19178] [[46344,0],0] orted_cmd: received 
collective data cmd > [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs > [taygeta:02317] [[46344,0],1] orte
[OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.
All, I am running Scientific Linux 5.5, with OpenMPI 1.4 installed into the /usr/lib/openmpi/1.4-gcc/ directory. I know this is typically /opt/openmpi, but Red Hat does things differently. I have my PATH and LD_LIBRARY_PATH set correctly, because the test program does compile and run. The cluster consists of 10 Intel Pentium 4 diskless nodes. The master is an AMD x86_64 machine which serves the diskless node images and /home as an NFS mount. I compile all of my programs as 32-bit. My code is a simple hello world: $ more test.f program test include 'mpif.h' integer rank, size, ierror, tag, status(MPI_STATUS_SIZE) call MPI_INIT(ierror) call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror) call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror) print*, 'node', rank, ': Hello world' call MPI_FINALIZE(ierror) end If I run this program with: $ mpirun --machinefile testfile ./test.out node 0 : Hello world node 2 : Hello world node 1 : Hello world This is the expected output. Here, testfile contains the master node: 'pleiades', and two slave nodes: 'taygeta' and 'm43' If I add another machine to testfile, say 'asterope', it hangs until I ctrl-c it. I have tried every machine, and as long as I do not include more than 3 hosts, the program will not hang. I have run the debug-daemons flag with it as well, and I don't see what is wrong specifically. Working output: pleiades (master) and 2 nodes. $ mpirun --debug-daemons --machinefile testfile ./test.out Daemon was launched on m43 - beginning to initialize Daemon was launched on taygeta - beginning to initialize Daemon [[46344,0],2] checking in as pid 2140 on host m43 Daemon [[46344,0],2] not using static ports [m43:02140] [[46344,0],2] orted: up and running - waiting for commands! 
[pleiades:19178] [[46344,0],0] node[0].name pleiades daemon 0 arch ffca0200 [pleiades:19178] [[46344,0],0] node[1].name taygeta daemon 1 arch ffca0200 [pleiades:19178] [[46344,0],0] node[2].name m43 daemon 2 arch ffca0200 [pleiades:19178] [[46344,0],0] orted_cmd: received add_local_procs [m43:02140] [[46344,0],2] node[0].name pleiades daemon 0 arch ffca0200 [m43:02140] [[46344,0],2] node[1].name taygeta daemon 1 arch ffca0200 [m43:02140] [[46344,0],2] node[2].name m43 daemon 2 arch ffca0200 [m43:02140] [[46344,0],2] orted_cmd: received add_local_procs Daemon [[46344,0],1] checking in as pid 2317 on host taygeta Daemon [[46344,0],1] not using static ports [taygeta:02317] [[46344,0],1] orted: up and running - waiting for commands! [taygeta:02317] [[46344,0],1] node[0].name pleiades daemon 0 arch ffca0200 [taygeta:02317] [[46344,0],1] node[1].name taygeta daemon 1 arch ffca0200 [taygeta:02317] [[46344,0],1] node[2].name m43 daemon 2 arch ffca0200 [taygeta:02317] [[46344,0],1] orted_cmd: received add_local_procs [pleiades:19178] [[46344,0],0] orted_recv: received sync+nidmap from local proc [[46344,1],0] [m43:02140] [[46344,0],2] orted_recv: received sync+nidmap from local proc [[46344,1],2] [taygeta:02317] [[46344,0],1] orted_recv: received sync+nidmap from local proc [[46344,1],1] [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd 
[pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs node 0 : Hello world [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd node 2 : Hello world node 1 : Hello world [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd [pleiades:19178] [[46344,0],0] orted_cmd: received collective data cmd [pleiades:19178] [[46344,0],0] orted_cmd: received message_local_procs [taygeta:02317] [[46344,0],1] orted_cmd: received collective data cmd [taygeta:02317] [[46344,0],1] orted_cmd: received message_local_procs [m43:02140] [[46344,0],2] orted_cmd: received collective data cmd [m43:02140] [[46344,0],2] orted_cmd: received message_local_procs [pleiades:19178] [[46344,0],0] orted_recv: received sync from local proc [[46344,1],0