I am having trouble running my MPI program on multiple nodes. I can run multiple processes on a single node, and I can spawn processes on remote nodes, but when I call Send from a remote node, the call never returns, even though there is a matching Recv waiting. I'm pretty sure this is an issue with my configuration, not my code. I've tried some other sample programs I found and had the same problem: they hang on a send from one host to another.
Here's an in-depth description: I wrote a quick test program where each process with rank > 0 sends an int to the master (rank 0), and the master receives until it gets something from every other process. My test program works fine when I run multiple processes on a single machine, either the local node:

$ ./mpirun -n 4 ./mpi-test
Hi I'm localhost:2
Hi I'm localhost:1
localhost:1 sending 11...
localhost:2 sending 12...
localhost:2 sent 12
localhost:1 sent 11
Hi I'm localhost:0
localhost:0 received 11 from 1
localhost:0 received 12 from 2
Hi I'm localhost:3
localhost:3 sending 13...
localhost:3 sent 13
localhost:0 received 13 from 3
all workers checked in!

or a remote one:

$ ./mpirun -np 2 -host remotehost ./mpi-test
Hi I'm remotehost:0
remotehost:0 received 11 from 1
all workers checked in!
Hi I'm remotehost:1
remotehost:1 sending 11...
remotehost:1 sent 11

But when I try to run the master locally and the worker(s) remotely (this is the way I am actually interested in running it), Send never returns and it hangs indefinitely.

$ ./mpirun -np 2 -host localhost,remotehost ./mpi-test
Hi I'm localhost:0
Hi I'm remotehost:1
remotehost:1 sending 11...

Just to see if it would work, I tried spawning the master on the remotehost and the worker on the localhost.

$ ./mpirun -np 2 -host remotehost,localhost ./mpi-test
Hi I'm localhost:1
localhost:1 sending 11...
localhost:1 sent 11
Hi I'm remotehost:0
remotehost:0 received 0 from 1
all workers checked in!

It doesn't hang on Send, but the wrong value is received (0 instead of 11).

Any idea what's going on? I've attached my code, my config.log, ifconfig output, and ompi_info output.

Thanks,
Keith
mpi.tgz
Description: GNU Zip compressed data